ARM Project Trillium NPU Details
After taking a measured, wait-and-see approach to machine learning, Arm is ready to jump in at full throttle. Today, the company is following up on its Project Trillium announcement from February with new details about its machine learning processor, also known as the neural-network processing unit or NPU.
To quickly recap, Project Trillium encompasses Arm’s push to leverage machine learning and neural network operations across its line-up of processors, whether that processing takes place on the CPU, the GPU, or the new NPU.
Arm’s goal is to make machine learning feasible “on the edge.” The edge largely describes standalone devices such as smartphones and other mobile devices, as opposed to a centralized datacenter approach. Moving ML workloads to the edge brings tangible benefits in the form of reduced bandwidth requirements, decreased power demands, and the associated lower costs. It also removes a significant amount of latency from the equation as data no longer needs to leave the device. Keeping data on the device can also improve reliability and security.
Neural network workloads, however, are computationally expensive on traditional CPU and GPU architectures. This is not to say CPUs and GPUs are unfit for these roles in all circumstances, but many ML-heavy processes stand to benefit significantly from a dedicated, purpose-built processor.
Arm’s machine learning processor is built upon a brand-new architecture for neural networks. Arm is setting its sights on mobile first, but the architecture is designed to be highly scalable and will eventually span devices from the Internet of Things to the datacenter.
At a high level, Arm’s machine learning processor consists of a micro-controller and direct memory access (DMA) engine to oversee scheduling on the neural network, while up to 16 compute engines tackle the actual processing workloads.
Arm’s design goals manifest in its key architectural features: efficient convolution, efficient data movement, and sufficient programmability. Arm measures efficiency in terms of power and area, both of which are paramount given the many constraints of mobile form factors. As a result, Arm’s convolution computations are centered around 8-bit datatypes, which have emerged as a standard in machine learning applications.
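To make the 8-bit point concrete, here is a minimal Python sketch of a quantized convolution. It is not Arm's implementation. The symmetric scaling scheme, the 1-D shape, and the function names (`quantize`, `conv1d_int8`, `dequantize`) are all assumptions chosen for illustration. The idea it shows is the general one: values are stored as signed 8-bit integers, products are summed in a wider integer accumulator, and the result is scaled back to real values at the end.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers with a symmetric scale (assumed scheme)."""
    quantized = []
    for v in values:
        q = round(v / scale)
        # Clamp to the int8 range so out-of-range values saturate.
        quantized.append(max(-128, min(127, q)))
    return quantized


def conv1d_int8(x_q, w_q):
    """'Valid' 1-D convolution (no kernel flip, as is usual in ML) on int8 inputs.

    Products of two int8 values need more than 8 bits, so the running sum is
    kept in a wider integer; Python ints stand in for that accumulator here.
    """
    outputs = []
    for i in range(len(x_q) - len(w_q) + 1):
        acc = 0
        for k, w in enumerate(w_q):
            acc += x_q[i + k] * w
        outputs.append(acc)
    return outputs


def dequantize(acc, x_scale, w_scale):
    """Convert integer accumulators back to real values."""
    return [a * x_scale * w_scale for a in acc]


if __name__ == "__main__":
    x_scale, w_scale = 0.25, 0.5  # hypothetical scales for this example
    x_q = quantize([0.5, -1.0, 0.25, 1.0], x_scale)
    w_q = quantize([1.0, -0.5], w_scale)
    print(dequantize(conv1d_int8(x_q, w_q), x_scale, w_scale))
```

With these particular inputs every value is exactly representable at the chosen scales, so the quantized result matches the floating-point convolution. In general, 8-bit quantization introduces a small rounding error in exchange for cheaper arithmetic and less data to move.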
Let's dive a little deeper and discuss some additional features next...