Intel Tremont CPU Microarchitecture: Power Efficient, High-Performance x86
Intel Tremont Microarchitecture - Targeting Clients, Data Centers, 5G Networking, and IoT
Late last year, at its Architecture Day event, Intel revealed a new, low-power microarchitecture, codenamed Tremont, that would power an array of processors and SoCs targeting products across the client, data center, 5G networking, and Internet of Things markets. While Intel did disclose the codename and show off a Foveros-based SoC featuring Tremont -- codenamed Lakefield -- it did not dive deep on the microarchitecture or discuss its inner workings.
Today, however, at the Linley Fall Processor Conference, Intel discussed Tremont in depth and revealed its main features, microarchitectural enhancements, new instructions, and expected performance levels.
Intel's Tremont Architecture Is A Significant Departure From Its Predecessors
Tremont is a low-power, 10nm x86 microarchitecture that is the successor to Goldmont Plus, which is used on current Atom, Pentium Silver, and some Celeron series processors. Tremont is destined for compact, low-power packages and incorporates a number of ISA updates, enhanced security features, and more advanced power management. It also delivers significant IPC (Instructions per Cycle) improvements gen-over-gen versus Intel’s current low-power x86 architectures.
Tremont is also a significant departure from Goldmont Plus and its predecessors. It features an Intel Core-class branch predictor and a 6-wide out-of-order instruction decoder on the front end, with 4-wide allocation, 10 execution ports on the back end, and dual load and store pipelines. Tremont is designed with up to four cores in mind, with up to 4.5MB of L2 cache, but the actual cache configuration will depend on the specific product design.
Tremont's branch predictor supports long branch histories and makes its predictions on a 32-byte basis. The L1 predictor carries no branch penalty, and the L2 predictor is larger than in previous-gen products. The fetch unit features a 32KB instruction cache that delivers 32 bytes per cycle and can handle up to 8 outstanding misses while the processor continues executing instructions.
The 6-wide x86 instruction decoder in Tremont is split into dual, 3-wide clusters. As mentioned, it is an out-of-order design, and it achieves wide decode without spending die area on a uOP cache. The decoder can also be scaled back to a single, 3-wide cluster, depending on the target product’s design.
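As a rough, back-of-the-envelope illustration of how the quoted widths interact, the sketch below computes a peak per-cycle bound for the front end. The function name, the simple "narrowest stage wins" model, the average instruction length, and the one-instruction-equals-one-uOP simplification are all assumptions made for illustration, not figures or methods from Intel.

```python
def frontend_peak_per_cycle(clusters=2, width_per_cluster=3,
                            alloc_width=4, fetch_bytes=32,
                            avg_instr_bytes=4.0):
    """Toy upper bound on instructions moved through the front end per cycle.

    Widths come from the article (dual 3-wide decode clusters, 4-wide
    allocation, 32 bytes fetched per cycle). avg_instr_bytes is a
    hypothetical average x86 instruction length, and each instruction
    is treated as a single uOP.
    """
    decode_width = clusters * width_per_cluster   # 2 x 3 = 6 in the full design
    fetch_bound = fetch_bytes / avg_instr_bytes   # instructions per fetched block
    # The narrowest stage limits the sustained rate in this simple model.
    return min(decode_width, alloc_width, fetch_bound)


print(frontend_peak_per_cycle())            # full dual-cluster decoder
print(frontend_peak_per_cycle(clusters=1))  # scaled-back single cluster
```

Under these assumptions, the full design is bounded by the 4-wide allocation stage rather than by decode, while the single-cluster variant is bounded by its 3-wide decoder.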
Tremont has an out-of-order window of more than 200 entries (208, specifically), with 6 parallel reservation stations. It features 3 ALUs, 2 AGUs, 1 jump port, and 1 store data port.
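The unit counts above can be tallied against the ten back-end execution ports quoted earlier. This is only an arithmetic sketch: it assumes each listed unit sits on its own port and that the ten-port figure covers the entire back end, neither of which is spelled out in Intel's disclosure as reported here. The variable names are illustrative.

```python
# Units listed in the paragraph above, assumed to occupy one port each.
LISTED_PORTS = {"ALU": 3, "AGU": 2, "jump": 1, "store data": 1}

# Total back-end execution ports quoted earlier in the article.
TOTAL_BACKEND_PORTS = 10

listed_ports = sum(LISTED_PORTS.values())
remaining_ports = TOTAL_BACKEND_PORTS - listed_ports

print(listed_ports, remaining_ports)
```

Under these assumptions the listed units account for seven ports, leaving three of the ten for the remaining execution units.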
There are also dual 128b AES units present in Tremont, and a single SHA256 instruction can be handled in only 4 cycles, which benefits cryptographic workloads. Galois Field New Instructions (GFNI) support is present as well. There are two parallel reservation stations in the design, with three execution ports.
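For readers who want to check whether a given machine exposes these instruction-set extensions, here is a minimal sketch that scans Linux /proc/cpuinfo text for the relevant flags. It assumes an x86 Linux system, where the kernel reports them as aes, sha_ni, and gfni. The helper name and the sample string are made up for illustration.

```python
def crypto_flags(cpuinfo_text):
    """Report which crypto-related CPU flags appear in /proc/cpuinfo text."""
    wanted = ("aes", "sha_ni", "gfni")  # Linux flag names on x86
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first colon is a space-separated flag list.
            present = set(line.split(":", 1)[1].split())
            return {name: name in present for name in wanted}
    # No flags line found (for example, a non-x86 system).
    return {name: False for name in wanted}


# Hypothetical, heavily shortened sample of /proc/cpuinfo content.
sample = "processor\t: 0\nflags\t\t: fpu sse2 aes sha_ni gfni\n"
print(crypto_flags(sample))
```

On a real Linux system, the same function can be fed the output of open("/proc/cpuinfo").read().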
Tremont features dual load/store pipelines, with 32KB of data cache (3-cycle latency), and a 1024-entry second-level translation lookaside buffer (TLB).
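To put the second-level TLB size in perspective, the sketch below estimates its "reach", meaning the amount of memory it can map at once. It assumes every entry maps a standard 4 KiB page. The article does not say which page sizes the TLB holds, so treat the result as a rough illustration.

```python
def tlb_reach_bytes(entries=1024, page_bytes=4096):
    """Memory covered when every TLB entry maps one page of page_bytes."""
    return entries * page_bytes


reach_mib = tlb_reach_bytes() // (1024 * 1024)
print(reach_mib, "MiB")
```

With 4 KiB pages, 1024 entries would cover 4 MiB of address space, and larger pages would extend that reach proportionally.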
Tremont's L2 cache is shared across the cores and can scale from 1.5MB up to 4.5MB. There is also last-level cache (LLC) support built in, which is a first for Intel’s low-power designs, though it will not be implemented in every design. Intel Resource Director Technology incorporates QoS for the L2 and LLC to optimize performance and use of bandwidth.