NVIDIA's 64-bit Tegra K1: The Ghost of Transmeta Rides Again, Out of Order

Ever since NVIDIA unveiled its 64-bit Project Denver CPU at CES last year, there's been discussion over what the core might be and what kind of performance it would offer. Visibly, the chip is huge -- more than 2x the size of the Cortex-A15 that powers the 32-bit version of Tegra K1. Now we know a bit more about the core, and it's like nothing we expected. It is, however, somewhat similar to the designs we've seen in the past from the vanished CPU manufacturer Transmeta.

Project Denver, Transmeta, and 64-bit ARM


Project Denver's 64-bit flavor.

When it designed Project Denver, NVIDIA chose to step away from the out-of-order execution engine that typifies all modern high-end ARM and x86 processors. In an OoOE design, the CPU itself is responsible for deciding which code should be executed at any given cycle. OoOE chips tend to be much faster than their in-order counterparts, but the additional silicon burns power and takes up die space.
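To make the in-order versus out-of-order distinction concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: a single-issue core, made-up instruction names and latencies, and no modeling of caches, branch prediction, or Denver's actual pipeline. It shows how an out-of-order core can hide a slow load by issuing an independent instruction while an in-order core sits stalled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str      # hypothetical label for the instruction
    deps: tuple    # names of instructions whose results this one needs
    latency: int   # cycles until this instruction's result is available


def run(program, out_of_order):
    """Return the cycle at which the last result is ready on a toy single-issue core."""
    done_at = {}             # name -> cycle at which the result becomes available
    pending = list(program)  # instructions not yet issued, in program order
    cycle = 0
    while pending:
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: the oldest *ready* pending instruction may issue.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(d in done_at and done_at[d] <= cycle for d in ins.deps):
                done_at[ins.name] = cycle + ins.latency
                pending.remove(ins)
                break  # single-issue: at most one instruction per cycle
        cycle += 1
    return max(done_at.values())


# Two independent load/add pairs; latencies are illustrative assumptions.
PROGRAM = [
    Instr("load_a", (), 4),
    Instr("add_b", ("load_a",), 1),
    Instr("load_c", (), 4),
    Instr("add_d", ("load_c",), 1),
]

if __name__ == "__main__":
    print("in-order cycles:    ", run(PROGRAM, out_of_order=False))
    print("out-of-order cycles:", run(PROGRAM, out_of_order=True))
```

In this toy program the in-order core finishes in 10 cycles and the out-of-order core in 6, because the second load is issued while the first is still outstanding. Reordering the program ahead of time so the two loads sit next to each other would let the in-order core reach the same result, which is the kind of work a software optimizer can take off the hardware's hands.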

NVIDIA has instead developed an in-order architecture that relies on a dynamic optimization program (running on one of the two CPU cores) to work out the most efficient way to execute a given piece of code. The optimized result is then stored in a dedicated 128MB buffer of main memory.

The advantage of working out and storing the most optimized execution method is that the chip doesn't have to repeat that work the next time the same code is needed -- it can simply grab the optimized version from memory. Furthermore, this kind of approach may pay dividends on tablets, where users tend to use a small subset of applications. Once Denver sees you run Facebook or Candy Crush a few times, it's got the code optimized and waiting -- there's no need to keep decoding it for execution over and over.
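The optimize-once, reuse-later idea can be sketched in a few lines of Python. This is an illustration only, not NVIDIA's implementation: the class name, the `hot_threshold` of three runs, and the region labels are all assumptions made for the example, and the sketch does not model the 128MB capacity or any eviction policy.

```python
class OptimizationCache:
    """Toy model of a software optimizer that caches optimized code regions."""

    def __init__(self, hot_threshold=3):
        # hot_threshold is a hypothetical tuning knob: how many times a
        # region must run before it is considered worth optimizing.
        self.hot_threshold = hot_threshold
        self.run_counts = {}      # region -> times seen on the slow path
        self.optimized = {}       # region -> stored optimized form
        self.optimizer_calls = 0  # how often the expensive step ran

    def _optimize(self, region):
        # Stand-in for the expensive analysis and rescheduling step.
        self.optimizer_calls += 1
        return f"optimized({region})"

    def execute(self, region):
        """Run a code region; report which path was taken."""
        if region in self.optimized:
            return "cached"       # fast path: reuse the stored result
        self.run_counts[region] = self.run_counts.get(region, 0) + 1
        if self.run_counts[region] >= self.hot_threshold:
            self.optimized[region] = self._optimize(region)
        return "decoded"          # slow path: decode and run as-is


if __name__ == "__main__":
    cache = OptimizationCache(hot_threshold=3)
    for i in range(5):
        print(i, cache.execute("candy_crush_main_loop"))
    print("optimizer ran", cache.optimizer_calls, "time(s)")
```

With a threshold of three, the first three runs take the slow path and every run after that hits the stored version, while the expensive step runs exactly once. The payoff depends on code being re-run often, which is why the small application mix on a typical tablet suits this design.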



To be clear, we're not claiming Project Denver is a Transmeta rehash. Transmeta used a native VLIW (very long instruction word) architecture and a translation engine to run x86 code on a non-x86 CPU. Project Denver doesn't do this -- the entire chip is ARM-compatible start to finish. The one characteristic it shares with a VLIW CPU is a great many execution units -- Denver can execute up to 7 instructions in a single cycle.

Denver's large L1 instruction cache (128KB, compared to 32KB for a typical Cortex-A15) is partly a nod to the need to keep more optimized instructions sitting local to the CPU. The chip also has a larger than normal L1 data cache (64KB compared to 32KB on the Cortex-A15).



Tirias Research's whitepaper also reports that the chip includes a new sleep state deeper than any available on previous chips. The new, larger cores are expected to make up for the lower core count through their wide design and optimized execution, though NVIDIA will still have to fight an uphill marketing battle.



I don't want to depend too much on marketing benchmarks, particularly from SPEC -- the test is common, but it's also easy to optimize for in corner cases. Still, this chip is fascinating. NVIDIA has taken the parts of Transmeta's initial approach that made sense and adapted them for the modern market and the ARM ecosystem -- while pairing them with the excellent GPU performance of Tegra K1's Kepler-based solution.

The 64-bit race just got a lot more interesting.