NVIDIA's 64-bit Tegra K1: The Ghost of Transmeta Rides Again, Out of Order
Project Denver, Transmeta, and 64-bit ARM
Project Denver's 64-bit flavor.
When it designed Project Denver, NVIDIA chose to step away from the out-of-order execution engine that typifies all modern high-end ARM and x86 processors. In an OoOE design, the CPU itself decides which instructions to execute in any given cycle, reordering them around stalls as it goes. OoOE chips tend to be much faster than their in-order counterparts, but the additional scheduling silicon burns power and takes up die space.
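To make that distinction concrete, here's a toy Python sketch (purely illustrative, nothing like Denver's actual hardware) of a single-issue machine: the in-order version stalls whenever the next instruction's inputs aren't ready, while the out-of-order version can issue any instruction whose inputs are available.

# Toy illustration (not NVIDIA's design): contrast in-order issue, which stalls
# whenever the next instruction's inputs aren't ready, with out-of-order issue,
# which may pick any instruction whose inputs are ready.

# Each instruction: (name, registers_it_reads, register_it_writes, latency_in_cycles)
PROGRAM = [
    ("load r1", [],     "r1", 3),   # long-latency load
    ("add r2",  ["r1"], "r2", 1),   # depends on the load
    ("mul r3",  ["r2"], "r3", 1),   # depends on the add
    ("load r4", [],     "r4", 3),   # independent work
    ("add r5",  ["r4"], "r5", 1),
]

def run(out_of_order: bool) -> int:
    """Return the cycle count for a single-issue machine."""
    ready_at = {}              # register -> cycle its value becomes available
    pending = list(PROGRAM)
    cycle = 0
    while pending:
        cycle += 1
        # Instructions whose inputs are available this cycle.
        issuable = [i for i in pending
                    if all(ready_at.get(r, 0) <= cycle for r in i[1])]
        if out_of_order:
            pick = issuable[0] if issuable else None    # oldest ready op, any position
        else:
            head = pending[0]                           # program order only
            pick = head if head in issuable else None
        if pick:
            pending.remove(pick)
            ready_at[pick[2]] = cycle + pick[3]         # result ready after latency
    return cycle

print("in-order cycles:    ", run(out_of_order=False))
print("out-of-order cycles:", run(out_of_order=True))

Running it, the out-of-order schedule finishes in fewer cycles because it slips independent work underneath the long-latency loads; that reordering logic is exactly the silicon an in-order design saves.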
What NVIDIA has developed instead is an in-order architecture that relies on a dynamic optimization program (running on one of the two CPU cores) to work out the most efficient way to execute the code it profiles. The optimized routines are then stored in a dedicated 128MB buffer of main memory.
The advantage of optimizing code once and storing the result is that the chip doesn't have to repeat the work the next time that code runs; it can simply grab the optimized version from memory. Furthermore, this kind of approach may pay particular dividends on tablets, where users tend to stick to a small subset of applications. Once Denver has seen you run Facebook or Candy Crush a few times, it has the code optimized and waiting; there's no need to keep decoding it for execution over and over.
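The mechanism is loosely analogous to memoization in software: pay the translation and optimization cost once, then reuse the stored result. A minimal Python sketch of that caching idea, with hypothetical names and a sleep standing in for the expensive optimization pass:

import time

# Hypothetical sketch of an optimization cache: the first time a hot code
# region is seen, an expensive "optimize" pass runs and its result is stored;
# every later execution of the same region reuses the stored result.

optimization_cache = {}   # region address -> optimized routine (stand-in)

def optimize(region_addr):
    """Stand-in for the expensive dynamic-optimization pass."""
    time.sleep(0.05)                      # simulate the one-time cost
    return f"optimized-code-for-{region_addr:#x}"

def execute_region(region_addr):
    routine = optimization_cache.get(region_addr)
    if routine is None:                   # first encounter: optimize and store
        routine = optimize(region_addr)
        optimization_cache[region_addr] = routine
    return routine                        # later encounters: straight reuse

start = time.perf_counter(); execute_region(0x4000)
print("first run: ", time.perf_counter() - start)
start = time.perf_counter(); execute_region(0x4000)
print("second run:", time.perf_counter() - start)

The second call returns almost instantly because the work was already done and cached; Denver's optimization buffer in main memory plays the same role for hot code regions.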
To be clear, we're not claiming Project Denver is a Transmeta rehash. Transmeta used a native VLIW (very long instruction word) architecture and a translation engine to run x86 code on a non-x86 CPU. Project Denver doesn't do this; the entire chip is ARM-compatible from start to finish. The one characteristic it does share with a VLIW CPU is width: Denver packs a great many execution units and can execute up to seven instructions in a single cycle.
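To give a feel for what seven-wide means, here's a toy bundler (again, just an illustration, not Denver's scheduler or its optimizer's output format) that greedily packs a straight-line instruction stream into groups of up to seven operations with no dependencies inside a group:

ISSUE_WIDTH = 7   # Denver can execute up to seven instructions per cycle

# Each instruction: (name, registers_read, register_written)
STREAM = [
    ("ld  r1", [],           "r1"),
    ("ld  r2", [],           "r2"),
    ("add r3", ["r1", "r2"], "r3"),
    ("ld  r4", [],           "r4"),
    ("mul r5", ["r3", "r4"], "r5"),
    ("ld  r6", [],           "r6"),
    ("add r7", ["r6"],       "r7"),
]

def bundle(stream, width=ISSUE_WIDTH):
    """Greedily pack instructions into groups that could issue together:
    no instruction in a group may read a register written earlier in that group."""
    groups, current, written = [], [], set()
    for name, reads, writes in stream:
        dependent = any(r in written for r in reads)
        if dependent or len(current) == width:
            groups.append(current)
            current, written = [], set()
        current.append(name)
        written.add(writes)
    if current:
        groups.append(current)
    return groups

for i, group in enumerate(bundle(STREAM), 1):
    print(f"cycle {i}: issue {group}")

Real code rarely offers seven independent operations every single cycle, which is why the dynamic optimizer's job of finding and arranging that parallelism ahead of time matters so much.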
Denver's large L1 instruction cache (128KB, compared with 32KB for a typical Cortex-A15) is partly a nod to the need to keep more of those optimized instructions sitting local to the CPU. The chip also has a larger-than-normal L1 data cache (64KB, versus 32KB on the Cortex-A15).
Tirias Research's whitepaper also reports that the chip includes a new, deeper sleep state than previous chips offered. The larger cores are expected to compensate for the lower core count through their wide design and optimized performance, though NVIDIA will still have to fight an uphill marketing battle over core counts.
I don't want to lean too heavily on marketing benchmarks, particularly SPEC results; the test is common, but it's also easy to optimize for in corner cases. Still, this chip is fascinating. NVIDIA has taken the parts of Transmeta's initial approach that made sense and adapted them for the modern market and the ARM ecosystem, while pairing them with the excellent performance of Tegra K1's Kepler-based GPU.
The 64-bit race just got a lot more interesting.