|IDF: Inside Nehalem|
|Nehalem is the codename for Intel's next-generation Core microarchitecture, which was recently given the official processor family name "Core i7." Nehalem was one of the big topics of discussion at IDF--not just because it represents the next generation of Intel's processors, but also because the clock is winding down quickly on the chip's official public debut. Intel has not yet given an official date, but it is promising that we'll see Nehalem systems for sale sometime in Q4 of this year.
This article is the culmination of several IDF presentations, private meetings, and conversations with Intel representatives from this past week--most of the slides are from a presentation given by Rajesh Kumar, Intel Fellow and Director of Circuit & Low Power Technologies.
Intel claims that Nehalem represents the biggest platform architecture change to date. This might be true, but it is not a ground-up, completely new architecture design. An Intel representative told us that Nehalem "shares a significant portion of the P6 gene pool"--it does not include many new instructions and has a pipeline of approximately the same size as Penryn's. Nehalem is built upon Penryn, but with significant architectural changes to improve performance and power efficiency. It includes more external ports and deeper buffers. Nehalem will be manufactured on a 45nm process and will be the basis for Intel's forthcoming platforms across the desktop, server, and mobile spaces.
One of the biggest changes Intel made with Nehalem is integrating the memory controller directly into the processor (up to now it has been located in the Northbridge of Intel's chipsets). Nehalem natively supports DDR3 SDRAM, with three channels per socket and up to three DIMMs per channel. This three-channel memory architecture is a radical departure from the dual-channel SDRAM memory architecture that has existed since 2003 with the introduction of Springdale. Intel claims up to a 3.4x increase in memory bandwidth over Penryn. As DDR3 memory speeds get faster, Intel says we could potentially see up to a 6x increase in memory bandwidth. Intel also claims that memory latencies have improved by more than 40 percent.
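To give a feel for what three channels mean on paper, here is a minimal sketch of the usual theoretical peak-bandwidth arithmetic (channels x transfer rate x bus width). The DDR3 speed grades and the dual-channel DDR2-800 comparison point are assumptions chosen for illustration, not figures from Intel's presentation, and the result is a theoretical ceiling that does not attempt to reproduce Intel's 3.4x claim.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Each DDR channel is 64 bits (8 bytes) wide, so the peak is
    channels * transfers per second * bytes per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed example configurations (not taken from the article's sources).
configs = {
    "dual-channel DDR2-800": (2, 800),
    "triple-channel DDR3-1066": (3, 1066),
    "triple-channel DDR3-1333": (3, 1333),
}

for name, (channels, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(channels, rate):.1f} GB/s peak")
```

Real-world bandwidth will land below these numbers, and how far below depends on the memory controller and interconnect, which is where an integrated controller is meant to help.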
|QuickPath, L3 cache, & Hyper-Threading|
|With the memory controller moved out of the Northbridge chip, the front-side bus goes away as well. Instead, Nehalem uses a new interconnect that Intel calls "QuickPath." QuickPath is a point-to-point interconnect that Intel claims is much faster and more scalable than a front-side bus-based interconnect.
Another change for Nehalem is that it has three levels of cache--as opposed to the two levels of cache we're used to seeing on Intel's consumer-level processors. There is a 64K L1 cache (32K Instruction, 32K Data), a 512K "mid-level" L2 cache (256K Data, 256K Instructions), and a shared L3 cache whose size will depend on the particular version of the processor. Intel is taking a modular approach with Nehalem's design, so it will be easier to manufacture different versions of the chip with different features and L3 cache sizes. (The slide below titled "Core Designed for Performance" does not show the L3 cache.)
Nehalem also brings Hyper-Threading back to Intel processors. We haven't seen Hyper-Threading since the good old Pentium 4 days, save for Atom. While Hyper-Threading has in the past been criticized as being energy inefficient, Intel says the current iteration of Hyper-Threading is much more energy efficient. With Hyper-Threading, a processor with four physical cores appears to a system to have eight logical cores. Intel says it is also bringing Hyper-Threading to its Larrabee architecture.
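As a quick illustration of the physical-versus-logical distinction, the sketch below does the arithmetic for a two-threads-per-core design and then asks the operating system how many logical processors it sees. The helper name `logical_cores` is hypothetical; `os.cpu_count()` is the standard Python call and reports logical processors, so on a Hyper-Threading system it will typically return twice the physical core count.

```python
import os


def logical_cores(physical_cores: int, threads_per_core: int = 2) -> int:
    """Number of logical processors the OS would see.

    threads_per_core=2 models Hyper-Threading as described above.
    """
    return physical_cores * threads_per_core


print(logical_cores(4))  # 8

# What the machine running this script reports (may be None if unknown).
print(os.cpu_count())
```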
|Power Gates, Turbo Mode, 8T SRAM, & CMOS datapaths|
|One of the big problems that Intel has run into is that as its CPU process technologies shrink, leakage power increases. Intel says that this has been a problem since 2000 with its 135nm process and has become increasingly problematic with each die shrink since. It has taken a while, however, for Intel to develop a way to address this issue. Intel claims that Nehalem is very efficient at minimizing leakage power as a result of using what it calls "Power Gates" in place of traditional "Clock Gates" on the transistors. In addition to reducing leakage power, Power Gates enable idle cores to sleep (C6) while other cores continue to chug away.
For the first time in an Intel processor, Nehalem adds onboard power sensors and an integrated Power Control Unit. This allows the processor to perform real-time monitoring of each core's current, power, and voltage states. One of the reasons why the onboard power sensors and integrated Power Control Unit are so important is that they enable Nehalem to divert power from idle cores to active cores in what Intel calls "Turbo Mode." If a particular core's workload gets close to being saturated, it can tap into some of the power that would ordinarily go to one of the other cores if that core is not currently in use. (Just because a processor has four cores does not necessarily mean that all four will be utilized simultaneously. The number of cores utilized at any given moment is largely a function of how multi-threaded the various workloads are. Many of today's mainstream applications still only take advantage of one or two cores of a multi-core processor.) Obviously, there is limited additional power headroom that a given core can utilize, so the performance gains from Turbo Mode will be modest but measurable.
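The idea behind Turbo Mode can be sketched with a toy model: a fixed power budget is split among whichever cores are active, with a ceiling on how much any single core can absorb. This is only an illustration of the concept, not Intel's actual Power Control Unit algorithm; the function name and the wattage figures are hypothetical, and idle cores are assumed to draw nothing (as if fully power gated).

```python
def turbo_budget(active, budget_watts, per_core_cap_watts):
    """Toy model: share a power budget among active cores.

    active: list of booleans, one per core (True = core is busy).
    budget_watts: total power available to the cores (hypothetical).
    per_core_cap_watts: the most a single core can use (hypothetical),
        which models the limited headroom mentioned above.
    Returns the watts allotted to each core.
    """
    n_active = sum(active)
    if n_active == 0:
        return [0.0] * len(active)
    share = min(budget_watts / n_active, per_core_cap_watts)
    return [share if busy else 0.0 for busy in active]


# All four cores busy: the budget is split evenly.
print(turbo_budget([True, True, True, True], 100.0, 35.0))
# [25.0, 25.0, 25.0, 25.0]

# Two cores idle: the busy cores get more, but only up to the cap.
print(turbo_budget([True, True, False, False], 100.0, 35.0))
# [35.0, 35.0, 0.0, 0.0]
```

The cap is what keeps the gain modest in this model: with two cores idle, each busy core goes from 25 W to 35 W rather than to the full 50 W the budget would otherwise allow.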
Another problem that comes from die process reductions is that as the processes get smaller, cores become more sensitive to high voltages and the tolerable window between the minimum and maximum allowable voltages narrows. This can be problematic because if a processor gets too little or too much voltage, the contents of the CPU's cache can be lost or corrupted, which results in decreased performance and errors. Intel found a solution by upping the transistor count of Nehalem's L1 and L2 caches from the traditional six transistors (6T) per SRAM cell to eight transistors (8T) per SRAM cell--8T SRAM requires less voltage than 6T. By moving the core's SRAM-based L1 and L2 caches to an 8T cell design, Intel is able to better align the voltage requirements of the cache with the processor's low-voltage needs.
Another way Intel managed to keep the power requirements for Nehalem relatively low (130 watts TDP) was by using static CMOS for all of the chip's datapaths. Static CMOS is more power efficient than other datapath technologies, such as Domino or LVS (which is what the Pentium 4 used). Intel claims that Nehalem is the "first high-performance IA processor in ~20 years with a fully static CMOS datapath." Static CMOS is traditionally slower than Domino or LVS, so Intel had to perform some in-chip algorithm magic to compensate for the potential performance hit.
The net result of all of these architectural changes is a processor that is both powerful and power efficient. Intel's approach with Nehalem was not to make a high-performance processor as powerful as possible and then find ways to make it more power efficient. Instead, Intel took the approach of making the most power-efficient high-performance processor it could. We're looking forward to testing it for ourselves--it shouldn't be long.