IDF: Inside Nehalem
Nehalem is the first Intel processor to include onboard power sensors and an integrated Power Control Unit. Together, these let the processor monitor each core's current, power, and voltage in real time. One reason the sensors and the Power Control Unit matter so much is that they enable Nehalem to divert power from idle cores to active cores in what Intel calls "Turbo Mode." If a core's workload is close to saturating it, that core can tap into some of the power that would otherwise be budgeted for another core that is not currently in use. (A processor with four cores will not necessarily have all four busy at once. The number of cores in use at any given moment is largely a function of how multi-threaded the workloads are, and many of today's mainstream applications still take advantage of only one or two cores of a multi-core processor.) The additional power headroom available to any one core is limited, so the performance gains from Turbo Mode will be modest but measurable.
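To make the idea of shifting power from idle cores to busy ones concrete, here is a toy sketch in Python. It is not Intel's algorithm, which was not detailed in the presentation. The function name `turbo_budgets` and every wattage in it (per-core budget, idle draw, per-core cap) are made-up values chosen only to illustrate the principle.

```python
def turbo_budgets(active, core_budget_w=25.0, idle_draw_w=2.0, core_cap_w=35.0):
    """Return a per-core power budget in watts for a toy Turbo Mode model.

    active: list of booleans, one per core (True = core is running work).
    All default wattages are hypothetical, not Intel figures.
    """
    total = core_budget_w * len(active)   # fixed budget shared by all cores
    n_active = sum(active)
    n_idle = len(active) - n_active

    if n_active == 0:
        # Nothing is running, so every core sits at its idle draw.
        return [idle_draw_w] * len(active)

    # Idle cores keep only their small idle draw; the rest is up for grabs.
    available = total - idle_draw_w * n_idle

    # Split what is left among the active cores, but never exceed the
    # per-core cap (the "limited additional headroom" in the text).
    per_active = min(available / n_active, core_cap_w)

    return [per_active if is_active else idle_draw_w for is_active in active]


if __name__ == "__main__":
    print(turbo_budgets([True, True, True, True]))     # all cores busy
    print(turbo_budgets([True, False, False, False]))  # one busy core
```

With all four cores busy, each one stays at its normal 25 W share in this model. With only one core busy, that core is raised to the 35 W cap rather than receiving everything the idle cores freed up, which is why the gain is modest rather than dramatic.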
Another problem that comes with process shrinks is that as feature sizes get smaller, cores become more sensitive to high voltages, and the window between the minimum and maximum allowable voltages narrows. This is a problem because if the processor receives too little or too much voltage, the contents of the CPU's cache can be lost or corrupted, which results in errors and decreased performance. Intel's solution was to increase the transistor count of Nehalem's L1 and L2 cache cells from the traditional six transistors (6T) per SRAM cell to eight transistors (8T) per cell, because 8T SRAM requires less voltage than 6T. Moving the core's L1 and L2 caches to an 8T cell design lets Intel better align the voltage requirements of the cache with the low-voltage needs of the rest of the processor.
Another way Intel kept Nehalem's power requirements relatively low (130 watts TDP) was by using static CMOS for all of the chip's datapaths. Static CMOS is more power efficient than other datapath circuit styles, such as Domino or LVS (which is what the Pentium 4 used). Intel claims that Nehalem is the "first high-performance IA processor in ~20 years with a fully static CMOS datapath." Static CMOS is traditionally slower than Domino or LVS, however, so Intel had to perform some in-chip algorithmic magic to compensate for the potential performance hit.
The net result of all of these architectural changes is a processor that is both powerful and power efficient. Intel's approach with Nehalem was not to make a high-performance processor as powerful as possible and then find ways to make it more power efficient. Instead, Intel took the approach of making the most power-efficient high-performance processor it could. We're looking forward to testing it for ourselves--it shouldn't be long.