February 1, 2004
For those of you
who are visual learners, this snapshot of the new Pentium 4
Prescott architecture block diagram should help put things
into perspective. What we will try to do in this
section is break things down for you in layman's terms,
rather than ramble on with technical drivel that will likely
have you looking at the back side of your eyelids before too
long. However, this new P4 architecture has changed
dramatically enough that it merits some discussion.
Prescott Block Diagram
Again, the major
physical changes of the new P4 Prescott are its 1MB of L2
cache, 16KB of L1 data cache and deeper 31 stage
pipeline. Let's look at what some of these hardware
enhancements bring to the table in features and performance.
Pentium 4 Architecture Enhancements
Picking up where Northwood left off
Prescott's new deeper
pipelined core has perhaps the most significant impact on
the core's performance and future scalability. Compared
with the 20 stage Northwood core, the extra 11 stages in
Prescott's pipeline will afford the processor much more
headroom for clock speeds in the future. In fact, Intel has
a 4GHz P4 on their roadmap this year, with 3.4 and 3.6GHz
flavors right around the corner in Q2. The downside of a
longer pipeline is the increased penalty that you take on a
missed branch prediction. With a pipeline that is 50+%
deeper than Northwood's, the stalls when the Branch
Prediction Unit misses its target can be crippling to
performance. Of course, the trade off is that if you can
scale the core speed high enough, the inefficiencies of a
deeper pipeline become less of an issue.
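To put rough numbers on that trade off, here is a back-of-the-envelope sketch. It is a toy model, not anything from Intel: the branch fraction, the misprediction rate, the base CPI and the idea that a missed branch costs about one full pipeline refill are all assumptions we picked for illustration.

```python
# Toy model: average time per instruction for a pipelined core,
# counting only branch misprediction stalls. Every rate below is
# an assumed, illustrative value, not a measured one.

def time_per_instruction_ns(clock_ghz, pipeline_stages,
                            branch_fraction=0.2,
                            mispredict_rate=0.05,
                            base_cpi=1.0):
    """Return the average nanoseconds spent per instruction.

    Assumes a mispredicted branch costs roughly one full pipeline
    refill, i.e. `pipeline_stages` cycles. Real penalties differ.
    """
    stall_cycles = branch_fraction * mispredict_rate * pipeline_stages
    cpi = base_cpi + stall_cycles
    return cpi / clock_ghz  # cycles divided by cycles-per-nanosecond


northwood_32 = time_per_instruction_ns(3.2, 20)  # 20 stage pipeline
prescott_32 = time_per_instruction_ns(3.2, 31)   # 31 stages, same clock
prescott_40 = time_per_instruction_ns(4.0, 31)   # 31 stages, 4GHz

print(f"20 stages @ 3.2GHz: {northwood_32:.4f} ns/instruction")
print(f"31 stages @ 3.2GHz: {prescott_32:.4f} ns/instruction")
print(f"31 stages @ 4.0GHz: {prescott_40:.4f} ns/instruction")
```

With these made-up rates, the deeper pipeline loses at the same clock speed and pulls ahead once the clock is scaled up, which is the bet a longer pipeline makes.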
Intel has been
buffing out their BPU for the Pentium 4 ever since its
first introduction in the Willamette core. Here's
where Prescott hopefully makes up some ground and avoids a
branch miss altogether, since coming back through and
flushing out the pipeline is like taking the long way home.
In fact, Intel claims to have enhanced both static and
dynamic branch prediction algorithms, such that the number
of actual branch misses is significantly lower with
Prescott versus Northwood. In some cases Prescott is
more accurate with branch prediction by a factor of 2X over
Northwood; in other scenarios the benefits are minimal.
Regardless, this is another critical area for Prescott,
since clock for clock, with a deeper pipeline, the core
could in fact be slower than Northwood without these
improvements.
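For readers wondering what "static" and "dynamic" prediction mean in practice, the sketch below compares two textbook schemes. These are not Intel's actual algorithms, which are not described here. The static rule is the classic "backward taken, forward not taken" guess, the dynamic predictor is a 2-bit saturating counter, and the branch pattern is a made-up example.

```python
# Textbook branch predictors, for illustration only.

def static_btfnt(is_backward):
    """Static rule: predict backward branches (loops) as taken and
    forward branches as not taken."""
    return is_backward


class TwoBitCounter:
    """Dynamic predictor: a 2-bit saturating counter with states
    0..3. States 2 and 3 predict 'taken'."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move toward 3 on taken, toward 0 on not taken, and saturate.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_misses(outcomes, predictor):
    """Run the predictor over real outcomes and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Hypothetical forward branch that is taken 9 times out of 10.
outcomes = ([True] * 9 + [False]) * 100

static_misses = sum(
    1 for taken in outcomes if static_btfnt(is_backward=False) != taken
)
dynamic_misses = count_misses(outcomes, TwoBitCounter())

print("static misses: ", static_misses)
print("dynamic misses:", dynamic_misses)
```

The static rule guesses wrong on every taken instance of this branch, while the counter learns the pattern after the first miss and then only trips on the occasional not-taken case. Each of those misses costs a pipeline flush, which is why the quality of the predictor matters more as the pipeline gets deeper.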
Larger On Chip Caches And Buffers:
Prescott's larger 1MB
L2 cache really needs no explanation, except to say that a
larger cache means the core needs to go off chip less often
to fetch data from system memory while it is processing.
The benefits of a larger L2 cache will show up
especially where applications have a larger footprint that
historically needed to run from system memory but can now
reside largely in on chip cache.
In addition, Prescott brings a larger 16KB L1 data cache,
but when compared to the 64KB that is currently on AMD's
Athlon 64, it still seems a bit smallish. Finally, Prescott
brings additional "WC Buffers" to bear for transfers to
the graphics subsystem. These "Write Combining"
buffers will assist with flow management of data across the
AGP bus, which will in turn provide more efficient use of
front side bus bandwidth. While this may seem like a
solid benefit for current AGP graphics solutions, the real
benefit will most likely come down the road, when PCI
Express based graphics cards will require more system
bandwidth.
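To illustrate the point about application footprint and L2 size, here is a small simulation. It is a deliberately simplified sketch: the cache is modeled as fully associative with LRU replacement and 64 byte lines, and the workload is a hypothetical program that loops over a 768KB working set. Real L2 caches are set associative and real access patterns are messier, so treat the numbers as a cartoon of the effect.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size
KB = 1024


def hit_rate(cache_bytes, working_set_bytes, passes=4):
    """Hit rate of an idealized LRU cache for a program that walks
    its whole working set in order, `passes` times."""
    capacity = cache_bytes // LINE_BYTES
    lines = working_set_bytes // LINE_BYTES
    cache = OrderedDict()  # keys are line numbers, oldest first
    hits = accesses = 0
    for _ in range(passes):
        for line in range(lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)  # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses


# 512KB matches Northwood's L2 size, 1MB matches Prescott's.
rate_512k = hit_rate(512 * KB, 768 * KB)
rate_1m = hit_rate(1024 * KB, 768 * KB)

print(f"512KB cache hit rate: {rate_512k:.2f}")
print(f"1MB cache hit rate:   {rate_1m:.2f}")
```

In this model the 768KB loop never fits in the smaller cache, so LRU keeps evicting lines just before they are needed again. With 1MB, everything after the first pass is served from cache.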
Multithreaded applications will benefit from Prescott's
enhanced Hyperthreading engine. Essentially, increased queue
sizes and the chip's larger L1 data cache will alleviate
some bottlenecks in situations where there is more than one
active thread being processed. In addition, there are
specific "context identifier" bits now available on each of
Prescott's logical processing units. This will allow
for sharing of L1 data cache by both logical processors,
thus reducing instances of contention in cache and
increasing cache hit rate during multithreaded processing.
13 New SSE3 Instructions:
The new SSE3
instructions that Prescott brings will have little impact
on performance currently but will provide enhancements for
developers of multimedia and gaming products.
Floating point to integer conversions, complex arithmetic
functions, video encoding and thread synchronization will
be some of the special functions that can be called with
SSE3. Like the early days of SSE and SSE2, it will
take a while for adoption, but expect SSE3 to have
significant impact in the future, just as SSE does for
Northwood in today's optimized applications.
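As a taste of the "complex arithmetic" case, the sketch below emulates three of the new SSE3 instructions (MOVSLDUP, MOVSHDUP and ADDSUBPS) on plain Python lists and uses them to multiply two pairs of complex numbers at once. This is only an emulation of the instruction semantics for illustration. The helper names are ours, and real code would use compiler intrinsics or assembly on four packed single precision floats.

```python
# Each "register" is a list of four floats, laid out as
# [re0, im0, re1, im1], i.e. two complex numbers per register.

def movsldup(v):
    """MOVSLDUP: duplicate the even (low) elements."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """MOVSHDUP: duplicate the odd (high) elements."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """ADDSUBPS: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """MULPS (plain SSE): element-wise multiply."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap re/im within each pair (a SHUFPS-style shuffle, plain SSE)."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul_packed(x, y):
    """Multiply two pairs of complex numbers.

    For x = a + bi and y = c + di the product is
    (ac - bd) + (ad + bc)i.
    """
    left = mulps(movsldup(x), y)               # [a*c, a*d, ...]
    right = mulps(movshdup(x), swap_pairs(y))  # [b*d, b*c, ...]
    return addsubps(left, right)               # [ac-bd, ad+bc, ...]


x = [1.0, 2.0, 3.0, 4.0]  # (1 + 2i) and (3 + 4i)
y = [5.0, 6.0, 7.0, 8.0]  # (5 + 6i) and (7 + 8i)
product = complex_mul_packed(x, y)
print(product)
```

The point of ADDSUBPS is the mixed subtract and add in one step, which is exactly the shape of a complex multiply. Without it, the sign flip takes extra instructions.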
90 Nanometer (.09 micron) Manufacturing Process:
So, how do you fit
all these new enhancements on your plate and keep power,
heat, cost and defect density per wafer under control?
It's all about die size, baby. As we noted earlier, a
Prescott core is less than half the size of a .13 micron
based Pentium 4 Extreme Edition at 112 square mm versus 237
square mm. Drop that svelte new core on a 300mm (think
Pizzeria sized here) wafer and the guys back in finance
begin to actually smile at the profit margins. Not to
mention, customers may actually enjoy lower retail prices!
Intel has one of the first fab lines in the world to hit the
90nm (.09 micron) mark in volume production. The
operative word is "volume" here. There are other
manufacturers, like TSMC, that currently have 90nm technology
up and running but aren't quite volume ready at this point.
Prescott's launch marks another milestone for Intel and
another industry first for the chip giant.
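To see why the finance guys are smiling, here is a quick estimate of how many candidate dies fit on a 300mm wafer at the two die sizes quoted above. It uses a common rule-of-thumb approximation for gross dies per wafer and ignores yield, scribe lines and edge exclusion, so the absolute counts are only ballpark figures under those assumptions.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Rule-of-thumb gross die count: wafer area divided by die
    area, minus a correction for partial dies lost at the edge.
    Ignores yield, scribe lines and edge exclusion."""
    radius = wafer_diameter_mm / 2
    wafer_area = math.pi * radius ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


prescott_dies = gross_dies_per_wafer(300, 112)  # 112 square mm die
p4ee_dies = gross_dies_per_wafer(300, 237)      # 237 square mm die

print("Prescott dies per wafer:", prescott_dies)
print("P4EE dies per wafer:    ", p4ee_dies)
print(f"ratio: {prescott_dies / p4ee_dies:.2f}x")
```

By this estimate, the smaller die yields a bit more than twice as many candidates per wafer, before you even account for the fact that smaller dies are less likely to land on a defect.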