Bulldozer Deep Dive, Conclusion
AMD still isn't talking about release dates when it comes to the company's next-generation server / workstation / enthusiast processor, but we now know much more than we did about the inner workings of the 'dozer core. This is AMD's first complete redesign since it built the K7 in 1999, and that's an inherently risky process. AMD refers to Bulldozer as a "third way" between simultaneous multithreading (Hyper-Threading) and Chip Multi-Processing (multiple discrete cores on one die). Specifically, AMD started with two discrete cores, as shown in the left-hand slide below, then fused them together into a single, mostly shared design. This paid off handsomely as far as design efficiency was concerned; Bulldozer's second integer unit increased the die size by just 12 percent.
Two discrete cores on the left; AMD's Bulldozer combination on the right.
We're dubious as to whether AMD's approach qualifies as a "third way"; perhaps it's more aptly characterized as "Hyper-Threading, Evolved." Intel's Hyper-Threading technology improves core efficiency by scheduling multiple threads for simultaneous execution. Alternatively, in a situation where the processor is stalled waiting on Thread A, the scheduler is more than happy to crunch away on Thread B. This keeps the processor's execution units busier than they'd otherwise be, but standard Hyper-Threading doesn't provide the CPU with any additional execution hardware.
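To make the stall-filling idea concrete, here is a toy Python sketch. It is our own illustration, not Intel's actual scheduler: the function names, the "W" (work) and "S" (stall) encoding, and the fixed thread priority are all assumptions made for the example. Each thread is a string of cycles, and only one thread may use the single execution unit per cycle.

```python
def cycles_back_to_back(threads):
    """Run the threads one after another: every work or stall slot costs a cycle."""
    return sum(len(t) for t in threads)


def cycles_smt(threads):
    """Toy SMT model: threads share ONE execution unit.

    "W" = the thread needs the execution unit this cycle.
    "S" = the thread is stalled (e.g. waiting on memory) and needs no unit.
    Stalls elapse in parallel; at most one "W" is serviced per cycle,
    with earlier threads in the list taking priority (an arbitrary choice).
    """
    pos = [0] * len(threads)
    cycles = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        cycles += 1
        unit_free = True
        for i, t in enumerate(threads):
            if pos[i] >= len(t):
                continue  # this thread has finished
            if t[pos[i]] == "S":
                pos[i] += 1  # the stall passes without using the unit
            elif unit_free:
                pos[i] += 1  # claim the execution unit for this cycle
                unit_free = False
    return cycles


thread_a = "WSSW"
thread_b = "WWSW"
print(cycles_back_to_back([thread_a, thread_b]))  # 8
print(cycles_smt([thread_a, thread_b]))           # 5
```

In this model, two threads that never stall ("WWW" and "WWW") still take six cycles under SMT, which is the point made above: the technique fills idle slots but adds no execution hardware.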
With Bulldozer, AMD has taken the concept of SMT and added a second independent integer unit. The following block diagram illustrates which sections of the new chip are shared and which are not. According to AMD, the company aggressively researched which core blocks needed to be duplicated and which could be combined before arriving at the present balance. We're going to circle back around to the ramifications of this decision, but let's take a closer look at the core first.
Under The Hood
Unlike AMD's earlier processors from K7 onward, which had just three x86 decoders, Bulldozer has four. That puts it on par with Nehalem. As we saw with Bobcat, Bulldozer's branch prediction unit has been aggressively tuned for high performance as well: the branch prediction and instruction fetch logic have been decoupled, which means that an incorrect branch prediction won't stall the fetch unit (and vice versa). Again, that's different from Phenom, where these two units were tied to each other. The CPU's L1 instruction cache is 64K, but the associated data caches are much smaller. Each Bulldozer module has two independent 16K L1 data caches, for a total of 96K of L1 per module (down from Shanghai's 128K per core).
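As a quick sanity check on the L1 arithmetic, here is a minimal sketch that uses only the totals quoted above; the constant and function names are ours. Subtracting the 64K instruction cache from the 96K module total leaves 16K for each of the two data caches.

```python
# Figures quoted in the text, in KB.
L1_INSTRUCTION_KB = 64        # one instruction cache, shared by the module
L1_TOTAL_PER_MODULE_KB = 96   # instruction cache plus both data caches
DATA_CACHES_PER_MODULE = 2    # one per integer core
SHANGHAI_L1_PER_CORE_KB = 128


def data_cache_kb(total_kb, instruction_kb, data_caches):
    """Size of each data cache implied by the module's L1 total."""
    return (total_kb - instruction_kb) // data_caches


def l1_per_core_kb(total_kb, cores):
    """Average L1 capacity per core if the module total is split evenly."""
    return total_kb / cores


print(data_cache_kb(L1_TOTAL_PER_MODULE_KB, L1_INSTRUCTION_KB,
                    DATA_CACHES_PER_MODULE))          # 16
print(l1_per_core_kb(L1_TOTAL_PER_MODULE_KB, 2))      # 48.0
print(SHANGHAI_L1_PER_CORE_KB)                        # 128
```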
If Bulldozer's FPU lives up to its promise on paper, AMD's new core could be a floating-point gorilla. The shared FPU is capable of tracking two hardware threads (one from each core) and has two MMX integer units and two 128-bit FMAC units. At first glance, this looks more-or-less identical to Phenom II's FPU, but AMD assures us that the FPU at the heart of Bulldozer is more capable. What we do know is that Bulldozer adds support for SSE4.1, SSE4.2, and Intel's AVX extensions.
As far as the OS is concerned, each Bulldozer module will appear as a dual-core processor, just as an Intel Hyper-Threaded processor is treated as having 2x its actual number of physical cores. This leads us back to the question of exactly how many cores each Bulldozer module contains. AMD claims that one Bulldozer module delivers 80 percent of conventional dual-core performance "with much less area and power." This strikes us as decidedly optimistic and is undoubtedly highly dependent on workload. It's by no means certain that 'dozer will be able to deliver what it promises; AMD's next-generation SMT will only function optimally if the company has nailed its cache ratios, execution unit distribution, branch prediction, and available memory bandwidth.
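For a rough feel of what AMD's claim would mean if it held, here is a back-of-the-envelope sketch. It rests on assumptions of ours that AMD has not confirmed: that a conventional dual-core design has exactly twice the area and twice the throughput of a single core, and that the 12 percent die-size figure quoted earlier is measured against a single-core design. The function and parameter names are hypothetical.

```python
def throughput_per_area(throughput, area):
    """Throughput delivered per unit of die area (arbitrary units)."""
    return throughput / area


def module_vs_dual_core(scaling=0.80, extra_area=0.12):
    """Ratio of a Bulldozer module's throughput-per-area to a dual core's.

    Assumptions (ours, not AMD's):
      - a single core has throughput 1.0 and area 1.0
      - a conventional dual core has throughput 2.0 and area 2.0
      - a module delivers `scaling` times dual-core throughput
      - a module occupies (1.0 + extra_area) single-core areas
    """
    dual_core = throughput_per_area(2.0, 2.0)
    module = throughput_per_area(scaling * 2.0, 1.0 + extra_area)
    return module / dual_core


print(round(module_vs_dual_core(), 2))  # 1.43
```

Under these assumptions the module delivers roughly 43 percent more throughput per unit of area than two discrete cores, and it would only break even if the scaling figure fell to about 56 percent. Real workloads, shared-resource contention, and the true area accounting could move that number substantially in either direction.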
An Uncertain Future
Designing microprocessors is like playing Russian roulette. You put a gun to your head, pull the trigger, and find out four years later if you blew your brains out.
-- Robert Palmer, former CEO of Digital Equipment Corporation.
On paper, Bulldozer looks good. It's a notable jump forward for AMD, it adds features that will help equalize the playing field against Intel, and it incorporates power-saving technologies AMD hasn't previously adopted. If some of the bets AMD has taken pay off, the company could be in a position to compete for the performance crown for the first time in years.
The problem with brand-new architectures, however, is that they often don't turn out as expected. In the past 29 years, Intel has released five separate architectures that either ultimately failed (iAPX 432, i860, Netburst), captured just a fraction of their originally intended markets (IA-64), or were repurposed for a different use (i960). In each case, the company sank years of work and vast amounts of capital into its design efforts. Palmer's quote doesn't just apply to Intel—the last thirty years are littered with the bones of once-powerful companies brought low, at least in part, by betting on the wrong microarchitectural horse.
With Bulldozer, AMD is taking two significant risks. First, the company has chosen to build its first Bulldozer processor on brand-new 32nm production lines. Most companies try to avoid this—Intel's entire Tick-Tock model was deployed after Prescott's simultaneous core debut and process shrink resulted in a power-sucking Netburst furnace as opposed to a Northwood 2.0. The danger here is that problems at the foundry level can cause significant product delays. To be fair, AMD may not have felt it had much choice, given Intel's already substantial process transition lead.
The second risk has to do with just how much of a performance boost Bulldozer can actually deliver compared to Phenom II. When AMD's Phenom core debuted in 2007, it was instantly obvious that the desktop version of Barcelona would be hard-pressed to match Intel's 65nm Conroe Core 2 parts, much less the then-new 45nm Penryn chips. Given what AMD had to work with, Shanghai (Phenom II) was probably the best core it could've been; it definitively surpassed Conroe and matched up well against Penryn.
Then, of course, came Core i7. AMD has been forced to compete by slashing prices and selling larger chips with higher core counts on a much smaller margin than is healthy. This has helped the company achieve price / performance parity in certain multi-threaded workloads, but Shanghai's single-thread performance lags Core i7's by a significant margin. With Sandy Bridge dropping in the next few months, the gap between the two is only going to get wider.
For Bulldozer to win back any of the server / high-end market share AMD has lost in recent years, it has to arrive with a nearly perfect mixture of power efficiency, scalability, base clock frequency, and improved single-thread performance. That's a very tall order for any company; there's simply no guarantee that AMD will be able to deliver a chip that meets such high standards the first time around. We're hopeful, but cautious.