Yesterday, we chided AMD for its decision not to reveal more details about Bulldozer and Bobcat, but it turns out we didn't have all the facts. AMD was planning on disclosing more information later in the day at Hot Chips—but the company failed to disclose that before we went live with our previous coverage. We're going to take a look at the new information about Bobcat and Bulldozer that's subsequently been revealed; if you want more general background data, check the links above.
Bully For Bobcat
We'll start with the high-level block diagram of Bobcat's architecture, then step through some of the pertinent details. Bobcat shares certain characteristics with Atom, but AMD's new low-power processor is designed to meet a very different set of criteria. As we've previously discussed, Bobcat is an out-of-order core—a feature common to every modern desktop processor, but one that Atom lacks. This fact alone virtually guarantees that Bobcat will outperform Atom clock-for-clock, but it also implies the chip will use more power.
Bobcat features a 64K L1 cache (32K instruction / 32K data) and 512K of L2 cache per core—a dual-core Ontario processor will feature a total of 1MB of L2 cache. Bobcat's branch predictor is described as "state of the art," a claim that's hard to parse without additional information. One new feature, however, is that the branch predictor shuts down whichever of its units aren't in use in order to reduce overall power consumption.
Bobcat's decoder (above, in red) takes a page from Intel's playbook. Like Atom's, it's a dual-issue design that focuses on instruction efficiency. Modern x86 processors don't actually execute x86 instructions. Ever since the Pentium Pro, x86 microprocessors have translated x86 instructions into micro-ops before performing any calculations. When it designed Atom, Intel opted not to break most x86 instructions into multiple micro-ops, mapping them directly instead and, in some cases, combining multiple instructions into a single micro-op.
AMD claims that Bobcat can "directly map 89 percent of x86 instructions to a single micro-Op, an additional 10 percent to a pair of micro-ops, and more complicated x86 instructions (<1%) are micro-coded." Intel quoted very similar figures when it unveiled Atom's decoder, but there are likely to be subtle differences in the capabilities of the two cores.
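Taken at face value, AMD's quoted distribution implies an average micro-op expansion very close to 1:1. A quick back-of-the-envelope model (the micro-op count for microcoded instructions is our assumption; AMD didn't quote one):

```python
# Estimate the average number of micro-ops emitted per x86 instruction
# from AMD's quoted decode distribution for Bobcat.
def avg_micro_ops(single=0.89, paired=0.10, microcoded=0.01,
                  microcoded_uops=4):
    # microcoded_uops=4 is an illustrative assumption, not an AMD figure.
    return single * 1 + paired * 2 + microcoded * microcoded_uops

print(round(avg_micro_ops(), 2))  # ~1.13 micro-ops per x86 instruction
```

Even with a generous guess for the microcoded tail, the decoder's output stream is barely larger than its input, which is the whole point of the design.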
We aren't going to go into great detail on Bobcat's ALU and FPU units (the yellow-orange and turquoise blocks), but they're structured similarly to what you'd expect to find in a higher-end core like Shanghai. When Intel built Atom, it chose to include as few execution units as possible in order to save power. AMD isn't so willing to sacrifice performance.
Bobcat's 512K L2 is 16-way set associative, ECC-protected, and uses "half-speed clocking for power reduction." It's not clear if this means the L2 cache always runs at 50 percent of core clockspeed, or if the cache downclocks itself when it isn't in heavy use.
The new core's pipeline looks like this:
That's 15 stages in total—of the six fetch stages, three are used by the branch prediction unit. AMD is mum on why, citing competitive concerns. Again, this nearly matches Atom's 16-stage pipeline, as do Bobcat's cache latencies: L1 latency is 3 cycles and L2 latency is 17 cycles. Finally, we've got new information on the power-saving technologies AMD adopted with Bobcat.
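Those latency figures plug straight into a standard average-memory-access-time calculation. The hit rates and main-memory latency below are illustrative assumptions on our part, not AMD disclosures:

```python
# Average memory access time (AMAT) using Bobcat's quoted cache latencies.
L1_LATENCY = 3     # cycles (AMD's quoted figure)
L2_LATENCY = 17    # cycles (AMD's quoted figure)
MEM_LATENCY = 150  # cycles -- assumed, for illustration only

def amat(l1_hit=0.95, l2_hit=0.90):
    # Hit rates are assumptions; real workloads vary widely.
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    # Each miss falls through to the next level and pays its latency on top.
    return L1_LATENCY + l1_miss * (L2_LATENCY + l2_miss * MEM_LATENCY)

print(round(amat(), 2))  # ~4.6 cycles for an average load
```

The takeaway: with reasonable hit rates, the 17-cycle L2 adds surprisingly little to the average load, which is why a half-speed L2 can be an acceptable power trade-off.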
By the time Bobcat arrives, Atom will have had the netbook market almost entirely to itself for 2.5 years. Where we once hoped VIA's Nano would introduce competition in the market, it now seems all but certain that AMD will be the first company to do so. We'd be remiss if we didn't note that Ontario actually won't compete with Atom in a large number of markets; Atom was designed specifically to scale into handheld devices and power envelopes Bobcat won't be able to reach. Where the two chips do meet, however, we expect Ontario will outperform Atom.
On the graphics side of things, we'd love to think that 2011 is the year Intel will dazzle us all with a brilliant new Atom-ready GPU, but that's not likely to happen. Interestingly, we might see Intel change its tune about NVIDIA's ION if it feels AMD's future integrated solution is hitting a weak spot in Atom's armor.
Bobcat hasn't gotten as much attention as Bulldozer, but we think the low-power chip is much more likely to have an effect on AMD's bottom line and market share in the next 12 months. If Sunnyvale targets it properly, it could deliver much higher performance than the netbook market is used to at an extremely attractive price. AMD's Brazos platform won't single-handedly rejuvenate AMD's mobile division, but it could change what people expect from a netbook or a low-end notebook.
Brooding Over Bulldozer
AMD still isn't talking about release dates when it comes to the company's next-generation server / workstation / enthusiast processor, but we now know much more than we did about the inner workings of the 'dozer core. This is AMD's first complete redesign since it built the K7 in 1999, and that's an inherently risky process. AMD refers to Bulldozer as a "third way" between symmetric multithreading (Hyper-Threading) and Core Multi-Processing (multiple discrete cores on one die). Specifically, AMD started with two discrete cores, as shown in the left-hand slide below, then fused them together into a single, mostly shared design. This paid off handsomely as far as design efficiency was concerned; adding the second integer core increased the module's die size by just 12 percent.
Two discrete cores on the left; AMD's Bulldozer combination on the right.
We're dubious as to whether AMD's approach qualifies as a "third way"; perhaps it's more aptly characterized as "Hyper-Threading, Evolved." Intel's Hyper-Threading technology improves core efficiency by scheduling multiple threads for simultaneous execution. If the processor is stalled waiting on Thread A, the scheduler is more than happy to crunch away on Thread B in the meantime. This keeps the processor's execution units busier than they'd otherwise be, but standard Hyper-Threading doesn't provide the CPU with any additional execution hardware.
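The mechanism is easy to see in a toy model. This is a deliberately simplified sketch of SMT scheduling, not Intel's actual scheduler: two threads share one execution port, and whenever the active thread stalls, the scheduler issues from the other.

```python
# Toy model of SMT: two threads share one execution port. 'W' means the
# thread has work ready that cycle; 'S' means it's stalled (e.g. on a miss).
def utilization(threads):
    """Fraction of cycles the shared port does useful work."""
    cycles = len(threads[0])
    busy = sum(1 for c in range(cycles)
               if any(t[c] == 'W' for t in threads))
    return busy / cycles

thread_a = "WWSSWWSS"   # stalled half the time
thread_b = "SSWWSSWW"   # stalled in the opposite cycles

print(utilization([thread_a]))            # 0.5 -- one thread alone
print(utilization([thread_a, thread_b]))  # 1.0 -- SMT fills the gaps
```

The interleaved case is the best-case scenario; when both threads want the port at once, they contend for it, which is exactly the limitation Bulldozer's duplicated integer hardware is meant to address.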
With Bulldozer, AMD has taken the concept of SMT and added a second independent integer unit. The following block diagram illustrates which sections of the new chip are shared and which are not. According to AMD, the company aggressively researched which core blocks needed to be duplicated and which could be combined before arriving at the present balance. We're going to circle back around to the ramifications of this decision, but let's take a closer look at the core first.
Under The Hood
Unlike all of AMD's processors since K7, Bulldozer has four x86 decoders. That puts it on par with Nehalem; previous products had just three. As we saw with Bobcat, Bulldozer's branch prediction unit has been aggressively tuned for high performance as well—the branch prediction and instruction fetch logic has been decoupled, which means that an incorrect branch prediction won't stall the fetch unit (and vice versa). Again, that's different than Phenom, where these two units were tied to each other. The CPU's L1 instruction cache is 64K, but the associated data caches are much smaller. Each Bulldozer module has two independent 16K L1 data caches, for a total of 96K of L1 per module (down from Shanghai's 128K).
If Bulldozer's FPU lives up to its promise on paper, AMD's new core could be a floating-point gorilla. The joint FPU unit is capable of tracking two hardware threads (one from each core) and has two MMX integer units and two 128-bit FMAC units. On paper, this looks more-or-less identical to Phenom II's FPU, but AMD assures us that the FPU at the heart of Bulldozer is more capable. What we do know is that Bulldozer adds support for SSE4.1, SSE4.2, and Intel's AVX extensions.
As far as the OS is concerned, each Bulldozer module will appear as a dual-core processor, just as an Intel Hyper-Threaded processor is treated as having 2x its actual number of physical cores. This leads us back to the question of exactly how many cores each Bulldozer module contains. AMD claims that one Bulldozer module delivers 80 percent of conventional dual-core performance "with much less area and power." This strikes us as decidedly optimistic and is undoubtedly highly dependent on workload. It's by no means certain that 'dozer will be able to deliver what it promises; AMD's next-generation SMT will only function optimally if the company has nailed its cache ratios, execution unit distribution, branch prediction, and available memory bandwidth.
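AMD's two headline numbers can at least be checked against each other. Taking the 80-percent-of-dual-core performance claim and the 12-percent die-area figure at face value, the perf-per-area arithmetic looks like this:

```python
# AMD's claim: one Bulldozer module delivers ~80% of two discrete cores'
# throughput, while the second integer core adds only ~12% die area.
# Both inputs are AMD's figures; the comparison is our arithmetic.
DUAL_CORE_PERF = 2.00                 # two discrete cores, normalized
DUAL_CORE_AREA = 2.00
MODULE_PERF = 0.80 * DUAL_CORE_PERF   # AMD's 80% claim -> 1.6
MODULE_AREA = 1.12                    # one core plus ~12% for the second int core

print(round(MODULE_PERF / MODULE_AREA, 2))        # ~1.43 perf per unit area
print(round(DUAL_CORE_PERF / DUAL_CORE_AREA, 2))  # 1.0 for discrete dual-core
```

If both claims hold, a module delivers roughly 43 percent more throughput per unit of die area than two discrete cores would—which is precisely why the whole design hinges on that 80 percent figure surviving contact with real workloads.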
An Uncertain Future
Designing microprocessors is like playing Russian roulette. You put a gun to your head, pull the trigger, and find out four years later if you blew your brains out.
-- Robert Palmer, former CEO of Digital Equipment Corporation.
On paper, Bulldozer looks good. It's a notable jump forward for AMD, it adds features that will help equalize the playing field against Intel, and it incorporates power-saving technologies AMD hasn't previously adopted. If some of the bets AMD has taken pay off, the company could be in a position to compete for the performance crown for the first time in years.
The problem with brand-new architectures, however, is that they often don't turn out as expected. In the past 29 years, Intel has released five separate architectures that either ultimately failed (iAPX 432, i860, Netburst), captured just a fraction of their originally intended markets (IA-64), or were repurposed for a different use (i960). In each case, the company sank years of work and vast amounts of capital into its design efforts. Palmer's quote doesn't just apply to Intel—the last thirty years are littered with the bones of once-powerful companies brought low, at least in part, by betting on the wrong microarchitectural horse.
With Bulldozer, AMD is taking two significant risks. First, the company has chosen to build its first Bulldozer processor on brand-new 32nm production lines. Most companies try to avoid this—Intel's entire Tick-Tock model was deployed after Prescott's simultaneous core debut and process shrink resulted in a power-sucking Netburst furnace as opposed to a Northwood 2.0. The danger here is that problems at the foundry level can cause significant product delays. To be fair, AMD may not have felt it had much choice, given Intel's already substantial process transition lead.
The second risk has to do with just how much of a performance boost Bulldozer can actually deliver compared to Phenom II. When AMD's Phenom core debuted in 2007, it was instantly obvious that the desktop version of Barcelona would be hard-pressed to match Intel's 65nm Conroe Core 2 parts, much less the then-new 45nm Penryn chips. Given what AMD had to work with, Shanghai—Phenom II—was probably the best core it could've been; it definitively surpassed Conroe and matched up well against Penryn.
Then, of course, came Core i7. AMD has been forced to compete by slashing prices and selling larger chips with higher core counts on a much smaller margin than is healthy. This has helped the company achieve price / performance parity in certain multi-threaded workloads, but Shanghai's single-core performance lags Core i7's by a significant margin. With Sandy Bridge dropping in the next few months, the gap between the two is only going to get wider.
For Bulldozer to win back any of the server / high-end market share AMD has lost in recent years, it has to arrive with a nearly perfect mixture of power efficiency, scalability, base clock frequency, and improved single-thread performance. That's a very tall order for any company; there's simply no guarantee that AMD will be able to deliver a chip that meets such high standards the first time around. We're hopeful, but cautiously so.