AMD's Next Gen Steamroller CPU Could Deliver Where Bulldozer Fell Short

Today at the Hot Chips Symposium, AMD's CTO Mark Papermaster is taking the wraps off AMD's upcoming CPU core, codenamed Steamroller. Steamroller is the third iteration of Sunnyvale's Bulldozer architecture and an extremely important part. Bulldozer, launched just under a year ago, was a major disappointment. The company's second-generation Bulldozer implementation, codenamed Piledriver, made a number of important changes and was incorporated into the Trinity APU family that debuted last spring.

Steamroller is the first refresh of Bulldozer's underlying architecture and may finally deliver the sort of performance and efficiency AMD was aiming for when it built 'Dozer in the first place. In the slides below, all of the comparisons and percentage gains are based on Trinity.

With Steamroller, AMD is taking a baby step or two back towards the traditional dual-core model. Here's Bulldozer's Fetch/Decode/Dispatch hardware, as compared to Steamroller's.


Bulldozer and Steamroller Fetch and Decode Architecture

One of Bulldozer's limitations was that it could only decode four instructions per clock per module, for a maximum of 16 instructions per clock in a four module / eight core configuration. That put the chip at a theoretical disadvantage compared to Istanbul (3 instructions/core, 18 total in a six-core configuration) and Sandy Bridge (4 instructions/core, 32 total in an eight-core CPU).
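The totals above are simple multiplication, and a quick sketch makes the per-core gap easier to see. This uses only the figures quoted in the paragraph; the function name and data layout are illustrative assumptions, and the numbers are theoretical peaks, not measured throughput.

```python
# Peak decode throughput per clock, using only the figures quoted above.
# Each entry: (decode blocks, instructions per block per clock, core count).
# Bulldozer has one shared decode block per two-core module; the others
# have one per core. Names and layout are illustrative, not vendor terms.
CONFIGS = {
    "Bulldozer (4 modules / 8 cores)": (4, 4, 8),
    "Istanbul (6 cores)": (6, 3, 6),
    "Sandy Bridge (8 cores)": (8, 4, 8),
}


def peak_decode(configs):
    """Return {name: (total instructions/clock, instructions/clock per core)}."""
    result = {}
    for name, (blocks, width, cores) in configs.items():
        total = blocks * width
        result[name] = (total, total / cores)
    return result


if __name__ == "__main__":
    for name, (total, per_core) in peak_decode(CONFIGS).items():
        print(f"{name}: {total} per clock in total, {per_core:.1f} per core")
```

Seen per core, Bulldozer's shared decoder works out to two instructions per clock when both cores in a module are busy, against three for Istanbul and four for Sandy Bridge.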

It's not clear if Steamroller can actually dispatch more instructions per clock, but a pair of dual-issue dispatch units may be quicker than the single, unified logic block Bulldozer used. BD's unified approach reduced multithreading performance by ~20% compared to a traditional dual-core. Given how much logic the chip shared, a 20% performance penalty isn't bad -- but reducing this penalty is a great place for AMD to recover performance.



Johan De Gelas's excellent in-depth article on Interlagos performance revealed that the L1 instruction cache had taken a nasty efficiency hit compared to older Istanbul-based chips. With both cores per module enabled, the L1 hit rate had fallen to 95% from 97% (the miss rate rose from 3% to 5%, in other words). AMD is "increasing" L1 instruction cache size to compensate -- presumably to 96-128K per module, from 64K in Bulldozer. A 30% reduction in i-cache misses would put the L1 hit rate back in 96-97% territory.
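The hit-rate arithmetic is easy to check. The sketch below assumes the 30% reduction applies to the 5% miss rate observed with both cores per module enabled; the helper names are hypothetical and exist only for this illustration.

```python
# Back-of-the-envelope check of the L1 instruction cache figures quoted above.
# Assumption: the 30% reduction in misses applies to the Interlagos miss rate
# measured with both cores per module enabled.
ISTANBUL_HIT = 0.97
INTERLAGOS_HIT = 0.95
MISS_REDUCTION = 0.30


def miss_rate(hit_rate):
    """Fraction of instruction fetches that miss the L1."""
    return 1.0 - hit_rate


def hit_rate_after(hit_rate, miss_reduction):
    """Hit rate once the miss rate is cut by the given fraction."""
    return 1.0 - miss_rate(hit_rate) * (1.0 - miss_reduction)


if __name__ == "__main__":
    growth = miss_rate(INTERLAGOS_HIT) / miss_rate(ISTANBUL_HIT)
    projected = hit_rate_after(INTERLAGOS_HIT, MISS_REDUCTION)
    print(f"Miss rate grew by a factor of {growth:.2f}")
    print(f"Projected hit rate: {projected:.1%}")
```

With these inputs the miss rate grows by roughly 1.67x going from Istanbul to Interlagos, and cutting misses by 30% brings the projected hit rate to about 96.5%.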

Steamroller L1 Cache and Integer Scheduler Improvements

Interlagos' branch predictor was better than Istanbul's, but still significantly worse than Intel's. A 20% improvement here won't put AMD and Intel on equal footing, but it will boost the Bulldozer family's overall performance.



It's not clear what AMD means by "streamlined execution hardware." Typically that's execu-speak for "We got rid of some stuff," but that may not be a problem here. Sunnyvale is pushing the idea that the GPU effectively becomes the floating-point heavy lifter at some point in the not-too-distant future, and strong FPU performance isn't really driving adoption in any segments where AMD can reasonably expect to compete.

Putting It All Together:

Based on what we know now, Steamroller looks a lot like the CPU Bulldozer should've been. AMD is claiming a 15% performance/watt improvement, and that figure makes sense given what we've seen today. The good news is that another 15% definitely moves things forward for AMD. Trinity's major achievement was its ability to deliver Llano-equivalent performance at moderately less power; Steamroller should finally pull ahead of the old K10 architecture in clock-to-clock efficiency. That's critical -- AMD needs to strengthen its single-thread performance if it wants to compete with Intel in mobile markets.

The downside is that another 15% won't really change competitive positioning. Steamroller's raw performance may match Sandy Bridge, but it's unlikely to compete well against Ivy Bridge or Haswell. This suggests that AMD's ability to gain share in mobile will continue to be performance-constrained. With that said, Steamroller is still hugely important -- it's shaping up to be the first real example of what AMD wanted to accomplish when it opted for a CMT (Clustered Multi-Threading) architecture.



Timing will be critical. Sunnyvale hasn't said when it expects Steamroller to ship beyond a broad "2013" target; an early launch window is infinitely preferable to allowing the core to slip into the back half of 2013. Right now, AMD has made no statements on the Kaveri SoC's launch timeframe (Kaveri is the first APU to integrate a Steamroller core). Sunnyvale's last public roadmap update, last February, indicates that Steamroller won't launch in an independent CPU flavor -- at least not in 2013. The Piledriver core at the heart of Trinity remains the top product in the company's lineup.

If it can launch ahead of Haswell, AMD has a chance to focus the conversation on its cycle of continuous, rapid improvements rather than being defined as an Intel also-ran. Hopefully we'll be able to glean more information from the company's presentations and whitepapers at Hot Chips, but Steamroller is a strong start.