About four months ago, we covered the latest round of shin-kicking between ATI and NVIDIA, with ATI claiming that NVIDIA purposefully crippled CPU performance when running PhysX code and coerced developers to make use of it. NVIDIA denied all such claims, particularly those that implied it used its "The Way It's Meant To Be Played" program as a bludgeon to force hardware PhysX on developers or gamers.
A new report from David Kanter at Real World Technologies has dug into how PhysX is executed on a standard x86 CPU; his analysis confirms some of AMD's earlier statements. In many cases, the PhysX code that runs in a given title is both single-threaded and decidedly non-optimized. And instead of taking advantage of the SSE/SSE2 vectorization capabilities at the heart of every x86 processor sold since ~2005, PhysX calculations are done using ancient x87 instructions.
Before the introduction of SIMD sets like SSE and SSE2, if you wanted to do floating point calculations on an x86 processor, you used the x87 series of commands. In the 11 years since, however, Intel, AMD, and VIA have all adopted SSE and SSE2. Both extensions allow for much higher throughput than the classic x87 instruction set; given the ubiquity of support across the PC market, it's hard to see why NVIDIA hasn't specifically mandated their use.
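As a rough illustration of the difference, consider the sort of loop a physics engine runs millions of times per frame. This is a hypothetical sketch, not actual PhysX code, but on 32-bit x86 the same C source can be steered toward x87 or SSE purely with GCC compiler flags:

/* integrate.c -- hypothetical example, not actual PhysX code.
 * One Euler integration step: pos += vel * dt for n particles.
 *
 * gcc -m32 -O2 -mfpmath=387 integrate.c        -> x87 stack ops (fld/fmul/faddp)
 * gcc -m32 -O2 -msse -mfpmath=sse integrate.c  -> scalar SSE (mulss/addss)
 * gcc -m32 -O3 -msse -mfpmath=sse integrate.c  -> can auto-vectorize to packed
 *                                                 SSE (mulps/addps)
 */
void integrate(float *pos, const float *vel, float dt, int n)
{
    for (int i = 0; i < n; i++)
        pos[i] += vel[i] * dt;
}

The packed version retires four single-precision operations per instruction; that's the throughput Kanter argues PhysX is leaving on the table.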
When in doubt, blame the PPU.
As RWT's analysis shows, however, virtually all of the applicable uops in both Cryostasis and Soft Body Physics use x87; SSE accounts for just a tiny percentage of the whole. Toss in the fact that CPU PhysX is typically single-threaded while GPU PhysX absolutely isn't, and Kanter's data suggests that NVIDIA has consciously chosen to avoid any CPU optimizations, and, in so doing, has artificially widened the gap between CPU and GPU performance. If that allegation sounds familiar, it's because we talked about it just a few weeks back, after Intel presented a whitepaper claiming that many of the test cases NVIDIA used to claim huge GPU performance advantages were unfairly optimized.
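To give a sense of what single-threading leaves on the table, here's a minimal sketch of spreading a physics-style loop across every available core with OpenMP. It assumes each object can be updated independently, which real solvers, with their contacts and constraints, don't always enjoy; none of this is PhysX code:

#include <omp.h>

/* Hypothetical per-object update -- not PhysX code. Real physics solvers
 * have contact and constraint dependencies that make parallelization
 * harder than this sketch implies. Build with: gcc -O2 -fopenmp */
void step_objects(float *pos, float *vel, float gravity, float dt, int n)
{
    /* Each iteration touches only its own object, so the loop can be
     * split across all cores with no locking. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        vel[i] += gravity * dt;  /* apply gravity to velocity */
        pos[i] += vel[i] * dt;   /* advance position          */
    }
}

On a quad-core CPU, even that one-line pragma can come close to quadrupling the throughput of such a loop.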
We spoke to NVIDIA regarding the state of its PhysX SDK and why Kanter's evaluation shows so little vectorization. If you don't want to dig through all the details, the screenshot below from The Incredibles summarizes NVIDIA's response quite well.
For those of you who want a more detailed explanation, keep reading:
We're not happy, Dave. Not. Happy.
In 2004, Ageia acquired a physics middleware company named NovodeX. Back then, what we now call PhysX was a software-only solution, similar to Havok. Ageia's next step was to build a PPU (Physics Processing Unit) that could accelerate PhysX in hardware. This hardware-accelerated version of the SDK was labeled Version 2, but while it added PPU acceleration, the underlying engine was still using NovodeX code. According to the former Ageia employees still on staff at NVIDIA, NovodeX had begun building the original SDK as far back as 2002-2003.
By the time NVIDIA bought Ageia in 2008, Ageia had already ported PhysX to platforms like the Xbox 360 and the PS3. NVIDIA's first goal was to port PhysX over to the GPU, and it logically focused its development in that area. According to NVIDIA, it's done some work to improve the SDK's multithreading capabilities and general performance, but there's a limit to how much it can do to optimize an eight-year-old engine without breaking backwards compatibility.
Why The Timeline Matters:
If we accept NVIDIA's version of events, the limitations Kanter noted make more sense. Back in 2002-2003, Intel was still talking about 10GHz Pentium 4s, multi-core processors were a dim shadow on the horizon, and a significant chunk of gamers/developers owned processors that didn't support SSE and/or SSE2.
One thing NVIDIA admitted to us when we talked to the company's PhysX team is that it's spent significantly more time optimizing PhysX to run on the Xbox 360's Xenon and the PS3's Cell processor than on the x86 platform. As far as Cell is concerned, there are good technological reasons to do so. If you hand the Cell code that's been properly tuned and tweaked, it can blow past the fastest x86 processors by an order of magnitude. If these optimizations aren't performed, however, the Broadband Engine's throughput might make you wish for a 486.
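Cell's SPEs are their own programming world, but the flavor of the tuning involved is familiar from x86 SIMD work: reorganize your data so the hardware can chew through four values at a time. Here's a hypothetical structure-of-arrays sketch using SSE intrinsics; none of it is PhysX's actual code or layout:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Structure-of-arrays layout: all x positions contiguous, all x
 * velocities contiguous, and so on. SIMD hardware -- SSE here, the
 * Cell's SPEs in the console case -- rewards exactly this kind of
 * reorganization. Hypothetical sketch, not PhysX's actual layout. */
struct particles {
    float *x, *vx;  /* 16-byte aligned; y/z arrays handled the same way */
    int n;          /* particle count, assumed to be a multiple of 4    */
};

void integrate_x(struct particles *p, float dt)
{
    __m128 vdt = _mm_set1_ps(dt);  /* broadcast dt into all four lanes */
    for (int i = 0; i < p->n; i += 4) {
        __m128 x = _mm_load_ps(&p->x[i]);
        __m128 v = _mm_load_ps(&p->vx[i]);
        /* four particles advanced per instruction instead of one */
        _mm_store_ps(&p->x[i], _mm_add_ps(x, _mm_mul_ps(v, vdt)));
    }
}

Skip that sort of work and you're back to processing one value at a time; on the SPEs, which are built around SIMD, the penalty is far more brutal than on any x86 chip.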
In theory, properly optimized PhysX could make the image on the left look much more like the GPU PhysX image created on the right.
Other factors are at work as well: the majority of game development is done with consoles in mind, and NVIDIA wants PC users to buy GPUs because of PhysX, which naturally leaves it less interested in optimizing CPU PhysX.
Modernized SDK Under Development:
It'll be a while, but we'll eventually find out whether NVIDIA is purposefully maintaining deprecated standards, or if the problem has more to do with the age of the company's development API. NV isn't giving out any release dates, but the company is hard at work on a new version of the PhysX SDK. Rather than trying to continually patch new capabilities into an old code base, the PhysX team is "rearchitecting" the entire development platform.
In theory, this revamp will address all of the issues that have been raised regarding x86 performance, though it may still be the developer's responsibility to use and optimize certain capabilities. Even after version 3.xx is available, we'll have to wait for games that make full use of it, but if NVIDIA's been sincere, we'll see a difference in how modern CPUs perform.
NVIDIA At A Crossroads:
One point everyone agrees on is that NVIDIA has no obligation to AMD or Intel to optimize PhysX for CPUs and/or competing GPUs. It wouldn't surprise us if NV feels quite strongly on this point; the company has spent the last four years pushing for GPU-accelerated physics, software, and consumer applications. Whether you like the concept or not, NVIDIA was unquestionably out in front when it came to offering tools for GPU programming; we can't blame them for feeling a bit like the Little Red Hen.
I bought the company, I built the hype, and I wrote all the new code!
Unfortunately, NVIDIA is caught within a form of the Prisoner's Dilemma. The PD is a game theory model describing how individuals acting in what they perceive to be their own best interest can arrive at the worst outcome. In this case, NVIDIA wants to monetize PhysX/CUDA, and the best way to do that is to encourage developers to use them. The snag is that software developers have a very long history of only using languages and features that they know will be supported by as much hardware as possible.
The best way to encourage people to buy NVIDIA GPUs is to ensure that the special effects are amazing and only available to NVIDIA customers. Optimizing PhysX to run on an x86 CPU potentially dilutes the attractiveness of an NVIDIA GPU, and increases the chance that customers will keep their existing cards or use a competitor's product. It could also have an impact on the company's nascent Tegra platform; NVIDIA has good reason not to optimize PhysX for Atom.
Except it's not that simple. We've already said that developers tend to support standards that a wide range of hardware can utilize, which means it is in NVIDIA's best interest to optimize PhysX for all sorts of hardware. The more platforms that run PhysX well, the more developers will use it. The better PhysX runs on the CPU, however, the smaller the chance that developers will go to the extra effort of utilizing the hardware-accelerated flavor, and the fewer consumers will opt to buy a GPU for whiz-bang special effects.
NVIDIA's claims about improvements notwithstanding, benchmarks and Kanter's investigation have confirmed that the vast majority of games that use hardware PhysX today aren't optimized for CPU execution and drop to a stuttering crawl when tasked with it. Whose fault that is, NVIDIA's or the developers', is still an open question. The larger point is that NVIDIA may soon have to choose between establishing CUDA and PhysX as a ruling standard and attempting to use them as a selling point for GPUs. Thus far, the company has tried to do both simultaneously, but we wonder if it can keep doing so for much longer.