The recent announcement that Larrabee has been repurposed as an HPC/scientific computing solution may therefore be partially responsible for Intel ramping up an offensive against NVIDIA's claims regarding GPU computing. At the International Symposium On Computer Architecture (ISCA) this week, a team from Intel presented a whitepaper purporting to investigate the real-world performance delta between CPUs and GPUs. From the paper's abstract:
In the past few years there have been many studies claiming GPUs deliver substantial speedups ... over multi-core CPUs ... [W]e perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average.
Such statements appear to have cheesed NVIDIA off; the company published a short blog post yesterday as a public reply to Intel's allegations. In it, the general manager of GPU Computing, Andy Keane, makes a good point when he questions whether the CPU optimizations Intel used in its tests are indicative of real-world performance scenarios.
Intel's own paper indirectly raises this question when it notes:
The previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X.
This implies that there's been a whole lot of optimization and hand-tuning, with no guarantee that this work could be duplicated by a representative group of 'real-world' programmers using standard dev tools and compilers. Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.
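Intel's remark that LBM ends up limited by memory bandwidth on both chips can be sanity-checked with simple arithmetic: once a kernel is bandwidth-bound on both sides, the achievable gap is capped at roughly the ratio of the two memory bandwidths. The sketch below is illustrative only. The function name is hypothetical, and the two bandwidth figures are assumed peak spec-sheet values that do not come from the article or from NVIDIA's reply; sustained bandwidth in practice is lower on both parts.

```python
def bandwidth_bound_speedup(gpu_bw_gbs: float, cpu_bw_gbs: float) -> float:
    """Rough ceiling on GPU-over-CPU speedup when both are memory-bandwidth bound."""
    if gpu_bw_gbs <= 0 or cpu_bw_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    return gpu_bw_gbs / cpu_bw_gbs


# ASSUMED peak memory bandwidths in GB/s (illustrative, not from the article).
GTX280_BW_GBS = 141.7
CORE_I7_960_BW_GBS = 32.0

ratio = bandwidth_bound_speedup(GTX280_BW_GBS, CORE_I7_960_BW_GBS)
print(f"Bandwidth-bound ceiling: about {ratio:.1f}x")
```

Under those assumed figures the ceiling comes out between 4x and 5x, which is in the same neighborhood as the 5X gap Intel reports and a long way from 114X.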
Snatching Bad Logic From The Jaws Of Victory
Unfortunately, Keane, having just raised legitimate points, begins unraveling his own argument. In reference to GPU vs. CPU performance he writes that "The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements...Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts...In contrast, the CUDA parallel computing architecture from NVIDIA is a little over 3 years old and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10 to 100x using NVIDIA GPUs."
There are two major problems with Keane's statements. First, Intel's whitepaper neither claims that parallel programming is easy for anyone, nor advances the argument that parallel programming is easier for CPUs than for GPUs. This is a classic example of a straw man fallacy. Second, and arguably more important, is Keane's implication that optimizing code for a multicore x86 CPU requires teams of experts, while CUDA, just three years old, delivers 10-100x performance increases when CPU code is ported to run on an NVIDIA GPU. This isn't the first time we've seen NVIDIA make decidedly odd claims about parallelization, but there's no magical multicore fairy that makes all programs run faster on GPUs.
The reason we've seen such dramatic results in some instances is that some workloads--especially scientific workloads, which are often embarrassingly parallel--allow a GPU to process a huge number of calculations simultaneously. There are also consumer-level tasks, like video encoding, which modern GPUs are very good at. The implication that GPUs from any company are poised to kick performance into the stratosphere, however, is entirely untrue. Parallelization is damnably hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.