Intel, NVIDIA Slug It Out Over CPU vs GPU Performance

This post has 18 Replies | 2 Followers

Top 10 Contributor
Posts 24,884
Points 1,116,875
Joined: Sep 2007
Forums Administrator
News Posted: Fri, Jun 25 2010 2:55 PM
Over the past four years, NVIDIA has made a great many claims regarding how porting various types of applications to run on GPUs instead of CPUs can tremendously improve performance by anywhere from 10x-500x. Intel, unsurprisingly, sees the situation differently, but has remained relatively quiet on the issue, possibly because Larrabee was going to be positioned as a discrete GPU.

The recent announcement that Larrabee has been repurposed as an HPC/scientific computing solution may therefore be partially responsible for Intel ramping up an offensive against NVIDIA's claims regarding GPU computing. At the International Symposium On Computer Architecture (ISCA) this week, a team from Intel presented a whitepaper purporting to investigate the real-world performance delta between CPUs and GPUs. From the paper's abstract:
In the past few years there have been many studies claiming GPUs deliver substantial speedups ...over multi-core CPUs... We perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average.
Such statements appear to have cheesed NVIDIA off; the company published a short blog post yesterday as a public reply to Intel's claims. In it, Andy Keane, NVIDIA's general manager of GPU Computing, makes a good point when he questions whether the CPU optimizations Intel used in its tests are indicative of real-world performance scenarios.

Intel's own paper indirectly raises this question when it notes:
The previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X.
This implies that there's been a whole lot of optimization and hand-tuning, with no guarantee that this work could be duplicated by a representative group of 'real-world' programmers using standard dev tools and compilers. Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.
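To put the bandwidth argument in concrete terms, here's a minimal sketch (our illustration, not code from either party) of the sort of streaming kernel an LBM update ultimately boils down to. For kernels like this the achievable speedup is roughly capped by the ratio of memory bandwidths; assuming the stock figures of roughly 142 GB/s for the GTX 280 and 25.6 GB/s for a triple-channel DDR3-1066 Core i7 960, that ceiling works out to about 5.5x, which lines up with the 5x Intel reports.

// Minimal bandwidth-bound CUDA kernel: one load and one store per thread,
// almost no arithmetic, so memory bandwidth, not FLOPS, sets the speed.
__global__ void stream_scale(const float *in, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// Host-side launch (error checking omitted):
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   stream_scale<<<blocks, threads>>>(d_in, d_out, 2.0f, n);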

Snatching Bad Logic From The Jaws Of Victory

Unfortunately, Keane, having just raised legitimate points, begins unraveling his own argument. In reference to GPU vs. CPU performance he writes that "The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements...Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts...In contrast, the CUDA parallel computing architecture from NVIDIA is a little over 3 years old and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10 to 100x using NVIDIA GPUs."

There are two major problems with Keane's statements. First, Intel's whitepaper neither claims that parallel programming is easy for anyone, nor advances the argument that parallel programming for CPUs is easier than GPUs. This is a classic example of a straw man logical fallacy. Second, and arguably more important, is Keane's implication that optimizing code for a multicore x86 CPU requires teams of experts, while CUDA, just three years old, delivers 10-100x performance increases when CPU code is ported to run on an NVIDIA GPU. This isn't the first time we've seen NVIDIA make decidedly odd claims about parallelization, but there's no magical multicore fairy that makes all programs run faster on GPUs.

The reason we've seen such dramatic results in some instances is that certain workloads--especially scientific workloads, which are often embarrassingly parallel--allow a GPU to process a huge number of calculations simultaneously. There are also consumer-level tasks, like video encoding, that modern GPUs are very good at. The implication that GPUs from any company are poised to kick performance into the stratosphere, however, is entirely untrue. Parallelization is damnably hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.
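As a rough illustration of that divide (a sketch of ours, not code from either vendor): an element-wise operation has no dependence between iterations and maps directly onto thousands of GPU threads, while a loop-carried recurrence forces every step to wait on the previous one and gains little from extra cores of any kind.

// Embarrassingly parallel: each output depends only on its own input,
// so every GPU thread can process one element independently.
__global__ void square_each(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// Serially constrained: each step depends on the previous result, so this
// loop cannot simply be spread across 240 stream processors.
float running_filter(const float *in, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = 0.5f * acc + in[i];   // loop-carried dependence on acc
    return acc;
}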
Top 10 Contributor
Posts 8,439
Points 102,180
Joined: Apr 2009
Location: Shenandoah Valley, Virginia
Membership Administrator
Moderator
realneil replied on Sat, Jun 26 2010 9:51 AM

I just want a computer that works well. I want one that plays the games that I like and can afford as smoothly as possible. I have an NVIDIA card solution and also an ATI card solution. Coupled with the i7 and the i5 CPUs I'm using, I get just that: smooth power without any hiccups.

I built an AMD Phenom II X3 720-based system with a Radeon 5670 card in it for not much cash at all and it's a solid performer as well. If it had a high-powered video card in it like the other two systems, it would be on a par with both of the Intel boxes. AMD solutions are nothing to sneeze at and they're very affordable too.

These two companies can see whose stream is golden and whether or not it goes farther than the other's all day long. I couldn't care less.

Everything that's happened in the past few years has translated into some very nice computers being available to us at decent prices. And I'm glad of that.

Don't part with your illusions. When they are gone you may still exist, but you have ceased to live.

(Mark Twain)

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Sat, Jun 26 2010 8:20 PM

Real, that's not very forward looking of you! Bad consumer! Bad consumer! *spankspank*

Top 25 Contributor
Posts 3,477
Points 53,915
Joined: Jul 2004
Location: United States, Massachusetts
Forums Administrator
Membership Administrator
Dave_HH replied on Sat, Jun 26 2010 9:38 PM

Give it up for the Scarecrow here, though, come on! :)

Editor In Chief
http://hothardware.com


Top 10 Contributor
Posts 6,371
Points 80,285
Joined: Nov 2004
Location: United States, Arizona
Moderator

I like the flower in the first quote. When you view the article from the main page it does not show up. But in the forum it does...lol .. Wilted Flower 

"Never trust a computer you can't throw out a window."

2700K

Z77 GIGABYTE G1.SNIPER

GIGABYTE GTX670

G.Skill Ripjaws X 16gb PC2133

Antec P280

Corsair H100

Asus Blu-ray burner

Seasonic X650 PSU

Patriot Pyro 128gb SSD

Not Ranked
Posts 1
Points 5
Joined: Jun 2010
KKumar replied on Sun, Jun 27 2010 11:37 AM

Intel may offer parallel performance .... but at what price? That's the only real consideration.

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Sun, Jun 27 2010 12:04 PM

KKumar,

I'm not sure what you mean, especially if we view the situation in historical context. Six years ago, a dual-socket motherboard was easily $450-$500; the Tyan S2895 workstation board I used for several years was, IIRC, a $1200 board. Quad-CPU motherboards were even more expensive--think $2000-$4000. The CPUs that ran in these boards also commanded massive premiums, even after AMD entered the market. A dual AMD Opteron board + 2 CPUs might be $2000-$3500; a Xeon MP configuration might run upwards of $5K.

Some of these initial figures may be a bit off, but consider them in context with modern prices. A *nice* 890GX board from Asus is $140. A quad-core AMD Phenom 955 is $159; a six-core 3.2GHz 1090T is $295. Intel's prices aren't quite so nice, but the quad-core i7 is $279 while a solid i7 motherboard is ~$200.

AMD's ratios are better, but even Intel's prices are less than 10% of what they were five years ago at the quad-core level. Cores have become insanely cheap as the amount of supporting circuitry and hardware needed to use them has shrunk.

Top 10 Contributor
Posts 8,439
Points 102,180
Joined: Apr 2009
Location: Shenandoah Valley, Virginia
Membership Administrator
Moderator
realneil replied on Sun, Jun 27 2010 8:37 PM

Yeah Joel, what I said.

Good shtuff, less money.

wait long time to upchuck more cash.

Good enough!

Don't part with your illusions. When they are gone you may still exist, but you have ceased to live.

(Mark Twain)

Not Ranked
Posts 2
Points 25
Joined: Jun 2010
kisai replied on Sun, Jun 27 2010 9:02 PM

There is a 120x speedup from a [single core of a] Core 2 Duo to a 500MHz ATI 5750 (400 stream processors).

That result can be replicated with the BOINC project "Collatz Conjecture", which is already optimized both for CPUs and for the GPU vendors' proprietary APIs.
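The Collatz search is about as embarrassingly parallel as it gets: every starting value can be iterated independently of every other. A minimal CUDA sketch of the idea (not the BOINC project's actual code) looks something like this:

// Each thread iterates the Collatz map for one starting value and records
// how many steps it takes to reach 1. No thread ever depends on another.
// (Assumes starting values >= 1 and no 64-bit overflow along the way.)
__global__ void collatz_steps(unsigned long long first, unsigned int *steps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned long long x = first + i;
    unsigned int count = 0;
    while (x != 1ULL) {
        x = (x & 1ULL) ? 3ULL * x + 1ULL : x >> 1;
        ++count;
    }
    steps[i] = count;
}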

Try porting something serially constrained like zlib to a GPU and you'll quickly find that you have to break backwards compatibility or sacrifice some of the compression gains in order to use it.

If Intel wants developers to use their CPUs in an optimized manner to compete with GPUs, they should provide source code (C/ASM) for their x86/x86-64 versions of the kinds of tasks the CPU excels at. Otherwise they have no reason to complain that developers don't know how to program their processors efficiently.

Not Ranked
Posts 1
Points 5
Joined: Jun 2010
dkrnic replied on Mon, Jun 28 2010 3:35 AM

Divide and conquer is the method, just like in Iraq and Afpak.

 

zlib IS serially constrained, but not severely. Pathologically compressible patterns DO suffer from being split up into smaller chunks before being compressed (on account of redundant dictionaries), but ordinary data forgoes only a few percent of the theoretical size reduction of a given algorithm. However, the space is not wasted - slicing into conveniently sized chunks simplifies recovery from data corruption while at the same time trivializing navigation within the compressed material, obviously something to recommend it even if your actual constraint is hardware, e.g. a lonesome single core. More can be gained by using tighter deflators, which may be prohibitively CPU-intensive for a Wintel box, than can be lost by slicing and splicing across the massively parallel shaders of a humble GPU.
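A bare-bones sketch of that chunking, using zlib's ordinary one-shot API (illustrative only, and CPU-side; a GPU codec would replace the inner call): each slice is compressed with its own dictionary, so the slices are fully independent and can be farmed out to as many cores as you have.

#include <zlib.h>
#include <cstddef>
#include <vector>

// Compress a buffer in fixed-size, independent chunks. Because no chunk
// depends on any other, each compress2() call could run on its own core;
// the cost is one fresh dictionary per chunk.
std::vector<std::vector<unsigned char>>
compress_in_chunks(const unsigned char *data, size_t len, size_t chunk = 1 << 20)
{
    std::vector<std::vector<unsigned char>> out;
    for (size_t off = 0; off < len; off += chunk) {
        uLong src_len = (uLong)(len - off < chunk ? len - off : chunk);
        uLongf dst_len = compressBound(src_len);
        std::vector<unsigned char> dst(dst_len);
        if (compress2(dst.data(), &dst_len, data + off, src_len,
                      Z_BEST_COMPRESSION) == Z_OK) {
            dst.resize(dst_len);
            out.push_back(std::move(dst));
        }
    }
    return out;
}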

 

As an admin I'm a big fan of compression. If an AMD HD1150 or an nVidia GTX8200WTF could give me "only" 2.5 times more compression per second than the 24 cores of an i9 Intel HPC, I'd choose half a dozen GPUs and an AM3 Thuban without blinking. Given that Intel compared apples and oranges - its latest in the EXTREME series against a rather modest model from the competition - I'd probably end up with a much bigger speed gain anyway. Even at 500 times the speed or more, compression is far from being memory bandwidth-constrained.

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Mon, Jun 28 2010 12:20 PM

Kisai,

Just how fast is the C2D? I think it's important to note what the exact clockspeeds are when making these sorts of comparisons so as to not litter the floor with a bunch of data that only applies in certain cases.

Even if we assume perfectly linear scaling by core and by clockspeed, obviously a C2D won't catch a 5750--but it's possible that the normalized comparison between the two is significantly lower than 120x once those factors are adjusted for. For instance, a 120x result measured against a single core at 2GHz already shrinks to roughly 40x against both cores running at 3GHz.

Not Ranked
Posts 2
Points 10
Joined: Jun 2010

...registered just to comment.

As a grad student with a background in rendering (offline and real-time) and mathematics, anytime I sit down to write computationally intensive code I do so using DirectX, knowing that my GPU will kick my CPU to the curb every time. This was an annoying chore with DirectX 8 and 9 class hardware, but with DirectX 10 and especially 11 it's a real joy. Finite element methods, fluid dynamics, signal processing, anything that MathCAD or Maple does, and so on are all things that GPUs excel at.

Speed improvements using CUDA are not as dramatic as simply using DirectX or OpenGL directly, and that's probably part of the problem. Getting a university's resident math and physics geeks to stop using Fortran and start using C/C++ and CUDA is hard enough; getting them to use an alien API built specifically for graphics in order to see real improvements is basically a religious debate (i.e., not possible). I suppose that's understandable; people want to spend time doing research and solving problems, not learning some obscure and endlessly changing API.

I can't really speak for 'real' scientific applications designed to run on massive mainframes, but I'm inclined to believe that GPUs will have something to offer in the near future. It's a bummer that the best way to leverage a GPU is through DirectX, which makes it inapplicable to industrial-strength computing environments. There's also the issue of the IEEE floating point standard. When it comes to IEEE compliance, my GeForce 8800 loves to make unpredictable and seemingly random deviations; run the same code on a different card and you'll get a different set of anomalies. If you need double precision floats, you have to bypass the GPU altogether. If ATI and Nvidia can address these problems, and a lot more people put the time in to understand how your average GPU works, the price vs. performance ratio of scientific computing environments stands to take a huge drop, and that's always a good thing.
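On the CUDA side (a different path from the DirectX one described above), whether a card has native double precision at all can be read off its compute capability: G80/G92 parts like the 8800 report 1.0/1.1, and hardware doubles only arrived with compute capability 1.3 on GT200. A minimal check, purely as an illustration:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    // Compute capability 1.3 (GT200) was the first with native fp64 units.
    bool has_fp64 = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
    std::printf("%s: compute %d.%d, native double precision: %s\n",
                prop.name, prop.major, prop.minor, has_fp64 ? "yes" : "no");
    return 0;
}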

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Mon, Jun 28 2010 2:01 PM

Nonoptimal,

Thanks for registering and dropping in, we always like to hear from folks actually doing this kind of work. I've got a question for you regarding DP FPU calculations--I've always heard that the G80/G92/GT200 cards took a heavy hit when doing DP as opposed to SP FPU work, but this is the first I've heard that the cards themselves turn out results that are incorrect. Have you ever had the opportunity to test this in cards past the 8800?

Not Ranked
Posts 2
Points 10
Joined: Jun 2010

Joel H

Unfortunately no, school is expensive and so are video cards. I’ve played with DirectX 11 through software emulation which obviously gives you IEEE compliant behavior but that’s no guarantee that DirectX 11 hardware will do the same… though I suspect it will. I’ve never heard ATI or Nvidia explicitly state their hardware is IEEE compliant but the DirectX documentation makes it sound like a card must be in order to claim proper DirectX 11 compatible status.

I wouldn't really say DirectX 9 and 10 hardware occasionally produce incorrect results; it's just that in some cases the rounding behavior and the handling of floating point specials don't follow IEEE rules. That's a big deal for an engineer who is accustomed to writing code that deals with overflow events, divisions by zero and what have you by depending on very specific, standardized behavior.

Support for 64-bit floats in DirectX 10 is non-existent (as far as I know, anyway), and while it does exist in DirectX 11 it comes with some heavy limitations. You can run shader programs with DP but you can't store the output without first converting back to SP. This is extremely limiting, as most applications use multiple shader programs to successively iterate over the same piece of data (numerical integration with intermediate results stored as texture data). There might be some card out there that supports double precision texture data via some extension, but it's not in any of the DirectX 11 documentation I've seen.

Most of the performance hit from DP is due to running out of temporary registers, because you need twice as many to do the same amount of work. A card might have 1024 stream processors, but if your shader program uses a large number of temporary registers you are going to get a lot of idle streams sitting around waiting for registers to become available. Converting to single precision (and, if you can get away with it, half precision) is in some cases faster not because the actual fp ops are quicker but because you have more registers to work with, and that equates to more active streams.
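The same effect is easy to see on the CUDA side (the above describes Direct3D, but the register arithmetic is analogous): the only difference between these two kernels is the element type, yet every double occupies a pair of 32-bit registers, so the fp64 version leaves fewer registers free per thread. Compiling with nvcc -Xptxas -v prints the register count each kernel actually uses.

// Identical arithmetic in single and double precision. The double version
// needs two 32-bit registers per value, which is what eats into occupancy.
__global__ void axpy_float(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

__global__ void axpy_double(const double *x, double *y, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Build with: nvcc -Xptxas -v kernels.cu   (reports registers used per kernel)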

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Mon, Jun 28 2010 4:31 PM

Nonoptimal,

Have you looked at any of the information on Fermi? I know one of its core features--or at least, one of its core features on the workstation "Tesla" cards--will be DP performance that's far higher than what any card before it has managed.

Not Ranked
Posts 1
Points 5
Joined: Jun 2010
BParrella replied on Mon, Jun 28 2010 5:16 PM

Realneil: Ignorance is bliss. But you have to remember, the ignorant don't live very long :P

Top 500 Contributor
Posts 119
Points 1,405
Joined: Mar 2010
Location: San Francisco, CA
dlim783 replied on Thu, Jul 1 2010 1:00 PM

That's more like Price vs. Performance: AMD/ATI vs. Intel/Nvidia. Competition is what's killing the marketing system. E.g., GTX 480 (power-hungry) vs. HD 5870 (quieter, cooler, faster).

Top 100 Contributor
Posts 1,038
Points 11,350
Joined: Jul 2009
Joel H replied on Thu, Jul 1 2010 5:57 PM

I'll give you quieter and cooler, but it's my understanding that the HD 5870 and GTX 480 are pretty well matched as far as performance is concerned.

Top 500 Contributor
Posts 119
Points 1,405
Joined: Mar 2010
Location: San Francisco, CA
dlim783 replied on Thu, Jul 1 2010 7:05 PM

Yeah, by a small margin...
