When you visit NVIDIA's Web site and open the Products drop-down menu, a long list of the company's offerings scrolls down in front of you. That's an impressive range for an organization that originally made its name designing the fastest desktop display adapters.
The composition of the fastest 500 systems, tracked at top500.org, has changed dramatically since the list began in 1993. Back then, symmetric multiprocessing and massively parallel processing (MPP) machines dominated the scene. And while some of the most powerful supercomputers today still leverage the strengths of MPP architecture, a majority of the top 500 now employ clusters of commodity x86 processors. That's not a bad way to fill a few racks' worth of space while sucking down hundreds of kilowatts of power.
Rise of the GPU
The thing about multi-core CPUs is that they're most effective when the workload isn't known ahead of time, thanks to tight execution pipelines and large caches. When the problem being addressed is well defined, as is often the case in HPC applications, the many-core engines driving the fastest supercomputers are much more potent.
Care for an example of a many-core processor in action on your desktop? Just look to the GeForce 8-series or Radeon HD 3000-series cards. Each centers on an architecture designed for the extremely parallel problem of graphics rendering. However, with the help of a software programming environment called CUDA, NVIDIA is showing the HPC market how to take advantage of its GPU’s massive floating-point horsepower for more general purpose computing. One specialized application at a time, developers seem to be catching on.
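The programming model is easiest to see in miniature. The sketch below is plain Python, not CUDA: it mimics the thread-per-element pattern a CUDA kernel uses, with an ordinary loop standing in for thousands of concurrent GPU threads (the SAXPY operation here is our own illustration, not an NVIDIA sample).

```python
# Plain-Python sketch of the data-parallel pattern CUDA targets.
# On the GPU, each "thread" below runs concurrently, one per element;
# the Python loop merely stands in for that hardware parallelism.
def saxpy(a, x, y):
    """Compute a*x + y element-wise, one 'thread' per index."""
    n = len(x)
    out = [0.0] * n
    for tid in range(n):        # tid plays the role of a CUDA thread index
        out[tid] = a * x[tid] + y[tid]
    return out
```

A real CUDA kernel expresses only the body of that loop; the hardware schedules thousands of such threads across the GPU's streaming processors.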
The latest example of NVIDIA's many-core processors taking the place of computing clusters comes from the University of Antwerp, where researchers studying tomography (imaging by sections, as in CT scans) built a desktop machine with four GeForce 9800 GX2 graphics cards. According to Dr. K. J. Batenburg, a researcher on the project, just one of the eight GPUs in his system (named FASTRA) is up to 40 times more powerful than a PC in the small cluster he previously had running tomographic reconstructions.
We recently had the opportunity to visit NVIDIA’s Santa Clara campus for more information on the company’s HPC efforts, and learned from Dan Vivoli, executive vice president of marketing at NVIDIA, that FASTRA is equivalent to about eight tons of rack-mount equipment using 230KW of power, all in a package that’d cost roughly $6,000 to build. During the course of our day we heard from five of NVIDIA’s partners who are now using the company’s graphics technology to accelerate complex problems that were either not possible to affordably solve before, or simply took much longer to address than they do now.
NVIDIA's Tesla 10-Series, Exposed
But we're getting ahead of ourselves. Before NVIDIA rolled out the red carpet for its academic and commercial admirers, it introduced a new development in its HPC product lineup promising to more than double the previous generation's performance numbers.
Tesla is already for sale. You can hop onto NVIDIA's online store and pick up a Tesla C870 for $1,299. The card sports specifications that look a lot like those of the Quadro FX 5600 (priced at $2,999 on the same site), including 1.5GB of GDDR3 memory, 128 onboard streaming processor cores, and a PCI Express x16 interface. There's no display output, though; the Tesla is exclusive to the HPC market.
For clarity’s sake, NVIDIA classifies Quadro as a superset of Tesla, armed with all of the same features, plus graphics, which is why you pay an extra $1,700 for a Quadro FX 5600 card. All three product lines—GeForce, Tesla, and Quadro—are armed with similar silicon and thus equipped to power through applications enabled by CUDA.
Tesla 10-Series, Uncovered
The big news at NVIDIA is a second generation of Tesla products based on the 10-series GPU that you’ll also see driving fresh desktop and workstation cards. The 10-series chip is massive, boasting 240 processing cores (nearly 2x its predecessor), 1.4 billion transistors (again, close to double), and close to 1 teraflop of peak single-precision processing power (you guessed it—twice that of the C870 board).
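Those peak numbers follow from simple arithmetic. Assuming NVIDIA's customary peak counting of three single-precision flops per core per clock (a multiply-add plus a multiply) and a shader clock in the neighborhood of 1.44 GHz (our assumption for the top part; shipping boards vary), the math works out like this:

```python
cores = 240                 # thread processors on the 10-series GPU
flops_per_clock = 3         # MAD + MUL per core per clock (NVIDIA's peak counting)
shader_clock_hz = 1.44e9    # assumed top shader clock; actual boards vary
peak_tflops = cores * flops_per_clock * shader_clock_hz / 1e12
print(peak_tflops)          # ~1.04, i.e. "close to 1 teraflop"
```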
If you've read reviews of the new GeForce boards, then you already have the scoop on Tesla's T10 processor. The chip employs a SIMT (single instruction, multiple thread) architecture, which lets software developers think in terms of functions and threads rather than vectors. As mentioned, it wields 240 thread processors, broken up into 30 Thread Processor Arrays (TPAs) of eight thread processors each.
And whereas the previous generation topped out at 1.5 GB of memory, the 10-series supports up to 4GB. If that sounds excessive, consider that HPC datasets often include terabytes' worth of information. And we talked to a couple of different developers at the NVIDIA event who said they were holding out for a 4GB card before diving into this latest generation, specifically because it'd give them the most palpable gains. Memory bandwidth also gets a boost thanks to a 512-bit bus that moves up to 102 GBps, up from the 8-series' 77 GBps peak. Of course, the new Tesla products support PCI Express 2.0, yielding gains in systems with multiple cards contending for bandwidth.
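The bandwidth figure is straightforward to check. Assuming the 800 MHz GDDR3 memory clock NVIDIA quotes for these boards (double data rate, so two transfers per clock) across the 512-bit bus:

```python
bus_bits = 512              # memory interface width
mem_clock_hz = 800e6        # GDDR3 command clock
transfers_per_clock = 2     # DDR moves data on both clock edges
gbps = bus_bits / 8 * mem_clock_hz * transfers_per_clock / 1e9
print(gbps)                 # 102.4 GB/s
```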
Like the new GeForce cards, the Tesla T10 supports IEEE 754 double-precision floating-point encoding, which matters far more to the HPC community than to the desktop. In fact, NVIDIA is confident that the addition of double precision will open Tesla up to an entire market of applications it couldn't touch before. With that said, a couple of technology partners made it a point to observe that intelligent use of single precision is often as effective as double precision, and faster. So, it remains to be seen how much double-precision support will accelerate Tesla's adoption.
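As a concrete instance of "intelligent" single precision, compensated (Kahan) summation recovers most of the accuracy lost when accumulating many small values; the partners didn't name their techniques, so this one is our own illustration, sketched in Python with single-precision rounding simulated via the struct module:

```python
import struct

def f32(x):
    """Round a Python float (double) to single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum_f32(values):
    total = 0.0
    for v in values:
        total = f32(total + v)   # every add rounds to single precision
    return total

def kahan_sum_f32(values):
    total = 0.0
    comp = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = f32(v - comp)
        t = f32(total + y)
        comp = f32(f32(t - total) - y)
        total = t
    return total
```

Summing 0.01 a hundred thousand times, the compensated loop stays within a few hundred-thousandths of 1,000, while the naive single-precision loop drifts noticeably further.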
The Tesla S1070 and C1060
If you pop the top on the S1070, you'll see that the 1U box contains a power supply and the four cards plugged into a riser. There is no motherboard, no CPU, and no system memory. The Tesla S1070 is designed to work in conjunction with a host server rather than as a standalone product. As such, interfacing with the S1070 requires an external connection, or in this case, two. The Tesla employs two second-gen PCI Express x16 pathways between the four T10s and the host machine. Naturally, that means you need a server with two x16 PCIe 2.0 expansion slots to accommodate NVIDIA's host interface cards. From there, a pair of half-meter cables joins the host and the 1U Tesla box.
Our first question was, "How many rack-mount servers sport multiple PCI Express x16 slots, much less at 2.0 signaling rates?" If you look hard enough, though, there are already a handful out there. Supermicro, for instance, sells a 4U/pedestal box built around Intel's 5400 chipset. A pair of PCIe 2.0 slots serves up the bandwidth needed to run NVIDIA's Tesla at full speed.
The same Tesla T10 processor is also available on a dual-slot PCI Express x16 add-in card with very similar specifications. According to NVIDIA, the board's GPU has to be clocked slightly lower, at 1.33 GHz, to account for the fact that every workstation's airflow and thermal performance will be a bit different. You still get 240 thread processors, though, and a 512-bit memory interface armed with as much as 4GB of GDDR3. Memory bandwidth remains consistent at 102 GBps, since the 800 MHz memory clock is unchanged. The only other real change is a slight drop in power consumption due to the processor's lower operating frequency.
The performance improvements of a single Tesla T10 processor over the last-generation Tesla C870 are, as stated by NVIDIA, significantly higher than the raw horsepower increase might indicate. On the one hand, NVIDIA presented us with plenty of benchmarks showing the T10 running two, three, even four times faster than the G80. On the other, when we asked about objectively testing Tesla's benefits ourselves, we learned that there's no protocol in place yet for evaluating hardware designed to run highly specialized (and often very sensitive) software. And since the applications are so customized, it'd be difficult to draw overarching conclusions about Tesla's performance from only a handful of benchmarks. Then again, with AMD and Intel sure to follow NVIDIA's lead in GPU-based computing, there'll need to be some mechanism for comparison sooner or later.
CUDA and the Future
Enabled By CUDA
Five organizations with experience in CUDA were represented at the NVIDIA event, each with a slightly different spin on the technology in action. For instance, a company called TechniScan Medical Systems is using CUDA-enabled software for a very practical purpose. Its Whole Breast Ultrasound system creates three-dimensional images, which are then used to help diagnose abnormalities. Powered by a Pentium M cluster, the scanner takes nearly five hours to create its image, far too long for a doctor to review the scan with her patient in the same visit. A 16-core Core 2-based cluster cuts the time down to 45 minutes. A quartet of Tesla GPUs gets that number down to 16 minutes, which is much more acceptable for same-visit results.
We also saw presentations demonstrating the benefits of GPU computing through the eyes of astronomers, the finance industry, cancer research, and academia, where HPC, scientific computing, visualization, virtual 3D audio, and computer vision all come together.
More relevant on the desktop, a company called Elemental Technologies is developing a non-linear editor for Adobe’s Premiere Pro that will let you render Blu-ray-quality H.264 video in real-time. So long as you have a Quadro card based on at least the G80 GPU, the CUDA-enabled add-on will yield performance gains.
The Future of HPC?
Let's say you work for an enterprise and you've been tasked with designing a datacenter capable of 100 teraflops. According to NVIDIA, tackling the job with quad-core CPUs in 1U boxes, you'd need 1,429 servers at $4,000 apiece. Sipping 400W each, that cluster would consume 571KW of power and cost nearly $6 million to procure.
You’d purportedly get the same horsepower from 25 servers and as many 1U Tesla S1070 systems running in a heterogeneous computing environment. With each of the Tesla boxes priced around $8,000, added to the $4,000 servers, you’d be looking at a scant $300,000 total. And even though the S1070s use as much as 700W each, the total package would still only need 27KW.
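NVIDIA's comparison reduces to back-of-the-envelope arithmetic using the per-unit prices and power figures above:

```python
# CPU-only route, per NVIDIA's figures
servers = 1429
cpu_cost = servers * 4000               # $5,716,000, "nearly $6 million"
cpu_power_kw = servers * 400 / 1000     # 571.6 kW

# Heterogeneous route: 25 host servers, each paired with a Tesla S1070
pairs = 25
gpu_cost = pairs * (4000 + 8000)        # $300,000
gpu_power_kw = pairs * (400 + 700) / 1000   # 27.5 kW (NVIDIA rounds down to 27 kW)

print(cpu_cost, cpu_power_kw, gpu_cost, gpu_power_kw)
```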
Of course, NVIDIA’s scenario leaves out one important data point: the cost of re-coding an application to run in the CUDA environment. No doubt that’ll be substantial for many considering a transition to GPU-based computing. We’re only 12 months into the movement though, and already NVIDIA says there are a few dozen commercial CUDA applications.
No doubt the prospects for businesses and academic institutions are great. But the most interesting thing for desktop enthusiasts is that every GeForce 8-series and later card supports CUDA. So, as optimized software continues to emerge, the hardware infrastructure is already in place, even if you won't get the massive performance of a Tesla or Quadro card loaded down with 4GB of memory.