NVIDIA Editor's Day: The Re-Introduction of Tesla
Date: Jun 16, 2008
Author: Chris Angelini

When you visit NVIDIA’s Web site and hit the Products drop-down menu, a long list of the company’s offerings scrolls down in front of you. That’s impressive for an organization that originally made its name by designing the fastest desktop display adapters.

Graphics processors for the desktop, workstation, and server space still dominate NVIDIA’s portfolio. But it’s also involved in notebooks, handhelds, software development, and more recently, high-performance computing.

For the uninitiated, high-performance computing (or HPC) has historically involved leveraging large clusters, which are used to crunch applications or algorithms demanding massive horsepower. Physics calculations, weather forecasting, manufacturing, medical imaging, cancer research, financial analysis—these are some of the fields with problems so complex that they require the cooperative efforts of HPC configurations.

The top 500 architectures, broken down by classification

Notice that the fastest supercomputers employ many-core designs

The composition of the fastest 500 systems, tracked at top500.org, has changed fairly dramatically since the list began in 1993. Back then, symmetric multiprocessing and massively parallel processors dominated the scene. And while some of the most powerful supercomputers today still leverage the strengths of MPP architecture, a majority of the top 500 now employ clusters of commodity x86 processors. That’s not a bad way to fill up a few racks’ worth of space while sucking down hundreds of kilowatts of power.

Rise of the GPU

The thing about multi-core CPUs is that they’re most effective when the workload isn’t known in advance, thanks to tight execution pipelines and large caches. When the problem being addressed is known, as is often the case in HPC applications, the many-core engines driving the fastest supercomputers are much more potent.

NVIDIA's CUDA programming environment, catching on with a number of enterprises

The Tesla T10 processor sports double precision floating-point, a first for NVIDIA

Care for an example of a many-core processor in action on your desktop? Just look to the GeForce 8-series or Radeon HD 3000-series cards. Each centers on an architecture designed for the extremely parallel problem of graphics rendering. However, with the help of a software programming environment called CUDA, NVIDIA is showing the HPC market how to take advantage of its GPUs’ massive floating-point horsepower for more general-purpose computing. One specialized application at a time, developers seem to be catching on.

The latest example of NVIDIA’s many-core processors taking the place of computing clusters comes from the University of Antwerp, where researchers studying tomography (imaging by sections, as in CT scans) built up a desktop machine with four GeForce 9800 GX2 graphics cards. According to Dr. K. J. Batenburg, a researcher on the project, just one of the eight GPUs in his system (named FASTRA) is up to 40 times more powerful than a PC in the small cluster he previously had running tomographical reconstructions.

CUDA is very similar to standard C programming with a couple of extensions

A projection of cost, power, and space savings with GPU computing

We recently had the opportunity to visit NVIDIA’s Santa Clara campus for more information on the company’s HPC efforts, and learned from Dan Vivoli, executive vice president of marketing at NVIDIA, that FASTRA delivers performance equivalent to about eight tons of rack-mount equipment drawing 230 kW of power, all in a package that’d cost roughly $6,000 to build. During the course of our day we heard from five of NVIDIA’s partners who are now using the company’s graphics technology to accelerate complex problems that either couldn’t be solved affordably before or simply took much longer to address than they do now.


NVIDIA's Tesla 10-Series, Exposed

But we’re getting ahead of ourselves. Before NVIDIA rolled out the red carpet for its academic and commercial adorers, it introduced a new development in its HPC product lineup promising to more than double the previous generation’s performance numbers.

The Belgian researchers who built FASTRA used a quartet of desktop GeForce 9800 GX2 cards in their supercomputing experiment to get the speed they were looking for at a reasonable price. A business relying on continuous uptime, impeccable accuracy, and drivers optimized for professional software probably wouldn’t follow suit. NVIDIA cautioned that while GeForce works great as a development platform, it’s probably not the best choice for production work.

Next question: so what would someone in a production environment use? Well, there’s Quadro—a name most workstation users likely recognize. Those cards center on the same underlying architecture as the GeForce boards with a handful of hardware enhancements and completely retooled drivers. But you’re still looking at (and paying for) a graphics product. Enter NVIDIA’s Tesla computing processor.

The new Tesla T10 and its 240 cores

A Threading Processor Cluster, in depth

Tesla is already for sale. You can hop onto NVIDIA’s online store and pick up a Tesla C870 for $1,299. The card sports specifications that look a lot like the Quadro FX 5600 (priced at $2,999 on the same site), including 1.5GB of GDDR3 memory, 128 onboard streaming processor cores and a PCI Express x16 interface. There’s no display output, though. The Tesla is exclusive to the HPC market.

For clarity’s sake, NVIDIA classifies Quadro as a superset of Tesla, armed with all of the same features, plus graphics, which is why you pay an extra $1,700 for a Quadro FX 5600 card. All three product lines—GeForce, Tesla, and Quadro—are armed with similar silicon and thus equipped to power through applications enabled by CUDA. 

Tesla 10-Series, Uncovered

The big news at NVIDIA is a second generation of Tesla products based on the 10-series GPU that you’ll also see driving fresh desktop and workstation cards. The 10-series chip is massive, boasting 240 processing cores (nearly 2x its predecessor), 1.4 billion transistors (again, close to double), and close to 1 teraflop of peak single-precision processing power (you guessed it—twice that of the C870 board).

If you’ve already read reviews of the new GeForce boards, then you already have the scoop on Tesla’s T10 processor. The chip is a SIMT (single instruction, multiple thread) architecture, which allows software developers to think about their functions and threads rather than vectors. As mentioned, it wields 240 thread processors, broken up into 30 TPAs (Threading Processor Arrays), each with eight TPs.

And whereas the previous generation topped out at 1.5GB of memory, the 10-series supports up to 4GB. If that sounds excessive, consider that HPC datasets often include terabytes’ worth of information. We also talked to a couple of different developers at the NVIDIA event who said they were holding out for a 4GB card before diving into this latest generation, specifically because it’d give them the most palpable gains. Memory bandwidth also gets a boost thanks to a 512-bit bus that moves up to 102 GBps, up from the 8-series’ 77 GBps peak. Of course, the new Tesla products support PCI Express 2.0, yielding gains in systems with multiple cards contending for bandwidth.

Projected performance numbers, collected by NVIDIA

Many-core versus multi-core scaling, per NVIDIA

Like the new GeForce cards, the Tesla T10 supports IEEE 754 double-precision floating-point math, a feature much more significant to the HPC community than to the desktop. In fact, NVIDIA is confident that the addition of double precision will open Tesla up to an entire market of applications it couldn’t touch before. With that said, a couple of technology partners made it a point to observe that an intelligent implementation of single precision is often just as effective as, and faster than, double precision. So, it remains to be seen how much double-precision support will help Tesla’s adoption.
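
To make the single-versus-double point a little more concrete, here is a minimal Python sketch of one well-known way to stretch single precision: compensated (Kahan) summation. The partners at the event didn’t say which techniques they rely on, so this choice is our own illustration rather than their method. The helper names (f32, naive_sum_f32, kahan_sum_f32) are hypothetical, and single precision is emulated on the CPU by rounding every intermediate result through the struct module.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Accumulate in single precision with no error compensation."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def kahan_sum_f32(values):
    """Accumulate in single precision, carrying the rounding error forward."""
    total = 0.0
    comp = 0.0  # low-order bits lost by the previous addition
    for v in values:
        y = f32(f32(v) - comp)          # re-inject the previously lost bits
        t = f32(total + y)              # this addition may round bits away
        comp = f32(f32(t - total) - y)  # measure what was just lost
        total = t
    return total


# Illustrative data only: 100,000 copies of 0.1, whose true sum is 10,000.
values = [0.1] * 100_000
exact = 10_000.0

naive_total = naive_sum_f32(values)
kahan_total = kahan_sum_f32(values)
double_total = sum(values)  # plain Python floats are double precision

print(f"naive single:  {naive_total!r}")
print(f"kahan single:  {kahan_total!r}")
print(f"plain double:  {double_total!r}")
```

In this toy case the naive single-precision total drifts visibly from the true answer, while the compensated version stays close to the double-precision result at the cost of a few extra operations per element. Whether a trick like this is enough depends entirely on the application.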


The Tesla S1070 and C1060

Tesla S1070

NVIDIA had two very different Tesla-based SKUs to show off. One was a 1U platform and the other was a standard double-slot PCI Express add-in card. The rack-mount Tesla S1070 is the more impressive of the two, armed with a quartet of Tesla T10 processors totaling 960 cores. NVIDIA brags that because it controls all aspects of the 1U platform’s power and cooling, the S1070’s four T10s are clocked faster than the standalone card. Each 1.5 GHz processor in the S1070 is complemented by 4GB of GDDR3 memory running at 800 MHz, and together the four processors yield four teraflops of performance.
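
The four-teraflop figure can be roughly reconstructed from the core count and clock speed. What follows is a back-of-the-envelope Python sketch, not NVIDIA’s own math. It assumes each thread processor can issue three single-precision floating-point operations per clock (a multiply-add plus a multiply), which is our assumption about how the peak is counted. The function name peak_tflops is made up for illustration.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_clock: int = 3) -> float:
    """Theoretical peak single-precision throughput, in teraflops.

    flops_per_clock=3 is an assumption (multiply-add plus multiply per clock),
    not a number taken from the article.
    """
    gigaflops = cores * clock_ghz * flops_per_clock
    return gigaflops / 1000.0


CORES_PER_T10 = 240    # 960 cores spread across four processors
T10S_PER_S1070 = 4
S1070_CLOCK_GHZ = 1.5

s1070_cores = CORES_PER_T10 * T10S_PER_S1070
s1070_peak = peak_tflops(s1070_cores, S1070_CLOCK_GHZ)

print(f"{s1070_cores} cores, about {s1070_peak:.2f} TFLOPS theoretical peak")
```

Under that assumption the peak works out to a little over four teraflops, in line with the figure NVIDIA quotes. Real applications will land somewhere below any theoretical peak.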

NVIDIA's 1U Tesla S1070 platform with four T10 processors

Connecting to a host server using two external PCIe x16 interfaces

If you pop the top on the S1070, you’ll see that the 1U box contains a power supply and the four cards plugged into a riser. There is no motherboard, no CPU, and no system memory. The Tesla S1070 is designed to work in conjunction with a host server rather than as a standalone product. As such, interfacing with the S1070 requires an external connection, or in this case, two. The Tesla employs two second-gen PCI Express x16 pathways between the four T10s and the host machine. Naturally, that means you need a server with two x16 PCIe 2.0 expansion slots, which take NVIDIA’s host interface cards. From there, a pair of half-meter cables joins the host and the 1U Tesla box.
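
For a sense of what those two links can carry, here is a rough Python estimate. The per-lane signaling rate (5 GT/s) and the 8b/10b encoding overhead are the standard PCI Express 2.0 figures rather than numbers from NVIDIA’s presentation, and the even split across four T10s is our own simplification.

```python
PCIE2_GT_PER_S = 5.0          # raw transfer rate per lane for PCIe 2.0, in GT/s
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding: 8 payload bits per 10 bits sent
LANES_PER_LINK = 16
LINKS = 2                     # the S1070 connects through two x16 links
T10_COUNT = 4


def link_gbytes_per_s(lanes: int) -> float:
    """Usable bandwidth of one PCIe 2.0 link, per direction, in GB/s."""
    usable_gbits = PCIE2_GT_PER_S * ENCODING_EFFICIENCY * lanes
    return usable_gbits / 8  # bits to bytes


per_link = link_gbytes_per_s(LANES_PER_LINK)
total = per_link * LINKS
per_t10 = total / T10_COUNT  # assumes all four processors share the links evenly

print(f"per link:  {per_link:.1f} GB/s each way")
print(f"both links: {total:.1f} GB/s each way")
print(f"per T10:   {per_t10:.1f} GB/s each way")
```

These are per-direction numbers before protocol overhead, so real transfers will be slower. Either way, the host connection is far narrower than the 102 GBps each T10 has to its own memory, which suggests that data is best kept on the card for as long as possible.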

The front of NVIDIA's 1U Tesla S1070

Notice the second-gen PCIe x16 ports

Our first question was, “how many rack-mount servers sport multiple PCI Express x16 slots, much less at 2.0 signaling rates?” If you look hard enough though, there are already a handful of them out there. Supermicro, for instance, sells a 4U/pedestal box with Intel’s 5400 chipset inside. A pair of PCIe 2.0 slots serves up the needed bandwidth to get NVIDIA’s Tesla running at full speed.

Tesla C1060

The same Tesla T10 processor is also available on a dual-slot PCI Express x16 add-in card with very similar specifications. According to NVIDIA, it has to clock the board’s GPU slightly lower, at 1.33 GHz, to account for the fact that each workstation’s airflow and thermal performance is going to be a bit different. You still get 240 thread processors though, and a 512-bit memory interface armed with as much as 4GB of GDDR3. Memory bandwidth remains consistent at 102 GBps, attributable to the 800 MHz clock. The only other real change is a slight drop in power consumption due to the processor’s lower operating frequency.
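
The 102 GBps figure follows directly from the bus width and memory clock. Here is a quick Python check, assuming GDDR3 moves data on both edges of the clock (two transfers per cycle). The function name is ours.

```python
def memory_bandwidth_gbps(bus_width_bits: int, clock_mhz: float,
                          transfers_per_clock: int = 2) -> float:
    """Peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# 512-bit interface, 800 MHz GDDR3, double data rate
t10_bandwidth = memory_bandwidth_gbps(bus_width_bits=512, clock_mhz=800)

print(f"T10 peak memory bandwidth: {t10_bandwidth:.1f} GB/s")
```

That is 64 bytes per transfer at 1.6 billion transfers per second, or 102.4 GB/s, which NVIDIA rounds down to 102.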

From above, the Tesla C1060 looks like an ordinary graphics card

An enveloping thermal solution helps cool the T10 core and surrounding DRAMs

The performance improvements of a single Tesla T10 processor versus last-generation’s Tesla C870 are, as stated by NVIDIA, significantly higher than the horsepower increase might otherwise indicate. On the one hand, NVIDIA presented us with plenty of benchmarks to show the T10 running two, three—even four times faster than the G80. On the other, when we asked NVIDIA about testing the benefits of Tesla objectively ourselves, we learned that there’s no protocol in place yet for evaluating hardware designed to run highly specialized (and often very sensitive) software. And since the apps are so customized, it’d be difficult to draw overarching conclusions about Tesla’s performance given only a handful of benchmarks. Then again, with AMD and Intel sure to follow NVIDIA’s lead in GPU-based computing, there’ll need to be some mechanism for comparison sooner or later.


CUDA and the Future

Enabled By CUDA

Utilizing Tesla’s processing power isn’t as simple as adding a discrete card to your favorite workstation. Rather, the application you’re looking to accelerate must be written specifically to take advantage of the hardware, which, in the case of existing software, means an effort must be made to re-code in NVIDIA’s CUDA development environment. Yet, with 70 million CUDA-capable GPUs in the wild and a second generation recently launched, developers are starting to see a much larger potential audience for CUDA-enabled apps.

NVIDIA recently posted the beta of CUDA 2.0, which adds support for the 32- and 64-bit versions of Windows Vista and the Tesla T10’s double-precision capability. The package includes a CUDA toolkit and the CUDA SDK. CUDA-compatible drivers are already included in every display driver download. Note that CUDA is free and doesn’t require registration to download. If you’re into programming and want to give NVIDIA’s latest a look, the company wants you to dive in.

CUDA support, from integrated core logic to add-in boards to Tesla servers

The Tesla C1060 runs at 1.33 GHz and includes up to 4GB of GDDR3 memory

Five organizations with experience in CUDA were represented at the NVIDIA event, each with a slightly different spin on the technology in action. For instance, a company called TechniScan Medical Systems is using CUDA-enabled software for a very practical purpose. Its Whole Breast Ultrasound system creates three-dimensional images, which are then used to help diagnose abnormalities. Powered by a Pentium M cluster, the scanner takes nearly five hours to create its image, which is far too long for a doctor to take the scan and go over it with her patient in the same visit. A 16-core Core 2-based cluster cuts the time down to 45 minutes. A quartet of Tesla GPUs gets that number down to 16 minutes, much more acceptable for same-visit results.
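
Expressed as speedups, TechniScan’s numbers look like this. It is a small Python sketch that treats "nearly five hours" as a flat 300 minutes, so it slightly overstates the Pentium M time. The labels and the speedup helper are ours.

```python
# Reconstruction times in minutes, as quoted at the event.
# 300 is an approximation of "nearly five hours".
scan_minutes = {
    "Pentium M cluster": 300,
    "16-core Core 2 cluster": 45,
    "Four Tesla GPUs": 16,
}


def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new setup finishes the same job."""
    return baseline_minutes / new_minutes


tesla = scan_minutes["Four Tesla GPUs"]
vs_pentium_m = speedup(scan_minutes["Pentium M cluster"], tesla)
vs_core2 = speedup(scan_minutes["16-core Core 2 cluster"], tesla)

print(f"Tesla vs. Pentium M cluster: {vs_pentium_m:.1f}x")
print(f"Tesla vs. Core 2 cluster:    {vs_core2:.1f}x")
```

By that math, the four Tesla GPUs finish close to 19 times faster than the Pentium M cluster and just under three times faster than the 16-core Core 2 cluster.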

We also saw presentations demonstrating the benefits of GPU computing through the eyes of astronomers, the finance industry, cancer research, and academia, where HPC, scientific computing, visualization, virtual 3D audio, and computer vision all come together.

More relevant on the desktop, a company called Elemental Technologies is developing a plug-in for Adobe’s Premiere Pro non-linear editor that will let you render Blu-ray-quality H.264 video in real time. So long as you have a Quadro card based on at least the G80 GPU, the CUDA-enabled add-on will yield performance gains.

The Future of HPC?

Let’s say you work for an enterprise and you’ve been tasked with designing a datacenter capable of 100 teraflops. According to NVIDIA, tackling the job with quad-core CPUs in 1U boxes, you’d need 1,429 servers at $4,000 apiece. Sipping 400W each, that cluster would consume 571 kW of power and cost nearly $6 million to procure.

You’d purportedly get the same horsepower from 25 servers and as many 1U Tesla S1070 systems running in a heterogeneous computing environment. With each of the Tesla boxes priced around $8,000, added to the $4,000 servers, you’d be looking at a scant $300,000 total. And even though the S1070s use as much as 700W each, the total package would still only need about 27 kW.
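
NVIDIA’s two scenarios are easy to check. The Python sketch below simply multiplies out the unit counts, prices, and power figures quoted above. The Node class and totals function are hypothetical names of our own, and all inputs are NVIDIA’s claims rather than anything we measured.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    unit_cost_usd: int
    unit_power_w: int


def totals(build):
    """Return (total cost in USD, total power in kW) for a list of (Node, count)."""
    cost = sum(node.unit_cost_usd * count for node, count in build)
    power_kw = sum(node.unit_power_w * count for node, count in build) / 1000
    return cost, power_kw


server = Node("1U quad-core server", unit_cost_usd=4_000, unit_power_w=400)
s1070 = Node("Tesla S1070", unit_cost_usd=8_000, unit_power_w=700)

cpu_cost, cpu_kw = totals([(server, 1_429)])
gpu_cost, gpu_kw = totals([(server, 25), (s1070, 25)])

print(f"CPU-only: ${cpu_cost:,} and {cpu_kw:.1f} kW")
print(f"With Tesla: ${gpu_cost:,} and {gpu_kw:.1f} kW")
print(f"Cost ratio: {cpu_cost / gpu_cost:.1f}x, power ratio: {cpu_kw / gpu_kw:.1f}x")
```

The CPU-only build comes to roughly $5.7 million and 572 kW, while the Tesla build lands at $300,000 and 27.5 kW. On NVIDIA’s own inputs, that is about a 19-fold difference in purchase cost and a 21-fold difference in power.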

Top-down on the Tesla S1070, comprised of four T10-based cards and a power supply

The add-in Tesla C1060 doesn't include display outputs; it's all about accelerating HPC applications.

Of course, NVIDIA’s scenario leaves out one important data point: the cost of re-coding an application to run in the CUDA environment. No doubt that’ll be substantial for many considering a transition to GPU-based computing. We’re only 12 months into the movement though, and already NVIDIA says there are a few dozen commercial CUDA applications.

No doubt the prospects for businesses and academic institutions are great. But the most interesting thing for desktop enthusiasts is that every GeForce card from the 8-series on up supports CUDA. So, as optimized software continues emerging, the hardware infrastructure is in place, even if you won’t get the massive performance of a Tesla or Quadro card loaded down with 4GB of memory.


Content Property of HotHardware.com