The Tesla S1070 and C1060
NVIDIA had two very different Tesla-based SKUs to show off: a 1U rack-mount platform and a standard double-slot PCI Express add-in card. The rack-mount Tesla S1070 is the more impressive of the two. It is armed with a quartet of Tesla T10 processors totaling 960 cores, and NVIDIA brags that because it controls all aspects of the 1U platform’s power and cooling, the S1070’s four T10s are clocked faster than the standalone card. Each 1.5 GHz processor in the S1070 is complemented by 4GB of GDDR3 memory running at 800 MHz, and together the four processors yield four teraflops of performance.
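As a rough sanity check on the four-teraflop figure, peak throughput can be estimated as cores times clock times floating-point operations per core per clock. The snippet below is only a back-of-the-envelope sketch. The core count, processor count, and clock come from the figures quoted above, while the value of three operations per clock (a multiply-add plus a multiply) is an assumption that NVIDIA did not spell out here. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope peak throughput for the Tesla S1070.
# ASSUMPTION (not from the article): each core retires 3 single-precision
# floating-point operations per clock (one multiply-add plus one multiply).

CORES_TOTAL = 960               # from the article
GPUS = 4                        # from the article
CLOCK_GHZ = 1.5                 # from the article
FLOPS_PER_CORE_PER_CLOCK = 3    # assumption, see note above


def peak_gflops(cores, clock_ghz, flops_per_clock=FLOPS_PER_CORE_PER_CLOCK):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x ops per clock."""
    return cores * clock_ghz * flops_per_clock


cores_per_gpu = CORES_TOTAL // GPUS
per_gpu_gflops = peak_gflops(cores_per_gpu, CLOCK_GHZ)
total_gflops = peak_gflops(CORES_TOTAL, CLOCK_GHZ)

print(f"Cores per T10:     {cores_per_gpu}")
print(f"Peak per T10:      {per_gpu_gflops / 1000:.2f} TFLOPS")
print(f"Peak for the 1U:   {total_gflops / 1000:.2f} TFLOPS")
```

Under that assumption, each T10 works out to a little over one teraflop and the full box to a little over four, which lines up with the claim. This is a theoretical ceiling, not a measured result.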
NVIDIA's 1U Tesla S1070 platform with four T10 processors
Connecting to a host server using two external PCIe x16 interfaces
If you pop the top on the S1070, you’ll see that the 1U box contains a power supply and the four cards plugged into a riser. There is no motherboard, no CPU, and no system memory. The Tesla S1070 is designed to work in conjunction with a host server rather than as a standalone product. As such, interfacing with the S1070 requires an external connection, or in this case, two. The Tesla employs two second-gen PCI Express x16 pathways between the four T10s and the host machine. Naturally, that means you need a server with two x16 PCIe 2.0 expansion slots to take NVIDIA’s host interface cards. From there, a pair of half-meter cables joins the host and the 1U Tesla box.
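To put those two links in perspective, here is a rough estimate of how much host bandwidth they provide. It is a sketch under stated assumptions: the 5 GT/s per-lane signaling rate and 8b/10b encoding are the published PCIe 2.0 parameters rather than numbers NVIDIA quoted, the even split across four processors is a simplification, and the names are invented for illustration.

```python
# Rough host-link bandwidth for the S1070's two PCIe 2.0 x16 connections.
# ASSUMPTIONS (not from the article): 5 GT/s per lane and 8b/10b encoding,
# which are the standard PCIe 2.0 parameters. Protocol overhead is ignored.

GT_PER_S_PER_LANE = 5.0        # assumption: PCIe 2.0 signaling rate
ENCODING_EFFICIENCY = 8 / 10   # assumption: 8b/10b line encoding
LANES_PER_LINK = 16            # from the article (x16)
LINKS = 2                      # from the article
GPUS = 4                       # from the article


def link_gbytes_per_s(lanes, gt_per_s=GT_PER_S_PER_LANE,
                      efficiency=ENCODING_EFFICIENCY):
    """Usable bandwidth of one link, per direction, in GB/s."""
    usable_gbits = gt_per_s * efficiency * lanes
    return usable_gbits / 8    # bits to bytes


per_link = link_gbytes_per_s(LANES_PER_LINK)
both_links = per_link * LINKS
per_gpu_share = both_links / GPUS   # simplification: all four GPUs busy at once

print(f"One x16 link:        {per_link:.1f} GB/s per direction")
print(f"Both links combined: {both_links:.1f} GB/s per direction")
print(f"Share per T10:       {per_gpu_share:.1f} GB/s per direction")
```

The takeaway is that each processor's slice of the host link is small next to its local memory bandwidth, so workloads that keep data resident on the Tesla side should fare best.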
The front of NVIDIA's 1U Tesla S1070
Notice the second-gen PCIe x16 ports
Our first question was, “How many rack-mount servers sport multiple PCI Express x16 slots, much less at 2.0 signaling rates?” If you look hard enough, though, there are already a handful of them out there. Supermicro, for instance, sells a 4U/pedestal box with Intel’s 5400 chipset inside. Its pair of PCIe 2.0 slots serves up the bandwidth needed to get NVIDIA’s Tesla running at full speed.
The same Tesla T10 processor is also available on a dual-slot PCI Express x16 add-in card, the Tesla C1060, with very similar specifications. According to NVIDIA, it has to clock the board’s GPU slightly lower, at 1.33 GHz, because airflow and thermal performance vary from one workstation to the next. You still get 240 thread processors, though, and a 512-bit memory interface armed with as much as 4GB of GDDR3. Memory bandwidth is unchanged at 102 GBps, thanks to the same 800 MHz memory clock. The only other real change is a slight drop in power consumption due to the processor’s lower operating frequency.
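The 102 GBps figure follows directly from the bus width and memory clock. The short sketch below reproduces the arithmetic. It assumes GDDR3 moves data twice per clock (double data rate), which is how the memory type works but is not stated above, and the function name is made up for illustration.

```python
# Memory bandwidth from bus width and memory clock.
# ASSUMPTION (not from the article): GDDR3 is double data rate,
# i.e. two transfers per clock cycle.

BUS_WIDTH_BITS = 512       # from the article
MEMORY_CLOCK_MHZ = 800     # from the article
TRANSFERS_PER_CLOCK = 2    # assumption: double data rate


def memory_bandwidth_gbytes(bus_width_bits, clock_mhz,
                            transfers_per_clock=TRANSFERS_PER_CLOCK):
    """Peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    mbytes_per_s = bytes_per_transfer * clock_mhz * transfers_per_clock
    return mbytes_per_s / 1000


bandwidth = memory_bandwidth_gbytes(BUS_WIDTH_BITS, MEMORY_CLOCK_MHZ)
print(f"Peak memory bandwidth: {bandwidth:.1f} GB/s")
```

Because neither the bus width nor the memory clock changes between the S1070 and the add-in card, the result is the same for both, even though the GPU core clocks differ.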
From above, the Tesla C1060 looks like an ordinary graphics card
An enveloping thermal solution helps cool the T10 core and surrounding DRAMs
According to NVIDIA, the performance improvement of a single Tesla T10 processor over the last-generation Tesla C870 is significantly larger than the increase in raw horsepower might otherwise indicate. On the one hand, NVIDIA presented us with plenty of benchmarks showing the T10 running two, three, even four times faster than the G80. On the other, when we asked NVIDIA about testing the benefits of Tesla objectively ourselves, we learned that there’s no protocol in place yet for evaluating hardware designed to run highly specialized (and often very sensitive) software. And since the apps are so customized, it’d be difficult to draw overarching conclusions about Tesla’s performance from only a handful of benchmarks. Then again, with AMD and Intel sure to follow NVIDIA’s lead in GPU-based computing, there’ll need to be some mechanism for comparison sooner or later.