Nvidia Launches Supercomputing-focused K20, K20X

When Nvidia launched the consumer-oriented GK104 earlier this year, the company made it clear that the enthusiast-oriented GPU was the first iteration of a two-GPU strategy. K20, we were told, would launch later in the year, with certain features aimed at accelerating supercomputing and HPC workloads. Today, Nvidia is taking the wraps off that second GPU. As expected, it's a monster; the Nvidia K20, based on the GK110 GPU, weighs in at 7.1B transistors, double the GK104's 3.54B.

GK110 keeps Kepler's basic SMX structure. Each SMX unit contains 192 CUDA cores, 32 load/store units, 16 texture units, and 4 warp schedulers. There are 15 SMX units per die, compared to GK104's eight. Neither the K20 nor the K20X that Nvidia is announcing today actually uses all 15 SMXs: K20 has 13 enabled, K20X has 14. The memory bus is also wider, at 384 bits for the K20X and 320 bits for the K20.
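For a sense of scale, the SMX counts above can be multiplied out into CUDA core totals. The short Python sketch below does only that arithmetic; the names `CORES_PER_SMX` and `cuda_cores` are ours, chosen for illustration, and the figures say nothing about clock speeds or real-world throughput.

```python
# Back-of-the-envelope core counts from the per-SMX figure quoted in the article.
CORES_PER_SMX = 192  # CUDA cores per SMX unit (from the article)


def cuda_cores(enabled_smx):
    """Return the number of CUDA cores for a part with `enabled_smx` SMX units."""
    return enabled_smx * CORES_PER_SMX


# SMX counts as given in the article; the labels are illustrative only.
parts = {
    "GK104": 8,
    "K20": 13,
    "K20X": 14,
    "GK110 (all SMX units enabled)": 15,
}

for name, smx in parts.items():
    print(f"{name}: {smx} SMX x {CORES_PER_SMX} = {cuda_cores(smx)} CUDA cores")
```

That works out to 1,536 cores for GK104, 2,496 for the K20, 2,688 for the K20X, and 2,880 for a fully enabled GK110.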

One major difference between GK110 and GK104 is the allocation of double-precision floating point units within the SMX. The GK104 and GK110 SMXs are shown below, though it's not clear whether the GK110 actually has an entirely separate double-precision FPU or uses pairs of single-precision units to hit high throughput figures. The GK110 is on the left below, the GK104 is on the right.

The GK110 is capable of pairing double-precision operations with other instructions (Fermi and GK104 couldn't), and the number of registers each thread can access has been roughly quadrupled, from 63 to 255. Threads within a warp are now capable of sharing data. K20 also supports a greater number of atomic operations and brings two new features to the table: Dynamic Parallelism and Hyper-Q.

Dynamic Parallelism refers to the GPU's ability to spin off new threads of work directly without passing data back to the CPU. This reduces execution latency and improves power efficiency by leaving the CPU free for other tasks. Hyper-Q takes a bit more explanation. GK104 and the Fermi chips that came before it supported 16-way concurrency between different work streams but ultimately aggregated the work into a single execution queue. What this means in English is that it was difficult to take full advantage of Fermi's execution resources when juggling multiple workloads or performing different tasks.

Hyper-Q, according to Nvidia, "allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks thereby limiting GPU utilization, can see up to a 32x performance increase without changing any code." 32x is obviously a worst-case scenario for Fermi, but the advantage here is real.
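To see how a single execution queue can produce false serialization, consider a toy model in Python. This is our own sketch, not Nvidia's scheduler, and every simplification in it is an assumption: each task takes exactly one time step, tasks within a stream must run in order, work is submitted one whole stream at a time, only the task at the head of a queue may launch, and the GPU always has room for whatever is ready. The function name `makespan` and its parameters are hypothetical.

```python
from collections import deque


def makespan(num_streams, tasks_per_stream, num_queues):
    """Time steps needed to drain all work in a toy model of hardware work queues.

    Assumptions (ours, for illustration): unit-length tasks, in-order execution
    within a stream, depth-first submission, unlimited execution slots.
    """
    # Stream s is mapped to queue s % num_queues, so all tasks of one stream
    # sit in the same queue in submission order.
    queues = [deque() for _ in range(num_queues)]
    for s in range(num_streams):
        for k in range(tasks_per_stream):
            queues[s % num_queues].append((s, k))

    finished = set()  # tasks that completed in an earlier time step
    steps = 0
    while any(queues):
        launched = []
        for q in queues:
            # Only the head of a queue can launch. It is ready if it is the
            # first task of its stream or its predecessor has already finished.
            while q:
                s, k = q[0]
                if k == 0 or (s, k - 1) in finished:
                    launched.append(q.popleft())
                else:
                    break  # head is blocked, so everything behind it waits too
        finished.update(launched)  # each launched task completes after one step
        steps += 1
    return steps


single = makespan(num_streams=32, tasks_per_stream=4, num_queues=1)
multi = makespan(num_streams=32, tasks_per_stream=4, num_queues=32)
print(f"one queue: {single} steps, 32 queues: {multi} steps")
```

With 32 streams of four tasks each, the single queue needs 97 steps while 32 independent queues need 4, because a blocked task at the head of the lone queue holds up unrelated streams behind it. The gap grows toward 32x as streams get longer, which is consistent with the "up to" in Nvidia's claim; real workloads with uneven task lengths and limited execution resources would land well short of it.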

Finally, there's GPUDirect. GPUDirect is a technology that allows other devices to query the GPU in a node without waiting on the CPU to handle the transaction. According to Nvidia, this delivers higher aggregate bandwidth for cross-GPU data sharing and should lower overall latency as well.

Not Exactly Built For Gaming

When the K20 was announced, there was considerable speculation that Nvidia might launch an ultra-high-end consumer variant of this SKU. While that's still technically possible, it seems rather unlikely. The K20/K20X don't have video outputs, but even if they did, they don't offer the consumer market much that it doesn't have already. Here's how the K20 and K20X fit into Nvidia's current server/workstation lineup:

The K10 is the rough equivalent of a GTX 690, for those of you wanting to see the consumer side of the equation. Dual GPUs never scale as perfectly in real life as they do on paper, but the K10 is substantially more powerful than the K20/K20X by most metrics -- and the actual GTX 690 is more powerful than that. It's not clear that the K20X's features would automatically give gaming a major boost.

If Nvidia does launch a consumer variant, it'll likely tweak the speeds and feeds to enable higher clock speeds while maintaining a higher overall number of cores.

Ringside Seats To The Many-Core Grudge Match

Intel, Nvidia, and AMD are all launching products at SC12 this week. AMD is doing so relatively quietly, and has aimed its products at virtualization environments and data centers. Intel and Nvidia, in contrast, are slugging it out in a relatively public forum.

Both companies are fighting to position themselves as offering better compatibility, stronger real-world performance, and easier code optimization. Nvidia has the stronger baseline position -- its GPUs already power a significant number of the TOP500 supercomputers -- but Intel's software optimization resources and libraries are second to none.

The one thing all three companies agree on is that many-core architectures are the key to reaching exascale-level computers. Given this, we expect this fight to only get uglier from here on in.

This will be an interesting market to watch.

One would think that the more open platform would prevail, but Intel is the BORG and they will assimilate,......

In this case, Intel is driving the more open approach. They're backing the OpenMP proposal for vendor-neutral offload directives that will work on nVidia, AMD and Intel accelerators and coprocessors. OpenACC is designed specifically for nVidia architectures. Also, the Xeon Phi coprocessor runs open-source Linux and is programmed with standard languages including C++ and Fortran. CUDA is entirely proprietary.
