NVIDIA Pascal GPU Architecture Preview: Inside The GP100
NVIDIA Pascal GPU Architecture
Pascal is the follow-up to the Maxwell architecture, which powers NVIDIA’s current generation of graphics cards and mobile GPUs. The Pascal-based GPU at the heart of the Tesla P100 is codenamed GP100, and it promises to be a very different animal.
If NVIDIA’s past GPU naming convention holds true throughout the entire next generation, the GP100 will be the “big” version of Pascal, and presumably scaled-down iterations of the chip will power more mainstream consumer-class GPUs, at least initially. With Maxwell, the “big” GM200 didn’t appear in a consumer-targeted GPU until well after cards based on the GM204 and smaller Maxwell-based GPUs had already been on the market for quite some time. Let's look at previous-generation Tesla implementations for some perspective...
| | Tesla K40 | Tesla M40 | Tesla P100 |
| --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| FP64 GFLOPs | 1680 | 213 | 5304 |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28-nm | 28-nm | 16-nm FinFET |
Based on what we know so far, the GP100 is an absolute beast of a GPU. It offers roughly 3x the compute performance, 5x the GPU-to-GPU bandwidth, and 3x the memory bandwidth of NVIDIA’s previous-generation high-end products. The full complement of features and specifications revealed to date is represented in the table above.
The GP100 will be manufactured using TSMC’s 16nm FinFET process. The GPU comprises roughly 15.3 billion transistors and has a die size of 610mm². That’s about the same size as the Maxwell-based GM200, which comes in at about 601mm², but with nearly double the number of transistors – 15.3 billion vs. 8 billion. In addition to the advanced manufacturing process, NVIDIA's GP100 will also make use of HBM2 (second-generation High Bandwidth Memory), and leverage new technologies like NVLink, Unified Memory, and a new board / connector design.
In its full implementation, the GP100 features 60 streaming multiprocessors (SMs). As configured in the Tesla P100, however, only 56 of those SMs are enabled. The base clock of the GPU is an impressive 1328MHz, with a boost clock of 1480MHz, and a 300 watt TDP. Considering how young TSMC’s 16nm FinFET process is, seeing clocks this high on such a big chip bodes well for NVIDIA. As configured, and with those clocks, the GP100-powered Tesla P100 offers 5.3 teraflops (TFLOPs) of double-precision compute performance, 10.6 TFLOPs of single-precision compute, and 21.2 TFLOPs at half precision. We should also mention that Pascal supports atomic addition on double-precision values, which Maxwell does not.
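Those peak throughput figures fall straight out of the core counts, the boost clock, and the two floating-point operations per core per clock delivered by a fused multiply-add (FMA). A quick back-of-the-envelope check in Python, using only the numbers from the table above (the 2x FP16 rate per FP32 core is GP100-specific):

```python
# Peak FLOPs = cores x 2 ops/clock (fused multiply-add) x clock rate
boost_clock_ghz = 1.480
fp32_cores = 3584
fp64_cores = 1792  # 32 FP64 cores per SM vs. 64 FP32

fp32_tflops = fp32_cores * 2 * boost_clock_ghz / 1000
fp64_tflops = fp64_cores * 2 * boost_clock_ghz / 1000
fp16_tflops = fp32_tflops * 2  # GP100 runs FP16 at twice the FP32 rate

print(f"FP32: {fp32_tflops:.1f} TFLOPs")  # 10.6
print(f"FP64: {fp64_tflops:.1f} TFLOPs")  # 5.3
print(f"FP16: {fp16_tflops:.1f} TFLOPs")  # 21.2
```

Note that the FP64 rate is exactly half the FP32 rate, a deliberate design point for a compute-focused part; on the consumer-oriented GM200, by contrast, the FP64 ratio was only 1/32.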
Inside the GP100, those 56 active SMs house a total of 3584 FP32 cores, or 1792 FP64 cores. There are 64 FP32 / 32 FP64 cores per SM, and 224 total texture units. The GPU links to its 16GB of HBM2 memory via a 4096-bit interface, which offers up 720GB/s of peak bandwidth. There is 4MB of L2 cache on the chip, and a 256KB register file per SM, for a total of 14,336KB. That's double the registers of the previous generation, with 1.33x the shared memory capacity, and double the shared memory bandwidth. In other words, this thing is massive, but let's dive in deeper...
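The memory and register-file totals check out the same way. The sketch below derives the register-file total from the per-SM figure and infers the per-pin data rate implied by the 720GB/s number; that per-pin rate is our own arithmetic, not a published NVIDIA specification:

```python
# HBM2: a 4096-bit bus moves 512 bytes per transfer
bus_width_bits = 4096
peak_bandwidth_gbs = 720
bytes_per_transfer = bus_width_bits // 8           # 512 bytes
implied_rate_gtps = peak_bandwidth_gbs / bytes_per_transfer
print(f"Implied per-pin data rate: {implied_rate_gtps:.2f} GT/s")  # ~1.41

# Register file: 256 KB per SM across the 56 active SMs
total_regfile_kb = 256 * 56
print(f"Total register file: {total_regfile_kb} KB")  # 14336
```

The implied ~1.4 GT/s per pin is modest compared to GDDR5's 7 GT/s, which is exactly the HBM trade-off: a vastly wider bus run at a much lower signaling rate, saving power while multiplying aggregate bandwidth.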