NVIDIA Pascal GPU Architecture Preview: Inside The GP100

NVIDIA Pascal GPU Architecture

At last week’s GPU Technology Conference, NVIDIA CEO Jen-Hsun Huang unveiled a couple of key products and technologies that target the High-Performance Computing (HPC) space. They include the Tesla P100 data center accelerator and its companion DGX-1 deep-learning system, a powerful server with up to eight Tesla P100 cards at its core. Today, though, we have more information about the underlying architecture employed in the P100, otherwise known as NVIDIA’s Pascal GPU architecture.

Pascal is the follow-up to the Maxwell architecture, which is used in NVIDIA’s current generation of graphics cards and mobile GPUs. The Pascal-based GPU at the heart of the Tesla P100 is codenamed GP100, and it promises to be a very different animal.

NVIDIA Tesla P100, Featuring The GP100 GPU With HBM2

If NVIDIA’s past GPU naming convention holds true throughout the entire next generation, the GP100 will be the “big” version of Pascal, and presumably scaled-down iterations of the chip will power more mainstream consumer-class GPUs, at least initially. With Maxwell, the “big” GM200 didn’t appear in a consumer-targeted GPU until well after cards based on the GM204 and smaller Maxwell-based GPUs had already been on the market for quite some time. Let's look at previous-generation Tesla implementations for some perspective...

| Specification | Tesla K40 | Tesla M40 | Tesla P100 |
| --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| FP64 GFLOPs | 1680 | 213 | 5304 |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28-nm | 28-nm | 16-nm FinFET |

Based on what we know so far about the GP100, it is an absolute beast of a GPU. It’s got roughly 3x the compute performance, 5x the GPU-to-GPU bandwidth, and 3x the memory bandwidth of NVIDIA’s previous-generation high-end products. The full complement of features and specifications revealed to date is represented in the table above.

The GP100 will be manufactured using TSMC’s 16nm FinFET process. The GPU comprises roughly 15.3 billion transistors and has a die size measuring 610 mm². That’s about the same size as the Maxwell-based GM200, which comes in at about 601 mm², but with nearly double the number of transistors: 15.3 billion vs. 8 billion. In addition to the advanced manufacturing process, NVIDIA's GP100 will also make use of HBM2 (second-generation High Bandwidth Memory) and leverage new technologies like NVLink, Unified Memory, and a new board / connector design.
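
To put the "nearly double the transistors in about the same area" comparison into numbers, here is a minimal Python sketch. It uses only the transistor counts and die sizes quoted above. The helper name `density_mtr_per_mm2` is made up for illustration, and the result is a simple chip-wide average, not a statement about how densely any particular block is laid out.

```python
# Transistor counts and die areas as quoted in the article.
chips = {
    "GM200 (28 nm)": {"transistors": 8.0e9, "die_mm2": 601},
    "GP100 (16 nm FinFET)": {"transistors": 15.3e9, "die_mm2": 610},
}


def density_mtr_per_mm2(transistors, die_mm2):
    """Average density in millions of transistors per square millimetre."""
    return transistors / die_mm2 / 1e6


# Compute the average density for each chip.
densities = {name: density_mtr_per_mm2(**spec) for name, spec in chips.items()}
for name, value in densities.items():
    print(f"{name}: {value:.1f} M transistors/mm^2")

# Ratio of GP100's average density to GM200's.
ratio = densities["GP100 (16 nm FinFET)"] / densities["GM200 (28 nm)"]
print(f"GP100 / GM200 density ratio: {ratio:.2f}x")
```

By this rough measure the GP100 packs a little under twice as many transistors into each square millimetre as the GM200, which is where the similar die size with nearly double the transistor count comes from.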

NVIDIA GP100 GPU Block Diagram

In its full implementation, the GP100 features 60 streaming multiprocessors (SMs). As configured in the Tesla P100, however, only 56 of those SMs are enabled. The base clock of the GPU is an impressive 1328MHz, with a boost clock of 1480MHz and a 300 watt TDP. Considering how young TSMC’s 16nm FinFET process is, seeing clocks this high on such a big chip bodes well for NVIDIA. As configured, and with those clocks, the GP100-powered Tesla P100 offers 5.3 teraflops (TFLOPs) of double-precision compute performance, 10.6 TFLOPs of single-precision compute, and 21.2 TFLOPs at half precision. We should also mention that atomic addition is available at double precision with Pascal, while with Maxwell it is not.
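
The throughput figures can be reproduced from the core counts and boost clocks in the comparison table. The sketch below assumes the usual peak-rate convention of one fused multiply-add (counted as 2 floating-point operations) per core per clock at the boost clock. It also assumes half precision runs at twice the FP32 rate, which is inferred from the 10.6 vs. 21.2 TFLOPs figures rather than stated in the article. The function name `peak_gflops` is illustrative only.

```python
def peak_gflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Peak GFLOPs, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_core_per_clock * boost_mhz / 1000.0


# (FP64 cores per GPU, boost clock in MHz) taken from the comparison table.
# For the Tesla K40, the higher of the two listed boost clocks is used.
fp64_specs = {
    "Tesla K40": (960, 875),
    "Tesla M40": (96, 1114),
    "Tesla P100": (1792, 1480),
}

fp64_gflops = {
    name: peak_gflops(cores, mhz) for name, (cores, mhz) in fp64_specs.items()
}
for name, gflops in fp64_gflops.items():
    print(f"{name}: {gflops:.1f} FP64 GFLOPs")

# Tesla P100 single precision: 3584 FP32 cores at the 1480 MHz boost clock.
p100_fp32 = peak_gflops(3584, 1480)
# Assumption: FP16 throughput is twice the FP32 rate.
p100_fp16 = 2 * p100_fp32
print(f"Tesla P100 FP32: {p100_fp32 / 1000:.1f} TFLOPs")
print(f"Tesla P100 FP16: {p100_fp16 / 1000:.1f} TFLOPs")
```

Under these assumptions the results line up with the table's FP64 GFLOPs row (1680, 213, and 5304 once fractions are dropped) and with the 5.3 / 10.6 / 21.2 TFLOPs figures quoted for the Tesla P100.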

NVIDIA Pascal SM Configuration In The GP100

Inside the GP100, those 56 active SMs house a total of 3584 FP32 cores and 1792 FP64 cores. There are 64 FP32 / 32 FP64 cores per SM, and 224 total texture units. The GPU links to its 16GB of HBM2 memory via a 4096-bit interface, which offers up to 720GB/s of peak bandwidth. There is 4MB of L2 cache on the chip, and a 256 KB register file per SM, for a total of 14,336 KB. Because each SM keeps the same 256 KB register file as Maxwell but has half as many FP32 cores, that works out to double the registers per CUDA core of the previous generation, with 1.33x the shared memory capacity and double the shared memory bandwidth. In other words, this thing is massive, but let's dive in deeper...
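
The per-GPU totals in this paragraph follow directly from the per-SM figures, and the memory numbers imply a per-pin data rate. Here is a small sketch of that arithmetic. All variable names are illustrative, and the per-pin rate is derived from the quoted 720GB/s and 4096-bit figures, not a specification stated in the article.

```python
ACTIVE_SMS = 56  # SMs enabled on the Tesla P100 (the full GP100 has 60)

# Per-SM resources quoted in the article.
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32
REGISTER_FILE_KB_PER_SM = 256

# Scale the per-SM figures up to the whole GPU.
fp32_cores = ACTIVE_SMS * FP32_CORES_PER_SM
fp64_cores = ACTIVE_SMS * FP64_CORES_PER_SM
register_file_kb = ACTIVE_SMS * REGISTER_FILE_KB_PER_SM
texture_units_per_sm = 224 / ACTIVE_SMS

print(f"FP32 cores: {fp32_cores}, FP64 cores: {fp64_cores}")
print(f"Register file: {register_file_kb} KB total")
print(f"Texture units per SM: {texture_units_per_sm:g}")

# Memory: 720 GB/s of peak bandwidth over a 4096-bit interface.
BUS_WIDTH_BITS = 4096
PEAK_BANDWIDTH_GB_S = 720

# Derived value: data rate each pin must sustain to reach the peak figure.
per_pin_gbit_s = PEAK_BANDWIDTH_GB_S * 8 / BUS_WIDTH_BITS
print(f"Implied per-pin data rate: {per_pin_gbit_s:.2f} Gbit/s")
```

The very wide interface is what lets HBM2 reach this bandwidth at a modest per-pin rate of roughly 1.4 Gbit/s, compared with the much narrower 384-bit GDDR5 interfaces on the earlier Tesla cards.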
