NVIDIA GeForce RTX Explored: What You Need To Know About Turing
NVIDIA GeForce RTX Turing Architecture - The GPUs, Speed, And Feeds
Many other deeper technical details regarding the Turing GPU microarchitecture were kept closer to the vest and revealed only to a smaller group of attendees at the event. Today, however, we're able to disclose some of those technical details and features, and we'll lay them out for you on the pages ahead. Before we dig in, we were also lucky enough to have NVIDIA's own Tom "TAP" Petersen on a recent podcast to discuss Turing and the GeForce RTX series as a whole. During our chat, a number of interesting tidbits were revealed. If you're so inclined, we strongly suggest checking that out as well...
The initial GeForce RTX 2000 series graphics card line-up is comprised of the GeForce RTX 2070, RTX 2080, and RTX 2080 Ti, with higher-clocked Founder's Editions of each card coming as well. All of the cards are based on NVIDIA's Turing microarchitecture and offer a similar feature set, but each is powered by a different Turing GPU variant. The high-end GeForce RTX 2080 Ti is built around the TU102 GPU, the RTX 2080 is built on the TU104, and the RTX 2070 on the TU106.
The block diagram above is a representation of the full TU102 configuration. The TU102 is comprised of approximately 18.6B transistors, and when fully enabled it features 72 Streaming Multiprocessors (SMs), 4,608 CUDA cores, 576 Tensor cores, 72 RT (Ray Tracing) cores, 36 geometry units (TPCs), 288 texture units, 96 ROPs, a 384-bit (12-channel) memory interface, and dual NVLink channels. Note, however, that the flagship GeForce RTX 2080 Ti does not feature the full TU102 enablement.
NVIDIA's new Quadro RTX 6000 pro graphics card is powered by a fully enabled TU102, but the GeForce RTX 2080 Ti has two TPCs (four SMs), 256 CUDA cores, four RT cores, eight ROPs, 16 texture units, and one memory channel disabled (which works out to 4,352 CUDA cores, 544 Tensor cores, and 68 RT cores in the RTX 2080 Ti). Exact core count, memory, and clock configurations for all of the GeForce RTX 2000 series cards based on Turing, the Quadro RTX 6000, and their Pascal-based counterparts are represented in the table above. Definitely click that image and spend some time looking through the specs, because there is a ton of data to digest, including some new terms you may not have heard before.
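If you want to sanity-check how those cut-down figures fall out of the SM count, the quick back-of-the-envelope snippet below does the math. It's purely our own illustration (not anything from NVIDIA's tools), and it assumes Turing's published per-SM ratios of 64 CUDA cores, 8 Tensor cores, and 1 RT core per SM, with two SMs per TPC.

```cpp
// Back-of-the-envelope check of the RTX 2080 Ti cut-down, assuming Turing's
// per-SM ratios: 64 CUDA cores, 8 Tensor cores, 1 RT core per SM; 2 SMs per TPC.
#include <cstdio>

int main() {
    const int full_sms      = 72;  // fully enabled TU102
    const int disabled_tpcs = 2;   // RTX 2080 Ti disables two TPCs
    const int sms = full_sms - disabled_tpcs * 2;   // 68 SMs remain

    printf("CUDA cores  : %d\n", sms * 64);  // 4,352
    printf("Tensor cores: %d\n", sms * 8);   // 544
    printf("RT cores    : %d\n", sms * 1);   // 68
    return 0;
}
```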
As you’ll note, the GeForce RTX 2080 and RTX 2070 have the same feature-set as their big brother, but their core counts and memory configurations are further reduced (and their block diagrams just look a bit smaller). The transistor counts for the TU104 and TU106 are commensurately reduced as well. What is worth noting, however, is just how much bigger the Turing GPUs actually are versus Pascal. Despite being manufactured on a denser, more advanced 12nm FinFET process, all of the Turing-based GPUs not only have much higher transistor counts than their predecessors, they are much bigger chips too.
The significantly larger die sizes on the Turing-based GeForce RTX GPUs -- despite being manufactured on a more advanced process -- are mostly due to the additional technologies NVIDIA incorporated into the chips. With the GeForce RTX series, NVIDIA wanted to make the cards perform well with the traditional shading and rasterization methods used in all of today’s (and yesterday’s) games, but also wanted to lay the foundation for the AI, Deep Learning, and Ray Tracing-enabled games and applications it hopes are the future, hence the addition of RT and Tensor cores into the mix. All of those additional cores equal additional transistor count and a larger die size, plain and simple.
With additional CUDA cores – that are also more efficient and have some new capabilities – significantly more memory bandwidth, and increased texturing performance, Turing-based GeForce RTX cards should also offer more performance with existing titles, while also supporting the new technologies enabled by the RT and Tensor cores, and NVIDIA’s related software framework.
Over and above the new processors in Turing, NVIDIA has also incorporated significant optimizations that improve utilization, performance, and efficiency of the shaders and other units in the GPU. For example, the math pipeline in Turing has been revamped and can now issue integer and floating point instructions concurrently. With some workloads, NVIDIA claims this tweak alone can boost performance by approximately 36%.
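To put that change in concrete terms, here is a minimal, hypothetical CUDA kernel (our own illustration, not an NVIDIA benchmark) showing the kind of shader-style work that mixes the two instruction types: the index and address math runs on the integer pipeline while the multiply-add runs on the floating point pipeline, and on Turing the two streams can be issued concurrently rather than taking turns.

```cpp
// Illustrative CUDA kernel: independent integer (indexing) and floating point
// (multiply-add) work, the mix Turing's separate INT and FP datapaths target.
__global__ void scale_gather(const float* __restrict__ src,
                             const int*   __restrict__ indices,
                             float*       __restrict__ dst,
                             int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // integer pipeline work
    if (i < n) {
        int src_idx = indices[i] % n;                // more integer arithmetic
        dst[i] = src[src_idx] * scale + 1.0f;        // floating point pipeline work
    }
}
```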
NVIDIA’s Turing-based GPUs also have double the L2 cache of their predecessors, and the L1 cache has been outfitted with a wider bus that ultimately doubles its bandwidth. There is also more total L1 cache and shared memory, and the configuration has been changed to be more symmetrical. Altogether, the changes in Turing can result in up to 50% faster shading within the CUDA cores, but the GPUs also have those Tensor and Ray Tracing cores at their disposal.
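For developers who want to lean on that larger, unified L1/shared memory block, CUDA already exposes a per-kernel hint for how it should be split. The sketch below is our own example with a placeholder kernel; it uses the standard cudaFuncAttributePreferredSharedMemoryCarveout attribute to request a larger shared memory slice, while the exact split remains up to the driver and hardware.

```cpp
// Sketch: hinting at the L1 / shared-memory split on GPUs that carve both
// out of one unified block. The kernel is just a placeholder that stages
// data through shared memory.
#include <cuda_runtime.h>

__global__ void tile_copy(const float* in, float* out, int n)
{
    __shared__ float tile[256];                      // lives in the shared-memory carveout
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }
}

void configure()
{
    // Ask the driver to prefer a larger shared-memory slice for this kernel;
    // the value is a percentage of the unified L1/shared block.
    cudaFuncSetAttribute(tile_copy,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 64);
}
```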
Developers will have to explicitly leverage those new cores, of course, but they add some significant capabilities should they be put to use. The Tensor cores, which are ideally suited for deep learning workloads like image recognition and inferencing, offer up to 110 TFLOPS of compute performance with FP16 workloads, or roughly 228 and 455 TOPS with INT8 and INT4 workloads, respectively, in the TU102 at least; those numbers are obviously lower in the smaller TU104 and TU106. The RT cores can offer up to “10 Giga Rays/sec” in the TU102, which is a somewhat nebulous performance metric on its own at this point in time, but consider this: a GeForce GTX 1080 Ti offers up to 11.3 TFLOPS of compute performance and can handle about 1.1 Giga Rays/sec, or roughly 10 TFLOPS per Gigaray. In short, the GeForce RTX 2080 Ti is roughly 10x faster than a GeForce GTX 1080 Ti with the same ray tracing workload.
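For a sense of how the Tensor cores are actually programmed outside of game-ready frameworks, the sketch below uses CUDA's warp-level matrix (WMMA) API, where each warp performs a 16x16x16 half-precision multiply-accumulate on the Tensor cores. It's a minimal illustration of the interface (not a tuned kernel), and it assumes you're compiling for a Tensor-core-capable GPU target.

```cpp
// Minimal WMMA sketch: one warp computes a 16x16x16 FP16 multiply-accumulate
// on the Tensor cores. A real kernel would loop over K and tile the output.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* a, const half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);            // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // executes on the Tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```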
We should also mention that all of the processing engines inside Turing -- the shaders, RT cores, and Tensor cores -- can be utilized concurrently, though the dispatch unit can only feed two of the three simultaneously. Since the Tensor cores are typically used for specialized workloads, and are leveraged at a different stage of the rendering process, not being able to feed all three simultaneously shouldn't be an issue for developers.
To make sure Turing has fast access to lots of data, NVIDIA has also incorporated a bleeding-edge GDDR6 memory controller into the GPUs. The per-pin bandwidth of the GDDR6 memory employed on the initial GeForce RTX 2000 series cards tops out at an effective 14Gb/s (7GHz). To achieve that data rate, NVIDIA had to optimize the I/O circuit architecture and pay special attention to the channel between the GPU and the individual memory dies to ensure the cleanest possible signaling (which GDDR6 inherently helps out with as well). In addition to the high-speed GDDR6, Turing also features more advanced memory compression technology versus Pascal. So, not only do the GeForce RTX series cards offer more bandwidth, that bandwidth is also utilized more efficiently.
The 256-bit memory bus width of the GeForce RTX 2070 and 2080, coupled with that speedy 14Gb/s effective data rate, results in 448GB/s of peak available bandwidth, which is much higher than the 256GB/s (+75%) and 320GB/s (+39%) of NVIDIA's previous-gen GeForce GTX 1070 and GTX 1080. The wider 352-bit bus on the flagship GeForce RTX 2080 Ti results in peak memory bandwidth of 616GB/s, a 27% increase over the 484GB/s of the GeForce GTX 1080 Ti.
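Those peak bandwidth figures fall straight out of bus width multiplied by per-pin data rate. The small snippet below simply runs that arithmetic as a sanity check; the cards and data rates are the ones quoted above (14Gb/s GDDR6 for the RTX parts, 11Gb/s GDDR5X for the GTX 1080 Ti).

```cpp
// Peak memory bandwidth = (bus width in bits / 8) * effective data rate per pin.
#include <cstdio>

double peak_gbps(int bus_width_bits, double gbits_per_pin) {
    return bus_width_bits / 8.0 * gbits_per_pin;   // GB/s
}

int main() {
    printf("RTX 2070/2080 : %.0f GB/s\n", peak_gbps(256, 14.0));  // 448 GB/s
    printf("RTX 2080 Ti   : %.0f GB/s\n", peak_gbps(352, 14.0));  // 616 GB/s
    printf("GTX 1080 Ti   : %.0f GB/s\n", peak_gbps(352, 11.0));  // 484 GB/s
    return 0;
}
```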
Let's take a closer look at NVIDIA's new Founder's Edition GeForce RTX cards, next...