NVIDIA GeForce RTX 40 Architecture Overview: Ada's Special Sauce Unveiled

A Deep Dive Into The NVIDIA Ada Lovelace GPU Architecture Powering The RTX 40 Series

During the NVIDIA GTC 2022 keynote earlier this week, CEO Jensen Huang announced a slew of new products, technologies, and services, targeting everything from enthusiast gaming PCs to autonomous cars, data centers and the metaverse. We covered the announcements as they happened. If you missed all of the excitement as it unfolded, you can find all of the GeForce RTX 40 series details and everything Omniverse, Auto, and Robotics related on our news page.

Shortly after the keynote, we had the chance to chat with NVIDIA and dive a little deeper into what makes the upcoming GeForce RTX 40 series tick, and we have detailed much of that information for you here. As regular readers will know, the new GeForce RTX 40 series features NVIDIA’s Ada Lovelace GPU architecture, the follow-on to the Ampere architecture used in the company's RTX 30 series. Ada enables a number of new capabilities, along with higher levels of performance and image fidelity.

NVIDIA's GeForce RTX 40 Series: Coming Soon

The “Ada” architecture, as many of NVIDIA’s execs like to call it, features updated Streaming Multiprocessors (SMs), new RT Cores with double the ray-triangle intersection throughput of Ampere, and a new Tensor Core design featuring the Hopper FP8 Transformer Engine that offers up to 1.4 petaflops of Tensor processing power.

In addition to boosting performance versus previous-gen offerings, new capabilities inside the Ada architecture and the updated GeForce RTX 40 series allowed NVIDIA to introduce a few new tricks, including DLSS 3. Along with all of the features of DLSS 2 (which are continually being refined by NVIDIA), DLSS 3 introduces new AI Frame Generation that, according to NVIDIA, can increase framerates by up to 4 to 5x in modern game titles optimized to use all of NVIDIA’s new tech.

Other features inside Ada include a new Optical Flow Accelerator and a new 8th gen media engine, which incorporates dual AV1 encoders.

The AD104 (RTX 4080 12GB), AD103 (RTX 4080 16GB), and AD102 (RTX 4090) GPUs at the heart of the upcoming GeForce RTX 40 series are all manufactured on TSMC’s custom 4N process, which provides a number of benefits over the Samsung 8nm process employed for the RTX 30 series. Most importantly, it offers higher transistor density (which translates to smaller die area per transistor) and lower voltage requirements, and in turn, increased efficiency.

Thanks in part to the new architecture and manufacturing process, NVIDIA has packed much more into the RTX 40 series GPUs. Versus Ampere, the biggest Ada GPU has more GPCs, more TPCs, more SMs, and many more CUDA, RT, and Tensor cores. And all of those cores have been enhanced or updated in some way to boost performance. RTX 40 series GPUs are also capable of hitting much higher frequencies, despite a massive increase in transistor counts. The big GA102 GPU used in cards like the RTX 3090 Ti packs just over 28B transistors, whereas the top-end AD102 used in the RTX 4090 comprises approximately 76B, or roughly 2.7x the total found in the biggest Ampere chip.

All of those additional transistors are used to enable some interesting new features, like Shader Execution Reordering (SER), Displaced Micro-Meshes, Opacity Micro-Masks, and the aforementioned FP8 inferencing, Optical Flow Accelerator and DLSS 3.

New Rendering Tech On Board Ada Promises Better Image Quality And Performance

Shader Execution Reordering In Action

Shader Execution Reordering (SER) is essentially a new stage in the ray tracing pipeline. In current architectures, shader programs are typically executed on every pixel hit by a ray, in the order that they are dispatched through the SMs. The SER capability in Ada, however, is able to rapidly scan through all of the shader programs and strategically re-order them, so that pixels running the same program are grouped together.

This improves execution efficiency, and ultimately improves performance as well. NVIDIA is claiming up to a 2x performance improvement while ray tracing with SER. NVIDIA’s API tools also give developers control of this feature, so they can best optimize their games.
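To make the grouping idea a little more concrete, here is a minimal Python sketch of why reordering helps. It is only a toy model built on our own assumptions, not NVIDIA's hardware scheduler or API: we assume a lockstep group of 32 threads (a warp) has to make one pass for every distinct shader program it contains, and the function names and the 128-hit workload are hypothetical.

```python
WARP_SIZE = 32  # assumption: threads that execute in lockstep (CUDA warp width)


def count_shader_passes(shader_ids, warp_size=WARP_SIZE):
    """Count serialized passes: each warp runs once per distinct shader it holds."""
    passes = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = shader_ids[start:start + warp_size]
        passes += len(set(warp))
    return passes


def reorder_by_shader(shader_ids):
    """Return hit indices sorted so hits that invoke the same shader are adjacent."""
    return sorted(range(len(shader_ids)), key=lambda i: shader_ids[i])


# Hypothetical workload: 128 ray hits that cycle through 4 different hit shaders.
hits = [i % 4 for i in range(128)]
order = reorder_by_shader(hits)
reordered = [hits[i] for i in order]

passes_before = count_shader_passes(hits)       # every warp mixes all 4 shaders
passes_after = count_shader_passes(reordered)   # every warp holds a single shader
print(passes_before, passes_after)
```

In this contrived case the unsorted hits force every warp to run all four shaders, while the sorted hits let each warp run just one. Real workloads will not sort this cleanly, which is one reason NVIDIA quotes an "up to" figure.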

Displaced Micro-Meshes Reduce The Data Required To Build A BVH

Historically, GPUs and game engines have used features like tessellation and other tricks to efficiently create geometry in a scene. When ray tracing, current-gen architectures need all of that geometry data in the Bounding Volume Hierarchy (BVH) to accurately calculate the direction of light rays bouncing around the scene, which requires large amounts of compute. Displaced Micro-Meshes, however, leverage new technology in the RT cores to quickly evaluate meshes, so the Ada architecture is able to reference just the base triangle data and evaluate how the light rays will be displaced in the higher-detail model. In NVIDIA's example slide, the chunky base structure in the crab is all that is required to determine the high-detail output. Displaced Micro-Meshes reduce the amount of data required to build the BVH, and allow for greater data compression as well, which means the BVH can ultimately be built faster and less data has to be moved and manipulated in the GPU.
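For a rough feel of where the data savings could come from, the back-of-the-envelope Python sketch below compares storing every vertex of a finely subdivided triangle against storing one base triangle plus a single displacement value per micro-vertex. The byte sizes are our own illustrative assumptions, not NVIDIA's actual format, and the sketch ignores other data a real encoding would need, such as displacement directions.

```python
BYTES_PER_POSITION = 12     # assumption: three 32-bit floats per stored vertex
BYTES_PER_DISPLACEMENT = 2  # assumption: one 16-bit scalar per micro-vertex


def micro_triangle_count(levels):
    """Each subdivision level splits every triangle into four."""
    return 4 ** levels


def micro_vertex_count(levels):
    """Vertices in a triangle whose edges are each split into 2**levels segments."""
    segments = 2 ** levels
    return (segments + 1) * (segments + 2) // 2


def full_mesh_bytes(levels):
    """Store an explicit position for every micro-vertex."""
    return micro_vertex_count(levels) * BYTES_PER_POSITION


def displaced_bytes(levels):
    """Store three base vertices plus one displacement scalar per micro-vertex."""
    return 3 * BYTES_PER_POSITION + micro_vertex_count(levels) * BYTES_PER_DISPLACEMENT


for levels in (1, 3, 5):
    print(levels, micro_triangle_count(levels),
          full_mesh_bytes(levels), displaced_bytes(levels))
```

Under these assumptions the gap widens as the subdivision level rises, which fits the general claim that referencing the base triangle means less data to build, move, and manipulate.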

Displaced Micro-Meshes are designed to work for fully ray traced and hybrid games, and Simplygon and Adobe are integrating the technology into their toolchains, so game developers can incorporate Displaced Micro-Meshes into their typical development processes moving forward.

Opacity Micro-Masks

Opacity Micro-Masks is a new technology designed to reduce the amount of shader work required to generate a scene. Game developers often use smart tricks to minimize the amount of geometry required to create realistic-looking environments and models. For example, a simple rectangular polygon may have a texture that mixes opaque and transparent elements to simulate a higher-detail model and bring more realism to a scene. Consider the foliage on a tree. Rendering each leaf in fine geometric detail would be expensive, but a simple rectangle with transparent and opaque sections that represent a leaf requires much less horsepower to process.

Previous-gen RT cores have not been able to intelligently handle a situation like this. The RT cores would hand off shader work to the SMs to check the whole rectangle for opacity or translucence, and then the SM had to send that information back to the RT core to decide how to trace a ray. Smoke sprites are another example where much more work must be done on existing architectures to determine the opacity or translucence of the image.

Opacity Micro-Masks In Valve Portal RTX Example

In this example slide, RT complexity is increased as more smoke sprites are layered in the scene – red represents more texture ray tests needing to be done. To existing architectures, the geometry looks dense in terms of polygons, but visually it is not. The Ada architecture in the RTX 40 series can compute the rays much more simply and efficiently in situations like this, to increase overall performance.
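The Python sketch below illustrates the general idea of pre-classifying regions of an alpha-tested surface so that most ray hits can be resolved without calling a shader. It is a simplification under our own assumptions: it uses square texel blocks rather than micro-triangles, the three states (opaque, transparent, unknown) reflect our understanding of the technique rather than anything stated in NVIDIA's briefing, and all of the names are hypothetical.

```python
OPAQUE, TRANSPARENT, UNKNOWN = "opaque", "transparent", "unknown"


def build_mask(alpha, cell):
    """Classify each cell x cell block of an alpha texture ahead of time."""
    mask = []
    for y in range(0, len(alpha), cell):
        row = []
        for x in range(0, len(alpha[0]), cell):
            block = [alpha[y + dy][x + dx]
                     for dy in range(cell) for dx in range(cell)]
            if all(a == 1.0 for a in block):
                row.append(OPAQUE)
            elif all(a == 0.0 for a in block):
                row.append(TRANSPARENT)
            else:
                row.append(UNKNOWN)  # mixed block: still needs a per-hit test
        mask.append(row)
    return mask


def trace_hits(alpha, hits, mask=None, cell=2):
    """Resolve ray hits on the quad; return (hit results, alpha-shader calls)."""
    results, shader_calls = [], 0
    for y, x in hits:
        state = mask[y // cell][x // cell] if mask is not None else UNKNOWN
        if state == UNKNOWN:
            shader_calls += 1                 # fall back to the alpha test
            results.append(alpha[y][x] >= 0.5)
        else:
            results.append(state == OPAQUE)   # resolved from the mask alone
    return results, shader_calls


# Hypothetical 4x4 alpha texture: mostly solid, with one mixed corner block.
alpha = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
hits = [(y, x) for y in range(4) for x in range(4)]

plain_results, plain_calls = trace_hits(alpha, hits)
masked_results, masked_calls = trace_hits(alpha, hits, build_mask(alpha, 2))
print(plain_calls, masked_calls)
```

Both paths produce the same hit results here, but the masked path only needs the alpha test for hits that land in the one mixed block.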

We should point out, however, that SER, Displaced Micro-Meshes, and Opacity Micro-Masks are all extensions of DXR, and NVIDIA is working with Microsoft on integration of the technologies.

Ada Powered DLSS 3 With AI Frame Generation

A new version of DLSS (Deep Learning Super Sampling), known as DLSS 3, is also inbound alongside the RTX 40 series. The GeForce RTX 40 series based on the Ada architecture supports all of the same DLSS 2 features and tech as previous-gen architectures but ups the ante, thanks to much higher performing Tensor cores and a much faster Optical Flow Accelerator that enables high-fidelity AI frame generation.

The Optical Flow engine in Ada, i.e. the Optical Flow Accelerator, is approximately 2 to 2.5x faster than the previous generation's. Optical flow is essentially a search function that helps determine how pixels in one frame correspond to the next, and that data is used to calculate motion flow.
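To illustrate what "a search function" means here, the Python sketch below uses classic block matching: it takes a small patch from one frame and looks for the best-matching position in the next. This is a textbook technique we are substituting for illustration, not a description of how NVIDIA's Optical Flow Accelerator works internally, and the frame data and function names are made up.

```python
def sad(frame_a, frame_b, ay, ax, by, bx, size):
    """Sum of absolute differences between two size x size patches."""
    return sum(
        abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
        for dy in range(size)
        for dx in range(size)
    )


def find_motion(frame_a, frame_b, y, x, size=2, radius=2):
    """Search frame_b around (y, x) for the best match to the patch in frame_a.

    Returns the (dy, dx) displacement with the lowest patch difference.
    """
    height, width = len(frame_b), len(frame_b[0])
    best = (0, 0)
    best_cost = sad(frame_a, frame_b, y, x, y, x, size)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            by, bx = y + dy, x + dx
            if 0 <= by <= height - size and 0 <= bx <= width - size:
                cost = sad(frame_a, frame_b, y, x, by, bx, size)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    return best


def make_frame(top, left, height=6, width=6):
    """Build a dark frame with a bright 2x2 square at (top, left)."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(2):
        for dx in range(2):
            frame[top + dy][left + dx] = 255
    return frame


frame_a = make_frame(1, 1)
frame_b = make_frame(2, 3)  # the square moved down 1 row and right 2 columns
motion = find_motion(frame_a, frame_b, 1, 1)
print(motion)
```

Repeating a search like this across the whole image yields a field of per-region motion vectors, which is the kind of data an optical flow stage provides.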

In NVIDIA's examples, the Optical Flow Accelerator essentially “finds the arrows” to define how each pixel is moving. However, these motion vectors can’t be used on their own to help generate full frames without introducing artifacts. Geometric motion vectors alone don’t help much in determining how rays are moving, without also differentiating how objects are moving and how entire objects actually appear in the game world. For example, shadows usually don’t move in relation to the camera angle. In a racing game, you’ll often see the road and stands whipping by while the car and its shadow remain mostly static. So you have engine motion vectors and optical flow vectors that provide different data. That data is then fed into an AI model that decides how to generate a “new” frame with DLSS 3.

In these examples, the first and third frames from Cyberpunk are entirely AI generated, and the second is traditionally rendered. In the Racers RTX example it’s reversed: the outer two frames are rendered, but the middle frame is AI generated.

Along with this new capability, NVIDIA’s AI models have continually been trained and improved to better optimize the existing features in DLSS. If you factor in resolution scaling and AI Frame Generation, there are scenarios where 7 of 8 pixels on-screen are actually AI generated. That’s wild to think about; and if frames are inserted by AI and not actually rendered, is it fair or accurate to say performance has actually been increased? We’re still wrapping our heads around all of this and need to have some brain-melting discussions, but it’s something to ponder...
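The 7-of-8 figure is easy to sanity check with a few lines of Python. The two inputs below are our assumptions about the scenario being described: resolution scaling that renders a 1920x1080 image for a 3840x2160 output, and Frame Generation inserting one AI-generated frame for every rendered frame.

```python
from fractions import Fraction

# Assumption: the game renders 1920x1080 and DLSS upscales to 3840x2160.
render_pixels = 1920 * 1080
output_pixels = 3840 * 2160
rendered_share_per_frame = Fraction(render_pixels, output_pixels)

# Assumption: every other displayed frame is generated entirely by AI.
rendered_frame_share = Fraction(1, 2)

rendered_share = rendered_share_per_frame * rendered_frame_share
ai_generated_share = 1 - rendered_share
print(ai_generated_share)  # 7/8
```

A quarter of the pixels in half of the frames works out to one traditionally rendered pixel in eight, with the rest produced by upscaling or frame generation.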

NVIDIA expects DLSS 3-enabled games to start appearing in October, with around 35 titles already in the pipeline.

New NVIDIA Ada Lovelace GPUs Mean New Graphics Cards

As we mentioned in our initial coverage, there are three GeForce RTX 40 series cards due to arrive first: the GeForce RTX 4090, and 12GB and 16GB variants of the GeForce RTX 4080. Take note, however, that those two 4080 cards are actually quite different and are built around different GPUs. In addition to having less memory, the GeForce RTX 4080 12GB has fewer cores and a narrower memory interface. 12GB GeForce RTX 4080s will also be designed exclusively by partners. There will be Founders Edition versions of the RTX 4090 and 16GB RTX 4080, but no NVIDIA-built GeForce RTX 4080 12GB card.

We should also mention that GeForce RTX 40 series cards are still native PCIe 4.0; they are not PCIe 5.0. However, current cards are still not saturating the PCIe 4.0 interface, so the additional bandwidth that would have been afforded by PCIe 5.0 probably wouldn’t help much, if at all. NVIDIA's RTX 40 series also sticks with DisplayPort 1.4a. According to NVIDIA, the company was too far along in the development cycle to incorporate DP 2.0 by the time the standard was finalized.

Something new on the cards that’s been widely leaked and reported is a PCIe Gen 5 power connector. This single connection is able to scale up to 600 watts for power delivery, and shrinks the footprint of the connector versus existing PCIe power connectors. In addition to being able to feed the cards vast amounts of power, NVIDIA has also optimized their VRM for better performance, and the cards’ cooling solutions have been redesigned to better manage the heat that comes with using all that power.

Like the RTX 30 series, the upcoming GeForce RTX 40 series will also feature compact PCBs. The PCBs, however, can be outfitted with up to 23 power phases (20 for the GPU, 3 for the memory). And full PID transient control is supported, for what NVIDIA claims is a 10x increase in power management response time.

GeForce RTX 40 series cards will respond to workload demands much faster than previous-gen cards, and overall power delivery should be more reliable and consistent. In an example given by NVIDIA, showing how the RTX 3090 Ti and RTX 4090 respond to a sudden increase in workload, the 3090 Ti hit a higher peak power and its current fluctuated considerably, whereas the RTX 4090’s did not.

GeForce RTX 40 Series Founders Edition Card Design

Although the cooling solutions on the GeForce RTX 40 series look similar to those on RTX 30 series cards, NVIDIA tells us those too have been redesigned. RTX 40 series cards have higher-performing fans, new vapor chamber designs, and a new heatsink layout. The new design results in 20% more airflow and pulls heat away from the GPU more efficiently, without generating more noise. By using higher-density, more power-efficient memory (built by Micron on a more advanced process), NVIDIA was also able to situate all of the RAM on one side of the PCB. The new cooler design also makes better contact with the memory. All told, memory temperatures should remain roughly 10°C cooler than on previous-gen cards.

All told, there’s a lot of new technology at play with the upcoming GeForce RTX 40 series. GeForce RTX 4090 cards are due to arrive on October 12, with GeForce RTX 4080 cards following shortly thereafter in November.

We hope to test them all soon enough, so stay tuned to HH for more GeForce RTX 40 series coverage in the days ahead. This should be fun to watch unfold. 
