NVIDIA GF100 Architecture and Feature Preview
Date: Jan 17, 2010
Author: Marco Chiappetta
NVIDIA GF 100 Architecture

Back in late September of last year, NVIDIA disclosed some information regarding its next generation GPU architecture, codenamed "Fermi". At the time, actual product names and detailed specifications were not disclosed, nor was performance in 3D games, but high-level information about the architecture, its strong focus on compute performance, and broader compatibility with computational applications were discussed.

We covered much of the early information regarding Fermi in this article. Just to recap some of the more pertinent details found there, the GPU codenamed Fermi will feature over 3 billion transistors and be produced using TSMC's 40nm process. If you remember, AMD's RV870, which is used in the ATI Radeon HD 5870, comprises roughly 2.15 billion transistors and is also manufactured at 40nm. Fermi will be outfitted with more than double the number of cores of the current GT200, 512 in total. It will also offer 8x the peak double-precision compute performance of its predecessor, and Fermi will be the first GPU architecture to support ECC. ECC support will allow Fermi to compensate for soft error rate (SER) issues and also potentially allow it to scale to higher densities, mitigating the issue in larger designs. The GPU will also be able to execute C++ code.

NVIDIA's Jen-Hsun Huang holds GF100's closest sibling, a Fermi-based Tesla card

During the GPU Technology Conference that took place in San Jose, NVIDIA's CEO Jen-Hsun Huang showed off the first Fermi-based Tesla-branded prototype boards, and talked at length about the compute performance of the architecture. Game performance wasn't a focus of Huang's speech, which led some to speculate that NVIDIA was forgetting about gamers with this generation of GPUs. That obviously is not the case. Fermi is going to be a powerful GPU after all. The simple fact of the matter is, NVIDIA is late with its next-gen GPU architecture and the company chose a different venue--the Consumer Electronics Show--to discuss Fermi's gaming-oriented features.

GF100 High-Level Block Diagram

For desktop-oriented parts, Fermi-based GPUs will from here on be referred to as GF100. As we've mentioned in previous articles, GF100 is a significant architectural change from previous GPU architectures. Initial information focused mostly on the compute side, but today we can finally discuss some of the more consumer-centric details that gamers will be most interested in.

At the Consumer Electronics Show, NVIDIA showed off a number of GF100 configurations, including single-card, 2-way, and 3-way SLI setups in demo systems. Those demos, however, used pre-production boards that were not indicative of retail product. Due to this fact, and also because the company is obviously still working feverishly on the product, NVIDIA chose NOT to disclose many specific features or speeds and feeds of GF100. Instead, we have more architectural details and information regarding some new IQ modes and geometry-related enhancements.

In the block diagram above, the first major changes made to GF100 become evident. In each GPC cluster--there are four in the diagram--newly designed Raster and PolyMorph Engines are present. We'll give some more detail on these GPU segments a little later, but having these engines present in each GPC segment essentially allows each one to function as a full GPU. The design was implemented to allow for better geometry performance scalability, through a parallel implementation of geometry processing units. According to NVIDIA, the end result is an 8X improvement in geometry performance over the GT200. Segmenting the GPU in this way also allows for multiple levels of scalability, at either the GPC or the individual SM level.

Each GF100 GPU features 512 CUDA cores, 16 geometry units, 4 raster units, 64 texture units, 48 ROPs, and a 384-bit GDDR5 memory interface. If you're keeping count, the GT200 features 240 CUDA cores, 32 ROPs, and 80 texture units. The geometry and raster units, as they are implemented in GF100, are not in the GT200 GPU. The GT200 also features a wider 512-bit memory interface, but the need for such a wide interface is somewhat negated in GF100 because the GPU uses GDDR5 memory, which effectively offers double the bandwidth of GDDR3, clock for clock.
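To put the memory interface comparison in rough numbers, here is a minimal sketch in Python. It assumes GDDR3 moves 2 bits per pin per memory clock and GDDR5 moves 4, which is one way to read "double the bandwidth, clock for clock". The 1000 MHz clock is a made-up placeholder applied to both chips, since NVIDIA has not disclosed GF100 memory speeds, and the function name is our own.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_clock, clock_mhz):
    """Peak memory bandwidth in GB/s, where 1 GB = 1e9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = transfers_per_clock * clock_mhz * 1e6
    return bytes_per_transfer * transfers_per_second / 1e9


# Hypothetical clock, identical for both GPUs so only the interface differs.
ASSUMED_CLOCK_MHZ = 1000

gt200 = peak_bandwidth_gb_s(512, 2, ASSUMED_CLOCK_MHZ)  # 512-bit GDDR3 (assumed 2 transfers/clock)
gf100 = peak_bandwidth_gb_s(384, 4, ASSUMED_CLOCK_MHZ)  # 384-bit GDDR5 (assumed 4 transfers/clock)

print(f"GT200: {gt200:.0f} GB/s, GF100: {gf100:.0f} GB/s, ratio: {gf100 / gt200:.2f}x")
```

Under these assumptions the narrower GDDR5 interface still comes out ahead at the same clock, which is the point the comparison above is making. Real products will land wherever their shipping memory clocks put them.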

If we drill down a little deeper, each SM in each GPC comprises 32 CUDA cores, a configurable 48KB or 16KB of shared memory (up to 3x that of GT200) with the complementary 16KB or 48KB serving as L1 cache (there is no L1 cache on GT200), 4 texture units, and 1 PolyMorph Engine. In addition to the actual units, we should point out that improvements have also been made over the previous generation in 32-bit integer performance and in full IEEE 754-2008 FMA support. The increase in cache size and the addition of L1 cache were designed to keep as much data on the GPU die as possible, without having to access memory.
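For readers unfamiliar with FMA (fused multiply-add), the idea is that a*b + c is computed with a single rounding at the end, instead of rounding the product first and the sum second. The sketch below emulates that behavior on the CPU in double precision using Python's exact fractions. It is only an illustration of the rounding difference, not a description of how GF100 implements the operation, and the function names and input values are our own.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate FMA: form a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Ordinary float arithmetic: the product is rounded before the add."""
    return a * b + c


# Inputs chosen so the exact product, 1 - 2**-60, rounds to 1.0 as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fused_multiply_add(a, b, c)
separate = separate_multiply_add(a, b, c)

print("fused:   ", fused)
print("separate:", separate)
```

With these inputs the separate version loses the small residual entirely and returns zero, while the fused version keeps it. That extra accuracy is the practical benefit of doing the multiply and add as one operation.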

The L1 cache is used for register spilling, stack ops, and global loads and stores, while the L2 cache is for vertex, SM, texture, and ROP data. According to NVIDIA, the GF100's cache structure offers many benefits over GT200 in gaming applications, including faster texture filtering and more efficient processing of physics and ray tracing, in addition to greater texture coverage and generally better overall compute performance.

The PolyMorph and Raster Engines in the GPU perform very different tasks, but together they result in greater parallelism in the GPU. The PolyMorph Engines are used for world space processing, while the Raster Engines are for screen space processing. There are 16 PolyMorph Engines in total, one placed before each SM. They allow work to be distributed across the chip, but there is also intelligent logic in place designed to keep the data in order. Communications happen between the units to ensure the data arrives in DRAM in the correct order, and all of the data is kept on die, thanks to the chip's cache structure. Synchronization is handled at the thread scheduling level. The four independent Raster Engines serve the geometry shaders running in each GPC, and the cache architecture is used to pass data from stage to stage in the pipeline. We're also told that the GF100 offers 10x faster context switching than the GT200, which further enhances performance when compute and graphics modes are both being utilized.
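The unit counts quoted in this section can be cross-checked with a little arithmetic. The short Python sketch below multiplies out the per-SM and per-GPC figures. Note that the figure of four SMs per GPC is our inference from 16 PolyMorph Engines (one per SM) spread over four GPCs, not a number stated outright above, and the constant and function names are ours.

```python
GPCS = 4                  # GPC clusters shown in the block diagram
SMS_PER_GPC = 4           # assumption: 16 SMs (one per PolyMorph Engine) / 4 GPCs
CORES_PER_SM = 32
TEXTURE_UNITS_PER_SM = 4
POLYMORPH_ENGINES_PER_SM = 1
RASTER_ENGINES_PER_GPC = 1


def gf100_totals():
    """Multiply the per-SM and per-GPC counts out to chip-wide totals."""
    sms = GPCS * SMS_PER_GPC
    return {
        "sms": sms,
        "cuda_cores": sms * CORES_PER_SM,
        "texture_units": sms * TEXTURE_UNITS_PER_SM,
        "polymorph_engines": sms * POLYMORPH_ENGINES_PER_SM,
        "raster_engines": GPCS * RASTER_ENGINES_PER_GPC,
    }


totals = gf100_totals()
for name, count in totals.items():
    print(f"{name}: {count}")
```

The totals line up with the 512 CUDA cores, 64 texture units, 16 geometry units, and 4 raster units NVIDIA quotes for the full chip.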

NVIDIA GF 100 Features

Many of the new features of GF100 are designed to increase geometric realism, while offering increased image quality and, of course, high performance. One of the new capabilities that will be a part of the GF100, like other DirectX 11 class GPUs, is hardware-accelerated tessellation.

Tessellation Example

The GF100 has built-in hardware support for tessellation. As we've mentioned in the past, tessellation works by taking a basic polygon mesh and recursively applying a subdivision rule to create a more complex mesh on the fly. It's best used for amplification of animation data, morph targets, or deformation models. It also gives developers the ability to provide data to the GPU at coarser resolution. This saves artists the time it would normally take to create more complex polygonal meshes and reduces the data's memory footprint. Unlike previous tessellator implementations, the one in the GF100 adheres to the DX11 spec, and will not require proprietary code.
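To make the "recursively applying a subdivision rule" idea concrete, here is a minimal CPU-side sketch in Python. It uses simple midpoint subdivision, splitting each triangle into four, which is only one possible rule. It is not the DX11 tessellation pipeline and not what the GF100 hardware does internally, and the function names are our own.

```python
def midpoint(p, q):
    """Point halfway between two 3D points."""
    return tuple((a + b) / 2 for a, b in zip(p, q))


def subdivide(triangles, levels):
    """Recursively split every triangle into four smaller triangles."""
    if levels == 0:
        return triangles
    finer = []
    for a, b, c in triangles:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        # Three corner triangles plus the one in the middle.
        finer.extend([(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)])
    return subdivide(finer, levels - 1)


# A single coarse triangle stands in for the artist-supplied mesh.
coarse = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]

for level in range(4):
    print(f"level {level}: {len(subdivide(coarse, level))} triangles")
```

Each level multiplies the triangle count by four, so a small coarse mesh can be amplified into a dense one without the extra geometry ever being stored or sent to the GPU.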

Hair Demo

To show off the capabilities of GF100, NVIDIA used a number of interesting demos. As many of you know, properly rendering and animating realistic hair is a difficult task. As such, many games slap helmets or caps on characters, if they even have hair at all. NVIDIA's Hair Demo, however, combines tessellation with geometry shading and leverages the compute performance of the GF100 to generate flowing hair. The images were realistically lit and smoothly animated, which is a far cry from most of today's games.

Water Demo

Another demo NVIDIA used to illustrate tessellation with the GF100 was aptly dubbed the Water Demo. As you can see in the screenshots above, the water demo takes a scene with relatively basic geometry, and through increased tessellation and displacement mapping the detail in the rocks and water is dramatically increased. The demo does not use realistic fluid dynamics, but the effect was nonetheless very good. The difference in performance between the two modes was roughly 2x--with coarse geometry the demo ran at about 300FPS and with high detail it ran at about 150FPS.
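Frame rates can hide how small the absolute cost is, so it helps to convert the two figures to time per frame. The snippet below is a quick back-of-the-envelope calculation in Python using the approximate frame rates we observed. The helper name is ours.

```python
def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds."""
    return 1000.0 / fps


coarse_ms = frame_time_ms(300)    # coarse geometry, about 300 FPS
detailed_ms = frame_time_ms(150)  # high detail, about 150 FPS
extra_ms = detailed_ms - coarse_ms

print(f"coarse: {coarse_ms:.2f} ms, detailed: {detailed_ms:.2f} ms, extra: {extra_ms:.2f} ms")
```

In other words, the added tessellation and displacement detail cost on the order of a few milliseconds per frame in this particular demo.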


New GF100 Anti-Aliasing Modes

In addition to offering much more compute performance and geometry processing than previous generations, the GF100 also features new anti-aliasing modes. The GF100 will offer higher AA performance than GT200 not only due to having more ROPs, but because enhancements have been made to each ROP as well. With GF100 the data compression factor is higher in the ROPs, it can use more samples, and it offers better transparency AA quality thanks to accelerated jittered sampling.

Jittered sampling changes the sampling pattern randomly on a per-pixel basis, which helps replace banding with noise and produces an edge that is more pleasing to the eye. The GF100 also offers a new 32x CSAA mode (8 color + 24 coverage samples) in addition to support for 33 levels of alpha-blended transparency. The effect of the new AA mode is much smoother edges, as seen in the screenshots above. The new AA mode also preserves more detail on textures with transparency that are sometimes rendered incorrectly when viewed at an angle, such as a chain-link fence or railing.
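As a rough illustration of per-pixel jitter, the Python sketch below places one sample at a random position inside each cell of a small grid within a pixel, seeding the random generator from the pixel coordinates so every pixel gets its own pattern. This stratified approach is a common textbook way to jitter samples. The actual pattern GF100 uses is not something NVIDIA detailed, and all names here are ours. The second helper reflects our reading of the 33 transparency levels: with 32 samples, a pixel can have anywhere from 0 to 32 of them covered.

```python
import random


def jittered_samples(pixel_x, pixel_y, grid=2, seed=0):
    """Return grid*grid sample positions inside one pixel, one per grid cell."""
    # Seed from the pixel coordinates so the pattern differs per pixel
    # but is reproducible from run to run.
    rng = random.Random(f"{seed}:{pixel_x}:{pixel_y}")
    cell = 1.0 / grid
    samples = []
    for row in range(grid):
        for col in range(grid):
            sx = pixel_x + (col + rng.random()) * cell
            sy = pixel_y + (row + rng.random()) * cell
            samples.append((sx, sy))
    return samples


def alpha_levels(sample_count):
    """Distinct coverage fractions available: 0 covered up to all covered."""
    return sample_count + 1


for px in (0, 1):
    print(px, [(round(x - px, 2), round(y, 2)) for x, y in jittered_samples(px, 0)])
print("alpha levels with 32 samples:", alpha_levels(32))
```

Because neighboring pixels no longer share the same sample positions, the regular stair-step or banding artifacts of a fixed pattern are traded for fine-grained noise.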

NVIDIA GF 100 Features (Cont.)

To show off the increased compute performance of GF100, NVIDIA also ran a fully interactive GPU-based ray tracing demo during our briefing at CES.

GF100 Ray Tracing Demo

The ray tracing demo used two identical systems, one equipped with a GF100 prototype board and the other with a GeForce GTX 285. The demo itself used an image-based lighting paint shader with ray-traced shadows, reflections, and refractions, running at a resolution of 2560x1600. Frame rates at that high a resolution were quite low--less than 1 FPS in fact--but the GF100 system showed roughly 3x the performance of the GTX 285 (approximately 0.063 vs. 0.023 FPS).
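For anyone who wants to check the math on those sub-1 FPS figures, here is a small Python calculation. It uses only the approximate frame rates quoted above, and the helper names are ours.

```python
def speedup(new_fps, old_fps):
    """How many times faster the new frame rate is than the old one."""
    return new_fps / old_fps


def seconds_per_frame(fps):
    """Time to render a single frame, in seconds."""
    return 1.0 / fps


gf100_fps = 0.063
gtx285_fps = 0.023

ratio = speedup(gf100_fps, gtx285_fps)
print(f"speedup: {ratio:.2f}x")
print(f"GF100: {seconds_per_frame(gf100_fps):.1f} s/frame, "
      f"GTX 285: {seconds_per_frame(gtx285_fps):.1f} s/frame")
```

The ratio works out to a bit under 3x, and each frame takes many seconds on either card, which is a useful reminder of how demanding full-scene ray tracing at this resolution is.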

PhysX In Dark Void

Of course, NVIDIA was also keen to demonstrate some upcoming PhysX-enabled titles. The images above are from Airtight's Dark Void, which is due to be released in the US in just a few days. Airtight and NVIDIA jointly worked on the GPU PhysX in Dark Void to implement a Turbulence effect for the in-game jetpack and some weapon effects and impact effects with numerous particles.

NVIDIA's APEX Development Tool

Along with all of the demos, NVIDIA also spent some time talking to us about "The Way It's Meant To Be Played" program and some of the new tools and support being offered to developers. NVIDIA talked of its immense game testing labs, which developers in the program have access to, the Technical Design Documents offered to developers, and the many SDKs NVIDIA has made available over the years. One of the newer tools being shown off is called APEX. NVIDIA calls APEX a “Scalable Dynamics Framework” that consists of authoring tools and a runtime. It acts like a plug-in for many popular tools, and while using APEX we watched as PhysX effects were literally painted onto a model. APEX was used during the development of Dark Void and the upcoming game Metro 2033.


Supersonic Sled Demo

Perhaps the most complex demo NVIDIA used to showcase GF100 was the Supersonic Sled. A system equipped with three GF100 cards was used to run the demo, which exploits virtually all of the features of the GPU. The Supersonic Sled Demo uses GPU particle systems for smoke, dust, and fireballs, and PhysX physical models for rigid bodies and joints, which are partially processed on the CPU. Tessellation is used for the terrain, and image processing is used for the motion blur effect. NVIDIA called the demo the "kitchen sink" because physical simulation, DX11 tessellation, environmental effects, and image processing are all employed simultaneously.

In the demo a pilot is launched down a track on a rocket-propelled sled and general mayhem ensues. Particles are strewn about and objects like a shack, bridge, and rock ledge crumble as the sled jets by. Hundreds of thousands to a million particles can be on the screen at any given time, all being managed by the GPU. The demo requires an immense amount of compute performance to run smoothly with the detail and number of particles cranked up, hence the GF100 3-way SLI configuration.

There were other GF100 demos at CES as well, including 3D Surround--which we showed you here--and a side-by-side Far Cry 2 benchmark run which showed GF100 running roughly 65% faster than a GTX 285 at 1920x1200 (84 FPS vs. 50.4 FPS). All told, we wish we had more specific detail regarding GF100 to share with you today, and we know NVIDIA feels the same. For now, we'll all just have to wait a little longer and hope that NVIDIA hits its current Q1 2010 release target.

Content Property of HotHardware.com