AMD Touts Asynchronous Shader Technology In Its GCN Architecture
As the name suggests, Asynchronous Shaders refers to a GPU’s ability to execute shader instructions independently and out of sync. The technology leverages a trio of workload queues to handle multiple streams of work simultaneously and make more efficient use of available GPU resources on tasks that can be parallelized.
As it stands today, even though many graphics processing tasks are parallelizable, they are executed sequentially from a single queue, in a predetermined order. With Asynchronous Shaders, however, three different queues are available: a Graphics queue, a Compute queue, and a Copy queue, and tasks from the different queues can be scheduled independently. The Graphics queue handles primary rendering tasks, the Compute queue handles things like physics, lighting, and post-processing effects, and the Copy queue handles data transfers.
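For developers, that three-queue model maps directly onto the command queue types Direct3D 12 exposes. What follows is a minimal sketch using the standard D3D12 API, not anything AMD-specific; it assumes a device has already been initialized, with the usual setup omitted for brevity.

    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // A minimal sketch: D3D12 exposes the three queue types described above
    // as distinct command-list types. Assumes `device` is an already
    // initialized ID3D12Device.
    void CreateQueues(ID3D12Device* device,
                      ComPtr<ID3D12CommandQueue>& graphicsQueue,
                      ComPtr<ID3D12CommandQueue>& computeQueue,
                      ComPtr<ID3D12CommandQueue>& copyQueue)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};

        // Graphics ("direct") queue: primary rendering; it can also
        // execute compute and copy work.
        desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

        // Compute queue: physics, lighting, post-processing, and the like.
        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

        // Copy queue: data transfers.
        desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
    }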
In the example in the slide above, DirectX 11 funnels physics, lighting, and data transfers into a single queue, executed synchronously in one stream that takes X amount of time. Each of these effects uses different GPU resources, however, and executing one command doesn’t necessarily require the results of a previous command, so there’s no technical reason, other than the limitations of the API, that they can’t run concurrently.
With Asynchronous Shaders and DirectX 12 (or Vulkan, or Mantle), the tasks are broken up into the three available queues and parallelized. The end result is that the same work can be done in less time, which can reduce latency, boost frame rates, or both. GPU resources are utilized more efficiently and effective performance increases.
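In D3D12 terms, that parallelization amounts to submitting independent command lists to different queues and synchronizing only where a real dependency exists. The sketch below illustrates the pattern with the queues created earlier; the command lists and fence objects (gfxList, computeList, fence, fenceValue) are hypothetical names assumed to be recorded and created elsewhere.

    // Work submitted to separate queues may be scheduled concurrently by
    // the GPU; nothing forces one stream to wait for the other.
    ID3D12CommandList* gfxLists[]     = { gfxList };      // hypothetical, recorded elsewhere
    ID3D12CommandList* computeLists[] = { computeList };  // hypothetical, recorded elsewhere

    graphicsQueue->ExecuteCommandLists(1, gfxLists);
    computeQueue->ExecuteCommandLists(1, computeLists);

    // Ordering is imposed only where it is actually required. If a later
    // graphics pass consumes the compute results (post-processing, say),
    // the compute queue signals a fence and the graphics queue waits on it
    // before that dependent pass is submitted.
    computeQueue->Signal(fence, ++fenceValue);
    graphicsQueue->Wait(fence, fenceValue);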
This example shows how a modern game engine might utilize the available Graphics, Compute, and Copy queues for rendering tasks, compute workloads, and data transfers. AMD claims the model represented here is already familiar to developers who optimize their engines for game consoles. And as in the earlier example, all of the queues can be executed in parallel to better use GPU resources and maximize the potential performance improvement.
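One common pattern in that engine model is streaming the next frame’s resources on the Copy queue while the current frame renders. The sketch below follows the same D3D12 assumptions as before; copyList, frameList, copyFence, and copyFenceValue are again hypothetical names.

    // Sketch: stream the next frame's resources on the copy queue while the
    // current frame renders on the graphics queue.
    ID3D12CommandList* copyLists[]  = { copyList };   // hypothetical upload commands
    ID3D12CommandList* frameLists[] = { frameList };  // hypothetical rendering commands

    copyQueue->ExecuteCommandLists(1, copyLists);       // next frame's uploads
    copyQueue->Signal(copyFence, ++copyFenceValue);     // fence marks upload completion

    graphicsQueue->ExecuteCommandLists(1, frameLists);  // current frame renders in parallel

    // Before the frame that consumes the uploaded data is submitted, the
    // graphics queue waits on the copy fence; until then, the two queues
    // proceed independently.
    graphicsQueue->Wait(copyFence, copyFenceValue);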
Leveraging Asynchronous Shaders is possible with AMD’s GCN-based GPUs thanks to what AMD calls Asynchronous Compute Engines, denoted by the ACEs flanking the command processor along the very top of the block diagram above.
There can be up to 8 Asynchronous Compute Engines per GPU in current Hawaii-based products like the Radeon R9 290X, and each ACE can manage up to 8 queues, for a total of 64 compute queues, all operating in parallel with the command processor. The Asynchronous Compute Engines have access to the GPU’s L2 cache and global data share, and offer fast context switching as well.
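On the API side, those hardware queues surface as queue families an application can enumerate and submit to. Here’s a small, self-contained Vulkan sketch (standard Vulkan API calls, nothing AMD-specific) that prints the queue families and counts a physical device exposes:

    #include <vulkan/vulkan.h>
    #include <cstdio>
    #include <vector>

    // Enumerates and prints the queue families a physical device exposes.
    // On GCN hardware, the dedicated compute families are backed by the ACEs.
    void PrintQueueFamilies(VkPhysicalDevice gpu)
    {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
        std::vector<VkQueueFamilyProperties> families(count);
        vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

        for (uint32_t i = 0; i < count; ++i) {
            const VkQueueFamilyProperties& f = families[i];
            std::printf("family %u: %u queue(s)%s%s%s\n", i, f.queueCount,
                        (f.queueFlags & VK_QUEUE_GRAPHICS_BIT) ? " graphics" : "",
                        (f.queueFlags & VK_QUEUE_COMPUTE_BIT)  ? " compute"  : "",
                        (f.queueFlags & VK_QUEUE_TRANSFER_BIT) ? " transfer" : "");
        }
    }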
In a demonstration from AMD’s LiquidVR SDK, a scene is processed with Asynchronous Shaders and post-processing effects disabled, and it renders at about 245 frames per second. Turning on the post-processing effects with Asynchronous Shaders still disabled drops the frame rate to 158 FPS. Still fast, but that’s obviously a significant drop in performance. With Asynchronous Shaders turned on, however, performance jumps back up to the 230 FPS range.
Asynchronous Shaders have many applications over and above virtual reality. They are already being used on the PS4 in games like Battlefield 4, inFAMOUS Second Son, and The Tomorrow Children, and on the PC in the Mantle version of Thief. They can be particularly useful, though, for things like asynchronous time warp, which minimizes head-tracking latency by executing the VR image processing in parallel with the actual scene rendering. Asynchronous time warp is also a capability of NVIDIA’s Maxwell-based GPUs; we discussed it and a handful of other features related to NVIDIA’s VR Direct technologies in our initial coverage of the GeForce GTX 980 launch.
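One plausible way a VR runtime could schedule time warp alongside rendering, sketched below under the same D3D12 assumptions as the earlier snippets, is to give the warp its own high-priority compute queue so it can slot in around the in-flight scene workload; timewarpList is a hypothetical command list holding the warp’s compute work.

    // Sketch: a dedicated high-priority compute queue for asynchronous time
    // warp, so the warp can run alongside in-flight scene rendering rather
    // than waiting behind it. Assumes `device` from earlier.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    ComPtr<ID3D12CommandQueue> timewarpQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&timewarpQueue));

    // Each refresh interval, the warp is submitted with the latest head pose
    // while the graphics queue continues rendering the next frame.
    ID3D12CommandList* warpLists[] = { timewarpList };  // hypothetical warp commands
    timewarpQueue->ExecuteCommandLists(1, warpLists);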
There are many potential benefits to Asynchronous Shaders. Because many serial workloads can be broken up and parallelized, GPU resources that would normally sit idle for relatively long periods can be more fully utilized, executing workloads faster and improving performance in terms of both latency and frame rate. Lower latency is an obvious plus for VR applications because it minimizes the lag associated with accurate head-tracking and makes the user feel more connected to the scene. And improving the effective performance of a GPU could let developers layer on additional graphics effects to enhance image quality, while still hitting a specific performance target.