How AMD's HSA Queuing Technology Simplifies GPU Acceleration

One of the greatest problems standing between companies like AMD and widespread adoption of the GPU as a mainstream accelerator is that it's extremely difficult to leverage the GPU effectively. Even after years of development poured into CUDA, OpenCL, and yes, HSA, the barriers between CPU and GPU have remained substantial. The reason is simple, at a high level. The CPU was always designed as the Central Processing Unit. As the years passed, and more and more workloads that were previously handled by specialized accelerators moved to the CPU, there was less and less need to treat any other processor in the system as a partner. AMD's Heterogeneous System Architecture aims to reverse this, but creating a framework for doing so has taken a great deal of time.

Bit by bit, the company has pulled back the curtain on various aspects of its HSA technology, and today it's explaining another facet -- HSA Queueing, or hQ (clever) for short. Here's how it works:



One of the core problems preventing the widespread adoption of GPGPU is that it simply takes too much time to route all of the calls and processing through the CPU. A better way to think of it is this: Imagine that the GPU can perform a task 3x faster than a CPU, but the message passing back and forth between the application, the CPU, and the GPU takes 10x longer than it would if the task simply stayed on the CPU to start with. Since data passing and thread management are only one component of performing a task, it may still be faster to use the GPU -- but the performance benefit shrinks, from say 3x to 1.5x. Since it's always more difficult to optimize code for this kind of specialized co-processing, most developers aren't willing to commit to doing it for "merely" a 50% improvement.
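To put rough numbers on that shrinkage, here is a minimal sketch of the arithmetic. The function name and the overhead figure are assumptions chosen purely for illustration; the overhead is set to one third of the CPU-only run time because that is the value that turns a 3x compute advantage into the 1.5x overall gain used in the example above. None of these figures come from AMD.

```python
def effective_speedup(cpu_time, gpu_speedup, overhead_time):
    """Overall speedup of offloading a task, given a fixed dispatch overhead.

    cpu_time      -- time to run the task on the CPU alone
    gpu_speedup   -- how many times faster the GPU does the raw compute
    overhead_time -- extra time spent on message passing and thread management
    """
    gpu_total = cpu_time / gpu_speedup + overhead_time
    return cpu_time / gpu_total


cpu_time = 1.0  # normalize the CPU-only run to one unit of time

# Ideal case: the GPU's 3x compute advantage with no dispatch cost at all.
ideal = effective_speedup(cpu_time, gpu_speedup=3.0, overhead_time=0.0)

# Hypothetical overhead equal to a third of the CPU-only run time.
realistic = effective_speedup(cpu_time, gpu_speedup=3.0, overhead_time=cpu_time / 3.0)

print(f"ideal: {ideal:.2f}x, with overhead: {realistic:.2f}x")
```

In this toy model the overhead term is the only thing a faster dispatch path can improve, which is why trimming it matters so much to the final number.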

What hQ does is allow applications to dispatch work directly to the GPU, while treating the GPU and CPU as equal partners. Both the CPU and the GPU can now be dispatched to.



By giving the application queue direct access to both CPU and GPU resources, hQ removes the need to wait on the CPU to shuttle data back and forth. The application can maintain multiple queues for both the CPU and the GPU, and there's no need to write a low-level driver to interface between them. This is part of what's built into HSA-compliant hardware, and as AMD notes, there's no kernel overhead for this kind of processing.
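The dispatch model described above can be pictured with a small toy simulation. This is only a sketch of the idea, not the HSA runtime API: the `AgentQueue` class, its methods, and the tasks are all hypothetical, and plain Python obviously runs everything on the CPU. The point it illustrates is just the routing: the application places work into a per-processor queue itself, and nothing in between relays it.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple


@dataclass
class AgentQueue:
    """Toy stand-in for a work queue owned by one processor (hypothetical)."""
    agent: str
    packets: Deque[Callable[[], int]] = field(default_factory=deque)

    def dispatch(self, task: Callable[[], int]) -> None:
        # The application writes the work item straight into the queue;
        # in this model nothing passes through a driver or another processor.
        self.packets.append(task)

    def drain(self) -> List[Tuple[str, int]]:
        # Stand-in for the processor consuming its own queue in order.
        results = []
        while self.packets:
            results.append((self.agent, self.packets.popleft()()))
        return results


# One queue per processor here; an application could hold several of each.
cpu_queue = AgentQueue("cpu")
gpu_queue = AgentQueue("gpu")

# The application decides where each task goes and enqueues it directly.
gpu_queue.dispatch(lambda: sum(x * x for x in range(4)))
cpu_queue.dispatch(lambda: len("hsa"))
gpu_queue.dispatch(lambda: max(3, 7))

results = cpu_queue.drain() + gpu_queue.drain()
print(results)
```

Both queues expose the same interface, which is the "equal partners" idea in miniature: the application code that dispatches to the GPU looks no different from the code that dispatches to the CPU.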

Bringing down latency and simplifying operation are essential to AMD's efforts to get HSA adopted widely. Heterogeneous computing is widely viewed as one of the major ways companies like AMD, Qualcomm, and Samsung will accelerate their products in the next few years, which makes these developments particularly important.