Arm Mali-G76 and V76 Architecture Details
Arm’s graphics team has eschewed a brand-new architecture for the time being and has instead opted to refine the venerable Bifrost architecture. Bifrost was introduced with the high-end Mali-G71 GPU, and the Mali-G76 now brings it into its third generation.
The G76 GPU is intended to be as at home with gaming and VR as it is with machine learning. Under iso-process and iso-frequency conditions (that is, on the same process node at the same clock speed), the G76 improves performance density and energy efficiency by 30% over the previous-generation G72 while providing a 2.7x uplift in machine learning workloads.
According to Arm, thermals have become the limiting factor for this architecture. The G76 is built upon a 7nm process node, shrunk from 10nm for the G72. Arm estimates a performance improvement of 20% from this die shrink alone. Combined with the 30% performance density gain from architectural improvements referenced above, Arm pins the total performance uptick from the G72 at around 1.5x.
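As a quick sanity check on that figure, the two quoted gains can be combined. This is only a sketch: it assumes the process gain and the architectural gain multiply, which the article does not state explicitly, and the variable names are illustrative.

```python
# Gains quoted by Arm; combining them multiplicatively is an assumption.
process_gain = 1.20       # estimated uplift from the 10nm -> 7nm shrink
architecture_gain = 1.30  # performance density gain from the architecture

combined = process_gain * architecture_gain
print(f"Combined uplift: {combined:.2f}x")  # prints "Combined uplift: 1.56x"
```

Under that assumption the product lands at roughly 1.56x, which is consistent with Arm's "around 1.5x" estimate.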
Architecturally, the G76 uses three execution engines per shader core. Each execution engine now has 8 execution lanes, up from 4 in the G72.
The maximum number of cores is lower at just 20, down from 32, but because each engine is twice as wide, the total number of execution lanes has still increased by 25%. The execution units now also support int8 dot products for improved machine learning performance.
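The 25% figure can be reproduced with a little arithmetic. The sketch below assumes the G72 also has three execution engines per core, which the article implies but does not state; the function name is illustrative.

```python
ENGINES_PER_CORE = 3  # stated for the G76; assumed to match for the G72


def total_lanes(cores: int, lanes_per_engine: int) -> int:
    """Total execution lanes in a maximum configuration."""
    return cores * ENGINES_PER_CORE * lanes_per_engine


g72_lanes = total_lanes(cores=32, lanes_per_engine=4)  # 384
g76_lanes = total_lanes(cores=20, lanes_per_engine=8)  # 480
increase = g76_lanes / g72_lanes - 1

print(f"G72: {g72_lanes} lanes, G76: {g76_lanes} lanes, +{increase:.0%}")
```

Fewer but wider cores therefore yield 480 lanes against 384, matching the quoted 25% increase.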
The increase to 8 execution lanes per engine has further benefits for efficiency. The workload portion is identical in terms of power, but the control and cache requirements are significantly reduced by this new configuration.
The texture unit has been upgraded from the G72 as well; it is now a dual texture unit to keep pace with the execution engine improvements. It is able to process 2 texels per cycle as a result. Arm noted that other optimizations are primarily to accommodate the execution engines' higher throughput.
Finally, Arm has released a new video processor, the Mali-V76. VPUs are purpose-built processors that accelerate the encoding and decoding of video.
The V76 improves on the recently announced V52 but is not intended to outright replace it across all market segments. The V76 is really geared towards the next generation of 8K Ultra HD content, like planned coverage of the upcoming 2018 Winter Olympics in PyeongChang.
Perhaps 8K sounds like overkill when a majority of video content is still being produced in good old 1080p FHD, but it does carry other practical advantages. For instance, Arm’s V76 can support an 8K video wall composed of sixteen individual 1080p video streams at 60 FPS. The same capacity can also be split unevenly, with two 4K streams and eight 1080p streams as one example. These streams could be run across multiple displays or a single screen, but either way the ability to power the setup with a single chip is very attractive because it simplifies configurations and reduces cost.
Arm also briefly highlighted improvements for VR with its new chip. Manufacturers are struggling to drive resolutions high enough to maintain a convincing and immersive experience in head-mounted displays. The V76's ability to decode 8K60 video is invaluable for video-based experiences, and its ability to deliver high-frame-rate 4K footage does not hurt either.
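The stream combinations above follow from simple pixel counting. The sketch below assumes all streams run at the same frame rate and that total pixel count is the only limit, which is a simplification; the helper names are hypothetical and not part of any Arm API.

```python
# Standard 16:9 resolutions, in pixels.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}


def pixels(name: str) -> int:
    width, height = RESOLUTIONS[name]
    return width * height


def fits_8k_budget(streams: dict) -> bool:
    """True if the combined streams need no more pixels than one 8K frame."""
    needed = sum(pixels(name) * count for name, count in streams.items())
    return needed <= pixels("8K")


print(fits_8k_budget({"1080p": 16}))          # True
print(fits_8k_budget({"4K": 2, "1080p": 8}))  # True
print(fits_8k_budget({"4K": 4, "1080p": 1}))  # False
```

Both configurations mentioned in the text use exactly one 8K frame's worth of pixels, since a 4K frame is equivalent to four 1080p frames.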
In many respects, the V76 performs on par with the V52 on a per-core basis. Both solutions can encode a 1080p60 stream or decode a 4K30 stream with a single core, and both can encode a 4K60 stream or decode a 4K120 stream with four cores running. However, the V52 maxes out at four cores, while the new V76 can scale up to eight. This enables 8K30 encode and its impressive 8K60 decode capabilities. The V76 supports the same mix of codecs as the V52, most notably HEVC, VP9, and H.264, all with 8- and 10-bit support.
Unlike the V76, however, the V52 is only intended for mainstream solutions. The V61 serves as a better comparison for Arm's high-end offerings. Against the V61, the V76 provides double the decode performance and up to a 25% improvement in encode quality.
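The core counts above are consistent with throughput scaling linearly with core count. The sketch below makes that assumption explicit and takes the single-core figures from the text as the per-core pixel rate; the function names are illustrative only.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels processed per second for a given stream."""
    return width * height * fps


# Single-core capabilities quoted in the text.
ENCODE_PER_CORE = pixel_rate(1920, 1080, 60)  # 1080p60 encode
DECODE_PER_CORE = pixel_rate(3840, 2160, 30)  # 4K30 decode


def cores_needed(width: int, height: int, fps: int, per_core: int) -> int:
    """Cores required, assuming perfectly linear scaling (an assumption)."""
    rate = pixel_rate(width, height, fps)
    return -(-rate // per_core)  # integer ceiling division


print(cores_needed(3840, 2160, 60, ENCODE_PER_CORE))   # 4K60 encode  -> 4
print(cores_needed(3840, 2160, 120, DECODE_PER_CORE))  # 4K120 decode -> 4
print(cores_needed(7680, 4320, 30, ENCODE_PER_CORE))   # 8K30 encode  -> 8
print(cores_needed(7680, 4320, 60, DECODE_PER_CORE))   # 8K60 decode  -> 8
```

Under this model, the 8K modes need eight cores, which is why they are out of reach for the four-core V52.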
This quality metric is measured by comparing encoded video frames to their raw counterparts and checking the peak signal-to-noise ratio that arises from the encoding process.
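For readers unfamiliar with the metric, here is a minimal sketch of the standard PSNR formula for 8-bit samples. Arm's exact methodology (for example, which color planes are measured or how frames are averaged) is not described, so this shows only the textbook definition on flat lists of pixel values.

```python
import math


def psnr(reference, encoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(encoded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the raw and encoded samples.
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)


raw = [100, 100, 100, 100]
light_loss = [101, 99, 100, 100]
heavy_loss = [110, 90, 105, 95]
print(psnr(raw, light_loss) > psnr(raw, heavy_loss))  # True
```

A higher PSNR means the encoded frame is closer to the raw frame, so a better encoder scores higher at the same bitrate.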
Arm further improved video quality by sharing data with its display processor to make scene-aware local tone mapping possible. Video encoders often take advantage of unchanged pixels from one frame to the next by essentially not redrawing them. When a scene change occurs, it brings a spike in data because nearly every pixel must be redrawn. Given a fixed bitrate, this results in extra artifacting for the first few frames after the change until the codec's compression is able to catch up. With the improvements to local tone mapping, the V76 is able to reduce the impact of these artifacts by 30%.
Finally, the V76 improves performance density over the V61. Using a 4K120-optimized setup as an example, Arm says the V76 requires 40% less die space. Most of these savings stem from the V76's ability to decode 4K120 on just four cores, compared with the V61's eight-core configuration.