Intel Lunar Lake Deep Dive: Xe2 Battlemage Graphics Engine

We're just going to come out and say it: the first generation of Intel Xe was all over the place. There were many variations on the architecture, and it appeared in various forms across a huge variety of products. It had implementation difficulties and design problems, both resulting from being a first-generation product. It happens to everyone their first time.

intel xe2 gpu architecture 3

Intel says that Xe2 is its new architecture, and that there are no suffixes like "LP" or "HPG"; it's just Xe2. This architecture is debuting in Lunar Lake, but it's also going to power the company's next-generation discrete GPUs, known as "Battlemage". Intel Fellow Tom "TAP" Petersen told us that between Lunar Lake's iGPU and Battlemage, it's a "difference in transistors, not in architecture."

Lunar Lake's Powerful Xe2 Battlemage iGPU

According to its creators, Xe2 has "dramatically higher" compatibility and utilization in most workloads. High utilization means high efficiency, which is, of course, the name of the game today. The slide below shows a variety of micro-benchmarks for Xe2, with multipliers compared against the original Xe architecture. As you can see, everything is improved, but some things much more than others.

intel xe2 gpu architecture 4

Those first two bars are "Compute Dispatch XI" and "Draw XI". In this case, "XI" stands for "eXecute Indirect", a critical feature of DirectX 12 and Vulkan. Without getting too into the weeds, this is a method of sending commands to the GPU where, instead of having the CPU directly issue every single command, you batch them all up into an indirect buffer and then send an "ExecuteIndirect" command which tells the GPU to go ahead and do all that stuff.
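To make the difference concrete, here's a toy model in Python. It is only a sketch of the idea, not the DirectX 12 or Vulkan API: the `ToyGPU` class, its method names, and the draw arguments are all hypothetical, and the only thing it measures is how many commands the CPU has to issue.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGPU:
    """Hypothetical stand-in for a GPU queue; it only counts CPU-side submissions."""
    cpu_submissions: int = 0
    draws_executed: list = field(default_factory=list)

    def draw(self, vertex_count, instance_count):
        # Direct path: the CPU issues one command per draw.
        self.cpu_submissions += 1
        self.draws_executed.append((vertex_count, instance_count))

    def execute_indirect(self, arg_buffer, max_commands):
        # Indirect path: the CPU issues one command, and the GPU walks the
        # argument buffer on its own.
        self.cpu_submissions += 1
        for args in arg_buffer[:max_commands]:
            self.draws_executed.append(args)


# Made-up (vertex_count, instance_count) pairs for illustration.
draws = [(36, 1), (6, 128), (3, 1), (1200, 4)]

direct = ToyGPU()
for vertex_count, instance_count in draws:
    direct.draw(vertex_count, instance_count)

indirect = ToyGPU()
indirect.execute_indirect(draws, max_commands=len(draws))

print("direct submissions:  ", direct.cpu_submissions)
print("indirect submissions:", indirect.cpu_submissions)
```

Both paths end up executing the same draws, but the indirect path gets there with a single CPU-issued command. In a real engine the argument buffer lives in GPU memory and can even be filled in by a compute shader, which this sketch does not attempt to model.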

Even though indirect execution has been a part of DirectX since Direct3D 11.0, as it happens, the original Xe architecture didn't support it in hardware, so it had to be emulated in the driver. This is one of the biggest reasons that the performance of Arc GPUs is so inconsistent across a wide variety of titles. As you've probably guessed from the micro-benchmark results above, Execute Indirect is supported on Xe2.

intel xe2 gpu architecture 5

Xe2 is described as being a "very straightforward modular architecture." The GPU is divided into one or more render slices, each of which includes three or more Xe Cores as well as fixed-function graphics hardware that handles things like geometry, texturing, and rasterization. On Xe2 as implemented in Lunar Lake, there are two render slices, each with four Xe Cores.

intel xe2 gpu architecture 24

This diagram outlines the overall layout of the integrated Xe2 GPU for Lunar Lake. You've got eight Xe Cores, each with eight Xe Vector Engines (XVEs) that, unlike those in the Xe-LPG integrated GPU on Meteor Lake, each include one XMX matrix math unit. Each Xe Core is also paired with a ray-tracing unit, and then it all feeds through the fixed-function stuff to the render backend.

intel xe2 gpu architecture 7

The Xe Core in Xe2 is very different from that in the original Xe architecture. Where that design used 8-wide SIMDs, Xe2 has been rejiggered to use 16-wide SIMDs. This is much more similar to contemporary GPU architectures, and this change alone offers a significant improvement in game compatibility, requiring fewer driver shims to make games work properly.
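Putting the figures from the last few paragraphs together gives a quick sense of scale for the Lunar Lake configuration. The short Python sketch below just multiplies out the hierarchy. The constant names are our own, and the values are the ones quoted in this article rather than anything read from hardware.

```python
# Figures quoted in the article for Xe2 as implemented in Lunar Lake.
RENDER_SLICES = 2
XE_CORES_PER_SLICE = 4
XVES_PER_XE_CORE = 8
SIMD_LANES_PER_XVE = 16  # Xe2's native SIMD width

xe_cores = RENDER_SLICES * XE_CORES_PER_SLICE
xves = xe_cores * XVES_PER_XE_CORE
simd_lanes = xves * SIMD_LANES_PER_XVE

print(f"Xe Cores:   {xe_cores}")
print(f"XVEs:       {xves}")
print(f"SIMD lanes: {simd_lanes}")
```

That works out to 8 Xe Cores, 64 XVEs, and 1,024 SIMD lanes across the whole iGPU.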

intel xe2 gpu architecture 8

The XMX units built into the XVEs are very powerful, capable of completing 2048 FP16 ops per clock or a massive 4096 ops per clock in INT8 math. These are, obviously, intended for AI math, including (but not limited to) XeSS upscaling. Of course, these are also critical to the performance of Xe2 when this architecture gets adapted to the role of server-borne AI accelerator.

intel xe2 gpu architecture 9

This chart shows how to calculate that "TOPS" number everyone throws around, because it depends very heavily on the data type that you're working in. Most people throw around INT8 TOPS numbers, and to find that value, you multiply the number of relevant cores (say, XMX units) by the clock rate and then the "Ops/clock" number found in this chart.
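As a worked example of that multiplication, here is a minimal Python sketch. Two things in it are assumptions on our part rather than figures from Intel's chart: we treat the ops/clock numbers quoted above as applying per Xe Core, and we plug in a hypothetical 2.0 GHz clock purely for illustration.

```python
def peak_tops(units, ops_per_clock, clock_hz):
    """Peak tera-operations per second: units x ops/clock x clock rate."""
    return units * ops_per_clock * clock_hz / 1e12


XE_CORES = 8                # Lunar Lake's Xe Core count, per the article
INT8_OPS_PER_CLOCK = 4096   # quoted above; ASSUMED here to be per Xe Core
FP16_OPS_PER_CLOCK = 2048   # quoted above; ASSUMED here to be per Xe Core
ASSUMED_CLOCK_HZ = 2.0e9    # hypothetical clock rate, not an Intel spec

int8_tops = peak_tops(XE_CORES, INT8_OPS_PER_CLOCK, ASSUMED_CLOCK_HZ)
fp16_tops = peak_tops(XE_CORES, FP16_OPS_PER_CLOCK, ASSUMED_CLOCK_HZ)

print(f"INT8: {int8_tops:.1f} TOPS")
print(f"FP16: {fp16_tops:.1f} TOPS")
```

Under those assumptions, the INT8 figure comes out at roughly double the FP16 one, which is exactly why it matters which data type a quoted TOPS number refers to. These are theoretical peaks; real workloads will land below them.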

intel xe2 gpu architecture 25

Talking specifically about the Xe2 GPU in Lunar Lake, Intel says that it can achieve around 50% better performance than the previous generation "U" processors at the same power, which is phenomenal. It also can obviously achieve the same performance with much lower power, but perhaps more notably, it can handily outperform the higher-power Meteor Lake H SoC's Arc GPU, as well.

intel xe2 gpu architecture 29

This slide goes over the display engine for Lunar Lake, which Intel emphasized is specific to Lunar Lake, not Xe2. In other words, this information doesn't apply to other products, like Battlemage. Lunar Lake in particular supports three displays simultaneously, which can be three DisplayPort or HDMI connections, or two of those plus a separate connection over eDP 1.5 for a laptop's internal display. It supports triple 4K60 monitors, or refresh rates up to 360 Hz. Not bad.

intel xe2 gpu architecture 35

Speaking of eDP 1.5, it's quite a novelty, and Intel is the first to support it. It's coming along with a fair few new fancy features, including selective update with early transport as well as adaptive sync with panel replay. The former, selective update, refers to a process where the display can skip fetching and transmitting repeated parts of frames, while "panel replay" is a feature that allows the screen to simply show the same frame again if there's not a new one ready. Even better, though, is the ability to use VRR to match content refresh rates, which is also supported.
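The logic behind those two features can be sketched in a few lines of Python. To be clear, this is a toy model and not the eDP 1.5 protocol: frames are just lists of rows, and the function names are hypothetical. It simply decides whether a new frame needs a partial update or no update at all.

```python
def changed_row_spans(previous, current):
    """Return (start, end) row spans, end exclusive, where two
    equally sized frames differ."""
    spans = []
    start = None
    for row, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            if start is None:
                start = row  # a changed region begins here
        elif start is not None:
            spans.append((start, row))  # the changed region just ended
            start = None
    if start is not None:
        spans.append((start, len(current)))
    return spans


def plan_update(previous, current):
    """Pick 'replay' when nothing changed, else a selective update."""
    spans = changed_row_spans(previous, current)
    if not spans:
        return "replay", []
    return "selective_update", spans


frame_a = ["aaaa", "bbbb", "cccc", "dddd", "eeee", "ffff"]
frame_b = ["aaaa", "bbbb", "CCCC", "DDDD", "eeee", "ffff"]

print(plan_update(frame_a, frame_a))
print(plan_update(frame_a, frame_b))
```

In the second case only two of the six rows would need to be fetched and sent, which is where the power savings come from when most of the screen is static.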

intel xe2 gpu architecture 44

Combining all the technologies that Intel has created for the display engine, including panel replay, a hardware flip-queue, early display engine wake, display clock decoupling, and more, we can see considerable power savings during a variety of common activities. The biggest gains are when watching YouTube full-screen, something people are likely to do for hours on end.

intel xe2 gpu architecture 46

Lunar Lake's media engine also sees major modifications, including the addition of a memory-side cache. This serves as a system-level cache, but we're mentioning it here because it offers huge power savings for media decode and encode tasks. It's an 8 MB cache that we believe essentially caches anything coming out of RAM, keeping the machine from having to step out to the relatively power-thirsty main memory.

intel xe2 gpu architecture 47

The media engine itself is quite capable in terms of functions, although we do have to note that since Lunar Lake only has a single MFX, it only supports a single concurrent video stream. Still, that stream can be up to 8K60 whether encoding or decoding, and 10-bit-per-channel HDR is fully supported. The revised media block also adds support for the new H.266 Versatile Video Coding (VVC) standard, although only for decode, for now.

intel xe2 gpu architecture 57

Of course, drivers are a critical part of graphics performance, and Intel has proven its commitment to improving its graphics drivers. As the company said, its DirectX 9 driver was based on a completely different paradigm two years ago, and performance across the board has steadily improved over the 27 months since the launch of the Xe architecture. Intel assures us that the driver situation for Xe2 at launch will be completely unlike the rocky state of the graphics software for Alchemist, and we believe it.
