Intel Lunar Lake CPU Deep Dive: Chipzilla’s Mobile Moonshot

Intel Lunar Lake Deep Dive: Lion Cove P-Cores For Peak Performance

Lion Cove is, as the name implies, the successor to Redwood Cove as implemented in Meteor Lake. Redwood Cove was itself a minor upgrade over Raptor Cove as implemented in Raptor Lake, but Lion Cove is a dramatic shift. In Intel's own words, the goal was to "remove any transistor from the design that doesn't directly contribute to product goodness." The company says it took the opportunity afforded by Lunar Lake to "fundamentally resolve some microarchitecture roadblocks" that had been present in the "Cove" lineage for some time.

Lunar Lake's Lion Cove P-Cores Explained

In so doing, the company targeted three main goals: single-threaded performance and area efficiency, a full overhaul of the microarchitecture offering future scalability, and modernization of its design database allowing Intel to iterate on the design more readily for future products. We'll go over Intel's changes in each of these three categories.

One of the biggest changes in Lion Cove is, as rumored, the removal of Hyper-Threading. This familiar feature came to x86 with the Pentium 4 HT, and Intel explains that in an environment with a high number of threads in use, Hyper-Threading can still deliver a 30% IPC uplift for 20% more power at the same voltage and frequency. That's a very solid gain, and as a result, Hyper-Threading is going to hang around in your big P-core-only server parts.
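
To put that trade-off in perspective, here is a quick back-of-the-envelope sketch. It assumes the figures mean 30% more throughput for 20% more power on a fully loaded core; the function name is ours, not Intel's.

```python
def perf_per_watt_ratio(throughput_gain: float, power_increase: float) -> float:
    """Perf-per-watt of a core with Hyper-Threading relative to the same core without it.

    Both arguments are fractions, e.g. 0.30 for a 30% gain.
    """
    return (1.0 + throughput_gain) / (1.0 + power_increase)


# Figures quoted above: +30% throughput for +20% power (our reading of the claim).
ratio = perf_per_watt_ratio(0.30, 0.20)
print(f"Perf per watt with Hyper-Threading: {ratio:.3f}x")
```

Under that reading, Hyper-Threading improves efficiency by roughly 8% when there are enough threads to keep both hardware threads busy, which is why it still makes sense on heavily threaded server parts.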

Indeed, it's really the presence of E-cores that makes Hyper-Threading less attractive. Due to the way tasks are scheduled, it's usually both more efficient and even more performant to schedule a task on an E-core instead of trying to make use of the hyper-threads on a P-core. Because of this, Intel's typical scheduling paradigm was to fill up P-cores first, then use E-cores, and then finally start scheduling things on the hyper-threads. (The scheduling paradigm is totally different now on Lunar Lake, but we'll talk about that later.)
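
The fill order described above can be sketched as a toy model. This is only an illustration of the ordering, not how Intel's Thread Director or any OS scheduler is actually implemented; the slot names and both functions are hypothetical.

```python
def placement_order(p_cores: int, e_cores: int) -> list[str]:
    """Return hardware-thread slots in the order the toy scheduler fills them."""
    primary = [f"P{i}" for i in range(p_cores)]       # one thread per P-core first
    efficiency = [f"E{i}" for i in range(e_cores)]    # then the E-cores
    siblings = [f"P{i}-HT" for i in range(p_cores)]   # hyper-threads come last
    return primary + efficiency + siblings


def place(n_threads: int, p_cores: int, e_cores: int) -> list[str]:
    """Assign n_threads busy threads to slots; threads beyond capacity are ignored."""
    return placement_order(p_cores, e_cores)[:n_threads]


# Hypothetical chip with 2 P-cores and 4 E-cores running 7 busy threads.
print(place(7, p_cores=2, e_cores=4))
```

In this example the first hyper-thread is only used by the seventh thread, once every physical core is already busy, which illustrates how rarely that silicon gets exercised in lightly threaded client workloads.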

Since we're usually only scheduling one thread per P-core, that means there's a ton of silicon area wasted on Hyper-Threading. It doesn't come for free; not only do you need additional logic to handle the second thread, but there's all this support logic too, for thread scheduling and security purposes. Ripping out Hyper-Threading allows Intel to make the P-core significantly denser and more efficient, but Intel didn't stop there.

Lion Cove, at least as implemented in Lunar Lake, also drops all silicon support for the company's Transactional Synchronization Extensions, Advanced Matrix Extensions, and "various other features". Again, the company was very serious about that "remove any transistor from the design" paradigm. The P-core is stripped down strictly for single-threaded speed.

In addition to those changes, Intel also made big changes to the frequency management features of the CPU core. Rather than using pre-set static thermal guard-bands, Lion Cove features an "AI self-tuning controller" that the company says can adapt in real-time to conditions including workload, thermal solution, ambient temperature, and so on. It allows tighter frequency convergence, and in combination with an increase in clock scaling granularity from 100 MHz to 16.67 MHz intervals, the end result is higher sustained clock rates and thus performance.
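
A small sketch shows why the finer granularity matters. It assumes the core must run at the highest step at or below whatever limit the controller computes, and that 16.67 MHz means one sixth of 100 MHz; the 4,390 MHz target is a made-up number for illustration.

```python
from fractions import Fraction

COARSE_STEP = Fraction(100)     # MHz, the older granularity cited above
FINE_STEP = Fraction(100, 6)    # MHz, about 16.67 (assumed to be 100/6)


def snap_down(target_mhz: Fraction, step_mhz: Fraction) -> Fraction:
    """Highest multiple of step_mhz that does not exceed target_mhz."""
    return (target_mhz // step_mhz) * step_mhz


target = Fraction(4390)  # hypothetical limit computed by the controller
coarse = snap_down(target, COARSE_STEP)
fine = snap_down(target, FINE_STEP)

print(f"100 MHz steps:   {float(coarse):.2f} MHz, {float(target - coarse):.2f} MHz left unused")
print(f"16.67 MHz steps: {float(fine):.2f} MHz, {float(target - fine):.2f} MHz left unused")
```

With coarse steps the core in this example settles at 4,300 MHz and leaves 90 MHz unused, while the finer steps land within about 7 MHz of the target. In the worst case the rounding loss shrinks from just under 100 MHz to just under 16.67 MHz.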

On the microarchitecture overhaul side of things, Intel made massive changes to the processor's design, widening it all the way through. The branch prediction block is "up to eight times wider" than in Redwood Cove, and this enables the branch predictor to run ahead and prefetch code lines. Intel says that instruction cache request bandwidth was tripled to capitalize on this, while instruction fetch bandwidth was doubled to 128 bytes per cycle.

Meanwhile, decode bandwidth increased from 6 to 8 instructions per cycle, while both the micro-op cache and queue grew considerably: cache from 4000 to 5250 micro-ops, and queue from 144 to 192 entries. Both of these changes were primarily motivated by efficiency concerns, not performance. If ops are in the cache, the core doesn't have to power up the fetch/decode logic, and the larger micro-op queue allows the chip to support longer code loops.
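
For a sense of scale, the relative growth of each front-end structure can be worked out from the figures quoted above. The dictionary labels and helper name below are ours.

```python
# (old, new) values as quoted in the text: Redwood Cove vs. Lion Cove
FRONT_END_CHANGES = {
    "decode width (instructions/cycle)": (6, 8),
    "micro-op cache (micro-ops)": (4000, 5250),
    "micro-op queue (entries)": (144, 192),
}


def growth_pct(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


for name, (old, new) in FRONT_END_CHANGES.items():
    print(f"{name}: {old} -> {new} (+{growth_pct(old, new):.1f}%)")
```

All three structures grow by roughly a third, so the decoders, micro-op cache, and micro-op queue scale up more or less in step with one another.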

One of the biggest changes is a split in the out-of-order engine. Intel has separated the renamers and scheduling into dedicated integer and vector domains. This allows power savings in domain-specific workloads, but the real benefit is that it allows Intel to modify these domains in future designs without rejiggering the whole structure.

As we noted earlier, the whole CPU core is wider in Lion Cove. The out-of-order engine gets between 25% and 50% wider across the board, while both integer and vector domains grow in capacity. In particular, the integer block can now perform three 64-bit integer multiplications simultaneously, and the number of 256-bit FP dividers grows from just one in Redwood Cove to two in Lion Cove. It also gains a fourth SIMD ALU.

Lunar Lake's Revamped Memory Subsystem

Lion Cove additionally features a major rework of the cache hierarchy in the processor. Intel says it "re-architected" this part of the CPU core to reduce the average cache latency. To that end, the old L1 cache is now considered an L0 cache with access latency reduced by 20%, while the new 192KB L1 cache sits between the L0 and L2 in terms of latency. The L2 cache also grows relative to Redwood Cove, expanding to 2.5MB in Lunar Lake, while the implementation of Lion Cove in Arrow Lake will apparently have a full 3MB of L2 cache.

intel lion cove p core lunar lake 19

This slide shows the fruits of Intel's labors, at least by its own metrics: an average of 14% performance gain clock-for-clock, and over 18% better performance than Redwood Cove (as implemented in Meteor Lake) at the lowest power limits. The performance advantage falls off slightly at higher power levels, but remains in the double digits even near the top of Lunar Lake's power band.

intel lion cove p core lunar lake 21

Finally, here we see the change in Intel's design philosophy that has taken effect with Lion Cove. On the left, we have a jumble of hand-drawn functional blocks, or "fubs", while on the right we have a collection of what Intel calls "synthesis-based partitions of hundreds of thousands to millions of cells."

The short version is that this reduction of artificial physical boundaries in the design has led to better efficiency both in terms of power and area. It has also offered a shorter "hardening time," which means Intel needs less time between iterations of the P-core design. Indeed, Lunar Lake's "Lion Cove" is apparently different in "several aspects" from theoretically the same CPU core as implemented in Arrow Lake, coming later this year.

Lion Cove is only one part of the big changes in Lunar Lake. Let's take a look at the other x86 architecture in Intel's new chips: Skymont.
