In Part I of this series, we discussed ARM's business model and how it works with its various partners as compared to Intel. Today, we're diving into a specific technology that ARM believes will differentiate its products and deliver performance superior to Santa Clara's upcoming 22nm Bay Trail.
big.LITTLE is ARM's solution to a particularly nasty problem: new process nodes no longer deliver the kind of overall power consumption improvements they did prior to 2005. Before 90nm, semiconductor firms could count on new chips being smaller, faster, and drawing less power at a given frequency. Eight years ago, that stopped being true. Tighter process geometries still pack more transistors per square millimeter, but the gains in power consumption and maximum frequency have shrunk with every node. Rising defect densities have already created a situation where -- for the first time ever -- 20nm chips won't be cheaper than the 28nm processors they're supposed to replace. This is a critical problem for mobile, where low power consumption is absolutely vital.
big.LITTLE is ARM's answer to this problem. The strategy requires manufacturers to implement two sets of cores -- the Cortex-A7 and Cortex-A15 are the current pairing, though long term, a wide variety of options are possible. The idea is for the little cores to handle the bulk of the phone's work, with the big cores used for occasional heavy lifting. ARM's argument is that this approach is superior to dynamic voltage and frequency scaling (DVFS) because it's impossible for a single CPU architecture to retain a linear performance/power curve across its entire frequency range. This is the same argument Nvidia made when it built the Companion Core in Tegra 3.
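The core-selection logic behind that argument can be sketched with a toy energy model. The numbers below are illustrative only (not ARM's data), and the deadline-driven policy is our own simplification: pick the cheapest core that still finishes the work on time.

```python
# Toy energy model for big.LITTLE core selection.
# Performance and power figures are made up for illustration.
A7 = {"name": "A7", "perf": 1.0, "active_mw": 100}    # little: slow but frugal
A15 = {"name": "A15", "perf": 3.0, "active_mw": 600}  # big: 3x faster, 6x power

def energy_mj(core, work_units):
    """Energy to finish the work on this core: power (mW) * time (s) = mJ."""
    seconds = work_units / core["perf"]
    return core["active_mw"] * seconds / 1000.0

def pick_core(work_units, deadline_s):
    """Choose the lowest-energy core that still meets the deadline.
    If neither can, fall back to the fastest (the A15)."""
    candidates = [c for c in (A7, A15) if work_units / c["perf"] <= deadline_s]
    if not candidates:
        return A15
    return min(candidates, key=lambda c: energy_mj(c, work_units))
```

In this model the A7 always wins on energy per unit of work, so the big core is only justified when responsiveness demands it, which is exactly the division of labor the paragraph above describes.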
In theory, this gives you the best of both worlds. Actual implementation, unfortunately, has proven to be a bit more complicated.
Implementing big.LITTLE in Software
There are three ways to build a big.LITTLE design. The first and simplest is cluster migration. When load on one cluster crosses a threshold, the system transitions to the other cluster. All relevant data is passed through the common L2 cache, one set of cores powers down, and the other powers up. This is transparent to the OS, which always sees just four cores. The problem with this approach is that a poorly tuned scheduler can leave substantial power savings on the table. If the big A15 cores wake up too early, workloads that could have run on the low-power Cortex-A7's end up on the A15's.
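A minimal sketch of that switching policy, with hysteresis to keep a badly tuned threshold from ping-ponging between clusters (the thresholds are invented for illustration, not taken from a shipping scheduler):

```python
class ClusterMigrator:
    """Toy cluster-migration model: the whole SoC runs on either the
    LITTLE cluster or the big cluster, never both at once."""

    def __init__(self, up=0.85, down=0.30):
        self.up = up        # load at which we switch LITTLE -> big
        self.down = down    # load at which we drop big -> LITTLE
        self.cluster = "LITTLE"

    def tick(self, load):
        # Separate up/down thresholds (hysteresis) prevent rapid
        # back-and-forth switching; if `up` is set too low, the A15s
        # wake early and waste the power savings described above.
        if self.cluster == "LITTLE" and load >= self.up:
            self.cluster = "big"      # power up A15s, hand state over via L2
        elif self.cluster == "big" and load <= self.down:
            self.cluster = "LITTLE"   # drop back to the A7s
        return self.cluster
```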
The second model is CPU migration. In this model, each big core is virtually paired with a little counterpart. If the system detects a high load on LITTLE CPU 0 (A7), it ramps up big CPU 0 (A15) and moves the workload over to the larger core. Again, no more than four cores are active at any given time, but this allows for finer-grained control than whole-cluster switching.
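Per-pair migration can be sketched the same way, but with each virtual CPU making its own decision independently (again, the thresholds are illustrative):

```python
def cpu_migrate(states, loads, up=0.8, down=0.3):
    """Toy CPU-migration step: each slot is one A7/A15 virtual pair,
    with only one member of each pair powered at a time.

    states -- current core choice per slot, "A7" or "A15"
    loads  -- current load per slot, 0.0 to 1.0
    Returns the new per-slot core choices."""
    out = []
    for state, load in zip(states, loads):
        if state == "A7" and load >= up:
            out.append("A15")   # this slot alone ramps up to its big core
        elif state == "A15" and load <= down:
            out.append("A7")    # this slot alone drops back down
        else:
            out.append(state)
    return out
```

The payoff over cluster migration is visible in the output: one busy slot can move to its A15 while an idle neighbor stays on its A7.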
The third model is the long-term goal: a global task scheduler. This requires an intelligent software scheduler that sees all cores simultaneously, understands which workloads are best suited to run on which cores, and can schedule them appropriately. Combined with HSA, this allows the system to maximize performance in virtually any workload. It takes less time to transfer data between cores, and it's possible to build non-symmetric processor layouts. This last point is crucial. In the first two types of big.LITTLE designs, cores must be implemented 1:1, with one A15 for every A7 and vice versa. A global task scheduler removes this constraint.
The advantage to a global task scheduler is that you no longer take a mandatory hit when switching between clusters (it takes a non-zero amount of time to transfer data) and you can use all cores simultaneously. Unlike cluster and CPU migration configurations, a global scheduler can use asymmetric ARM configurations. Want a quad-core Cortex-A7 with a dual-core A15? You can have that. Want an A5, two A7's, and one A15? You could have that, too.
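A toy placement policy shows what that asymmetric freedom looks like in practice. This is our own greedy simplification (heavy tasks to big cores first, everything else fills the little cores), not the actual Linux big.LITTLE MP scheduler:

```python
def gts_schedule(tasks, little=4, big=2, heavy=0.6):
    """Toy global-task-scheduler placement over an asymmetric layout
    (e.g. four A7's plus two A15's). Unlike cluster or CPU migration,
    both clusters can run tasks at the same time.

    tasks -- {task_name: load}, load in 0.0 to 1.0
    Returns {task_name: core_name}."""
    big_free = [f"A15-{i}" for i in range(big)]
    little_free = [f"A7-{i}" for i in range(little)]
    placement = {}
    # Place the heaviest tasks first so they get first pick of the A15s.
    for name, load in sorted(tasks.items(), key=lambda t: -t[1]):
        if load >= heavy and big_free:
            placement[name] = big_free.pop()
        elif little_free:
            placement[name] = little_free.pop()
        elif big_free:
            placement[name] = big_free.pop()   # overflow onto an idle A15
        else:
            placement[name] = "runqueue"       # every core is busy
    return placement
```

With a 4+2 layout, a heavy render thread lands on an A15 while light UI and sync threads run concurrently on A7's, something neither migration model can express.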
ARM vs. Intel: Who Has the Superior Solution?
To date, Intel has eschewed a big.LITTLE approach in favor of DVFS -- dynamic voltage and frequency scaling. As the name implies, Intel uses DVFS to reduce power consumption by dropping the CPU into the lowest possible power state, transitioning out of that state when needed, and returning to it when the need for additional performance has dropped off again. One of the problems with comparing the two approaches is that "performance," in this case, refers to task completion time, total power consumed during the completion of that task, and how effectively the operating system manages the power conservation features of the CPU.
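The DVFS side can be sketched as a simple governor over a frequency ladder. The step values and thresholds below are invented for illustration; the "jump straight to the top" behavior mirrors the race-to-idle philosophy behind governors like Linux's ondemand, not any specific Intel implementation:

```python
# Toy DVFS governor: one core, several voltage/frequency steps.
STEPS_MHZ = [200, 600, 1000, 1600, 2000]   # illustrative ladder

def next_freq(freq, load, up=0.80, down=0.40):
    """Return the next frequency for the core given its current load.

    High load: jump straight to the top step so the task finishes
    quickly and the core can return to a deep sleep state sooner.
    Low load: decay one step at a time toward the floor."""
    i = STEPS_MHZ.index(freq)
    if load >= up:
        return STEPS_MHZ[-1]
    if load <= down and i > 0:
        return STEPS_MHZ[i - 1]
    return freq
```

The contrast with big.LITTLE is that here one set of cores spans the whole performance range by changing its operating point, rather than handing work to a different, more suitable core.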
Samsung's Exynos Octa was supposed to be big.LITTLE's major debut, but all available evidence suggests that the CPU's implementation is broken. That would explain why Samsung's flagship, the Galaxy S4, recently launched with a Qualcomm processor inside the US version, with the much-touted Exynos 5 Octa relegated to the international versions of the phone and the Korean model. Reports indicate that the CCI-400 (Cache Coherent Interconnect) module that makes big.LITTLE possible is disabled on the device and can't be enabled via software.
As far as triumphant debuts are concerned, that's problematic -- but it doesn't say anything about the underlying usefulness of big.LITTLE as a whole. ARM showed us demos of asymmetric configurations in action, and it's clear that chips that implement a GTS can save power compared to those that don't.
(Image: big.LITTLE MP -- Global Task Scheduling on an asymmetric core implementation. Cumulative energy for a conventional dual-core vs. three A7's + two A15's is shown in the center.)
The other fact worth pointing out is that while big.LITTLE is an alternative to the kind of frequency and voltage scaling Intel uses, ARM processors are compatible with DVFS techniques as well. A manufacturer like Qualcomm or Samsung could build a chip around a DVFS approach rather than a big.LITTLE design, or could even implement both. Again, this is somewhat dependent on available foundry technology from TSMC or GlobalFoundries, but it's far from impossible.
So where does that leave us? Waiting for the next round of products, on both sides. Intel unquestionably needs Bay Trail to be a major success story; continuing softness in the PC market threatens the company's bottom line, and it needs to demonstrate a chip that can compete squarely on ARM's turf. big.LITTLE, meanwhile, has only a few adoptees. That will likely change once the relevant patches are merged into Android and software support picks up, but this sort of chicken-and-egg scenario is always a slow process.
Conclusion: Is big.LITTLE a kludge?
The last point we want to talk about is one that ARM hotly disputes -- the idea that big.LITTLE, far from being a vital part of the company's strategy, is a patch designed to fix a problem: the Cortex-A15 is more processor than modern smartphones can comfortably handle. There's some truth to that last statement -- the Nexus 4 runs faster if you drop it in a freezer. Cortex-A9 designs didn't really have that problem.
ARM's literature always made it clear that the Cortex-A15 wasn't really meant for most phone applications; the company initially showcased it as fitting into the top of the smartphone line, but with a greater role to play in tablets and other larger devices. All the same, much of the work being done on the Cortex-A15 is focused on bringing power consumption down. That's because vendors that want to use ARM's own design, rather than building a Cortex-A15-class product like Qualcomm's Krait or Apple's Swift, need a design that can compete with those chips in both areas.
After talking to ARM and looking over the data the company provided, I don't think big.LITTLE was a last-minute attempt to shoehorn Cortex-A15 into lower power envelopes. I suspect instead that big.LITTLE's implementation has proven more difficult than ARM anticipated, while the first Cortex-A15 products also drew somewhat more power than expected. If we step back and look at the big picture, however, major shifts in microprocessor technology typically don't land gracefully.
(Image: Once Global Task Scheduling is functional, it can be used to boost performance over and above the A15 cores alone.)
Intel's first out-of-order microprocessor architecture, the Pentium Pro, debuted to lackluster sales. Windows 2000 and Windows XP didn't properly support Intel's SpeedStep initially, and Hyper-Threading wasn't fully supported until Windows XP SP1. AMD debuted its new 64-bit architecture in 2003, but the market didn't start seriously adopting 64-bit Windows until 2009. It takes time, and effort, for software development to catch up with hardware, so it's not particularly odd that big.LITTLE adoption is a bit clunky at the moment.
Once Bay Trail and a refreshed set of big.LITTLE devices ship (hopefully with GTS support directly baked into the kernel), we'll be able to get an idea of where these two approaches stand when compared against each other. big.LITTLE has real promise in the long run, but adoption will hinge on a great many factors -- some of which ARM can't control.