When 64-bit Isn't The Answer: Diving Into iPad Air and Apple A7 3DMark Performance
In 3DMark Ice Storm, the iPhone 5S is significantly faster in GPU workloads -- almost 3x as fast, in fact -- but its CPU performance is actually slightly slower than the A6's, as measured in the Physics test. The iPad Air shows exactly the same performance issue, though its CPU is clocked higher than those in the iPhone 5 and 5S, so it shows a small improvement. The 3DMark team sat down to tease out what the problem was in 32-bit code, investigate whether shifting to 64-bit code would fix it, and explore why the A7, which is generally much faster than the A6, shows no improvement here.
Benchmark Data, Credit: Futuremark
What they found is a fabulous example of how major improvements on paper, like the shift from 32-bit to 64-bit, can be overshadowed by small issues in practical implementation.
Teasing apart the data
Moving the code to 64-bit improved the A7's performance by about seven percent. That's actually fairly good, and in line with what we often saw when AMD and Intel chips were making the same transition. What it doesn't answer, however, is why there's no performance delta between the two CPUs in 32-bit code. For this, the Futuremark team had to dig deeper. The difference, it turns out, is tied to the open source Bullet physics library that 3DMark Ice Storm relies upon for testing CPU performance.
From the article: "The purpose of the Physics test is to measure the CPU's ability to calculate complex physics simulations. The Ice Storm Physics test has four simulated worlds. Each world has two soft bodies and two rigid bodies colliding with each other. This workload is similar to the demands placed on the CPU by many popular physics-based puzzle, platform and racing games." The point of using Bullet in the first place is to ensure cross-platform compatibility using a library that anyone can check for unfair optimizations or tweaks that favor a particular microarchitecture.
Since the CPU physics test spends most of its time in one particular code area, PSolve_links(), it made sense to duplicate that code's functionality and test it in a stand-alone application. Doing so showed the A7 performing markedly faster than the A6, which makes precious little sense given the results inside the benchmark. The difference between the stand-alone implementation of PSolve_links() and the version inside the Bullet library is the way data is stored and accessed in memory. If the data arrays used to solve the physics problems are stored sequentially, the A7 is fast -- very fast. If they're distributed randomly, it takes a significant performance hit. The A6, in contrast, runs at about the same speed in both cases.
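To make the sequential-versus-scattered distinction concrete, here is a minimal sketch of a link-solving loop. It is not Bullet's PSolve_links() code: the function names, the 1-D positions, and the rest and stiffness parameters are all assumptions made for illustration. The arithmetic is identical in both cases; only the order in which node data is touched changes.

```python
import random

def solve_links(pos, links, rest=1.0, stiffness=0.5):
    """One relaxation pass over 1-D distance constraints.

    Hypothetical stand-in for a soft-body link solver, not Bullet's code.
    Each link (a, b) reads and writes the positions of two nodes.
    """
    for a, b in links:
        delta = pos[b] - pos[a]
        correction = stiffness * (delta - rest) * 0.5
        pos[a] += correction
        pos[b] -= correction
    return pos

def make_links(n, shuffled, seed=0):
    """Chain of n nodes; optionally visit the links in a scattered order."""
    links = [(i, i + 1) for i in range(n - 1)]  # sequential: neighbours in memory
    if shuffled:
        random.Random(seed).shuffle(links)      # scattered: jumps around the array
    return links

# Same work, two traversal orders.
sequential = make_links(8, shuffled=False)
scattered = make_links(8, shuffled=True)
```

In Python this only shows the access pattern, not the timing gap. In a compiled language with contiguous arrays, the sequential order is the one that lets hardware prefetching and caches do their job, which is the behavior the article attributes to the A7.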
The other factor separating the iPhone 5 and the 5S was a particular data dependency -- a point within the code where the CPU cannot speculatively prefetch or execute because it's waiting on a result from the program that's vital to the performance of the application. Speculative prefetching is a technique that can boost performance significantly in certain circumstances but is also a bit of a power hog -- you're burning energy executing instructions or fetching data that may or may not actually be needed. Of the two findings, the chip's sensitivity to sequential memory accesses is the more interesting one -- data dependencies are a known factor that can hamper performance on any architecture.
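As a rough illustration of what a data dependency looks like in code, the sketch below contrasts a dependent chain of lookups with a set of independent ones. Both functions and their names are hypothetical and are not taken from 3DMark or Bullet.

```python
def chase(next_index, start, steps):
    """Dependent chain: each lookup's index is the previous lookup's result."""
    i = start
    for _ in range(steps):
        i = next_index[i]  # the next address is unknown until this load returns
    return i

def gather(values, indices):
    """Independent loads: every index is known before the loop starts."""
    total = 0
    for i in indices:
        total += values[i]  # load addresses do not depend on loaded values
    return total

ring = [1, 2, 3, 0]                         # each slot points at the next one
end = chase(ring, start=0, steps=5)         # walks 0 -> 1 -> 2 -> 3 -> 0 -> 1
total = gather([10, 20, 30, 40], [3, 0, 2])
```

In the first function a processor has nothing to fetch ahead of time, because each address comes out of the previous load. In the second, all addresses are available up front, so the loads can overlap.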
Benchmark Data, Credit: Futuremark
Total performance on 3DMark's physics test rose 17% in the customized version of the application, from 7783 to ~9100 points, as compared to the A6's score of 8197. Changing how the Bullet library stored data in memory had no impact on the Apple A6 SoC and a small negative impact on Qualcomm-powered devices. While this doesn't dramatically change how the iPhone 5S ranks in 3DMark, it shows how the advantage of a big change (32-bit to 64-bit) can actually be much smaller than the impact of a low-level optimization that better matches how a CPU best performs a task.
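The percentages can be checked directly from the quoted scores. The snippet below is just that arithmetic; the helper name is made up for this example, and the 9100 figure is the article's approximate value, so the results are approximate as well.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

a7_stock = 7783    # A7, standard Physics test
a7_custom = 9100   # A7, customized memory layout (approximate, per the article)
a6 = 8197          # A6, standard Physics test

gain_from_layout = pct_change(a7_stock, a7_custom)  # roughly +17%
custom_vs_a6 = pct_change(a6, a7_custom)            # roughly +11%
stock_vs_a6 = pct_change(a6, a7_stock)              # roughly -5%
```

Set against the roughly seven percent gained from the 64-bit build, the memory layout change is worth more than twice as much on this test.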
Futuremark has stated it has no plans to release this specifically optimized version for Apple devices, as it would slightly penalize Android products and the company does not optimize its benchmarks for any specific product. While we agree that such optimizations are very useful when it comes to understanding the performance of the SoC, the benchmark itself is written to reflect how game authors typically design code, and this degree of diagnostic optimization is unlikely to be present in a shipping product.