The Titan supercomputer buildout at Oak Ridge National Laboratory was a much-publicized upgrade of the older Jaguar supercomputer already on-site. Nvidia and AMD heavily publicized the facility's decision to combine Opteron processors with Nvidia's new K20/K20X-based graphics cards. When it launched the GTX Titan last month, Nvidia told us that one reason it settled on that name was the association with the Titan project.
That deployment, it turns out, isn't working as it should. Hardware problems have kept the supercomputer from passing its final stability tests. The system needed to hit a 95% successful completion rate in a 14-day series of stress tests, but only managed 92-93%. It took the troubleshooting team several months of testing to narrow down the cause.
The culprit is the gold-tin mixture in the solder used on Cray's motherboards. There's some indication that the affected traces are in the GPU socket, and GPU-CPU communications have been adversely affected. It's a claim that makes sense, given what we know of gold-tin solder. If too much gold is used in the solder mixture, the solder can become brittle. The recommended limit is 3% gold by weight, though some evidence suggests that solder can become brittle at even lower percentages.
Here's an example of badly embrittled solder at 3400x magnification. It looks like there are plates and ridges within the solder. This is more or less the opposite of what you want a joint to look like (and in truly severe cases, brittle solder can result in components literally falling off a board). For those of you who may recall Nvidia's solder problems from several years back, this is completely unrelated: the problem here is in the solder compounds Cray used, and has nothing to do with Nvidia.
Small changes in solder formula can have significant long-term impacts on component lifespan, and this isn't the first time we've seen companies struggle with the move away from lead-based solder. Cray is working to repair the damaged cabinets, and the Titan team at ORNL is hoping it can have the problem resolved by early summer. ORNL's scientific director, Jeff Nichols, emphasized that the laboratory wasn't willing to accept a 92% stability rate or the erratic results it had seen, and looked forward to bringing the system up to full power. "We take it very seriously," he said. "It wasn't enough for us."