Solder Problems Trip Up Titan Supercomputer, Delay Final Certification


News Posted: Thu, Mar 14 2013 6:54 PM
The Titan supercomputer buildout at Oak Ridge National Laboratory was a much-publicized upgrade of the older Jaguar supercomputer already on-site. Nvidia and AMD heavily promoted the facility's decision to combine Opteron processors with Nvidia's new K20/K20X-based graphics cards. When it launched the GTX Titan last month, Nvidia told us that one reason it settled on that name was the association with the Titan project.

That deployment, it turns out, isn't working as it should. Stability problems have kept the supercomputer from passing its final certification tests. The system needed to hit a 95% successful completion rate in a 14-day series of stress tests, but only managed 92-93%. It took the troubleshooting team several months of testing to narrow down the cause.
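The pass/fail arithmetic behind that criterion is easy to sketch. The Python below is only an illustration, not the test harness ORNL or Cray actually used; the function names and the job counts are hypothetical, and only the 95% threshold and the 92-93% result come from the reporting above.

```python
# Minimal sketch of a completion-rate acceptance check.
# Assumption: "completion rate" means completed jobs / attempted jobs.
# The function names and job counts are hypothetical.

REQUIRED_RATE = 0.95  # threshold reported for the 14-day stress test


def completion_rate(completed: int, attempted: int) -> float:
    """Return the fraction of attempted jobs that completed successfully."""
    if attempted <= 0:
        raise ValueError("attempted must be a positive job count")
    if not 0 <= completed <= attempted:
        raise ValueError("completed must be between 0 and attempted")
    return completed / attempted


def passes_acceptance(completed: int, attempted: int,
                      required: float = REQUIRED_RATE) -> bool:
    """True if the observed completion rate meets the required rate."""
    return completion_rate(completed, attempted) >= required


if __name__ == "__main__":
    # Hypothetical run: 925 of 1000 jobs finish, i.e. 92.5%.
    print(passes_acceptance(925, 1000))  # falls short of the 95% bar
```

A run in the reported 92-93% range fails this check no matter how many jobs were attempted, which is why the system could not be signed off.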

The culprit is the gold-tin mixture in the solder used on Cray's motherboards. There's some indication that the affected connections are in the GPU socket, and GPU-CPU communications have been adversely affected. It's a claim that makes sense, given what we know of gold-tin solder. If too much gold is used in the solder mixture, the solder can become brittle. The recommended limit is 3% gold by weight, though some evidence suggests that solder can become brittle at even lower percentages.
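To make the "3% gold by weight" figure concrete, here is a rough Python sketch of the weight-percent calculation. The masses are invented for illustration and the function names are hypothetical; nothing here reflects measurements from Titan's boards.

```python
# Rough sketch: gold content of a solder joint as a weight percentage.
# Assumption: weight percent = gold mass / total joint mass * 100.
# All masses below are made-up illustrative values.

GOLD_LIMIT_PERCENT = 3.0  # recommended limit cited in the article


def gold_weight_percent(gold_mass_mg: float, joint_mass_mg: float) -> float:
    """Return the gold content of a joint as a percentage of total mass."""
    if joint_mass_mg <= 0:
        raise ValueError("joint mass must be positive")
    if not 0 <= gold_mass_mg <= joint_mass_mg:
        raise ValueError("gold mass must be between 0 and the joint mass")
    return 100.0 * gold_mass_mg / joint_mass_mg


def exceeds_gold_limit(gold_mass_mg: float, joint_mass_mg: float,
                       limit_percent: float = GOLD_LIMIT_PERCENT) -> bool:
    """True if the joint contains more gold than the recommended limit."""
    return gold_weight_percent(gold_mass_mg, joint_mass_mg) > limit_percent


if __name__ == "__main__":
    # Hypothetical joint: 0.2 mg of gold in a 5 mg joint (about 4%).
    print(exceeds_gold_limit(0.2, 5.0))
    # Hypothetical joint: 0.1 mg of gold in a 5 mg joint (about 2%).
    print(exceeds_gold_limit(0.1, 5.0))
```

Since embrittlement has reportedly been seen below 3%, a joint that passes this check is not guaranteed to be sound; the limit is a guideline, not a hard boundary.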

Here's an example of badly embrittled solder at 3400x magnification. It looks like there are plates and ridges within the solder. This is more or less the opposite of what you want a joint to look like (and in truly severe cases, brittle solder can result in components literally falling off a board). For those of you who may recall Nvidia's solder problems from several years back, this is completely unrelated: the problem here is in the solder compounds Cray used, and has nothing to do with Nvidia.

Small changes in solder formula can have significant long-term impacts on component lifespan, and this isn't the first time we've seen companies struggle with the move away from lead-based solder. Cray is working to repair the damaged cabinets, and the Titan team at ORNL hopes to have the problem resolved by early summer. ORNL's scientific director, Jeff Nichols, emphasized that the laboratory wasn't willing to accept a 92% stability rate or the erratic results it had seen, and looked forward to bringing the system up to full power. "We take it very seriously," he said. "It wasn't enough for us."
mhenriday replied on Sun, Mar 17 2013 10:15 AM

Couldn't help being reminded of my (unsuccessful) attempt to re-ball a motherboard from a 17.3" HP Pavilion....


Dorkstar replied on Mon, Mar 18 2013 2:16 PM

I'm always amazed at an engineer's ability to identify problems like these. They really must have some educated fellas working up there.
