Cisco Blames Router Bug On Cosmic Radiation, Reeks Of Weak Sauce Excuses, Fantastic Four Unavailable For Comment

Given the quick pace at which today’s technology progresses, we’re used to seeing products with a few software and even hardware bugs. While software bugs can often be squashed with competent coding, hardware bugs can be a bit more difficult to tame.

Such is the case with a bug that afflicts Cisco’s ASR 9000 Series routers. While hardware bug reports for big-iron networking appliances are not uncommon, it’s Cisco’s explanation of the malady that's causing some consternation in tech and IT circles. Cisco Bug CSCuz62750 is described as causing “partial data traffic loss”, with data loss sometimes even occurring after a CRC (Cyclic Redundancy Check). The kicker here, however, is that Cisco says that it has observed the software errors on an operational network and that they could possibly be triggered by “cosmic radiation.”

Cisco ASR 9000

Say what? Cosmic radiation? It’s not every day that we hear that explanation offered for software or hardware problems, but Cisco has no qualms about going down that path. While cosmic radiation has been known to cause bit errors, modern semiconductor chips have long had error checking incorporated into their designs to guard against this. In fact, end-to-end bit-error detection and correction are often hallmark requirements for much of the silicon used in enterprise appliances like a Cisco data center router.

One redditor bluntly calls out Cisco for traveling down the cosmic radiation rabbit hole, writing:

Cosmic radiation does not home in on a specific part of your box.... It would also hit the control plane and other parts. ECC memory tends to make this a non-issue. I'd call bullshit in this case. Request an EFA (Engineering Failure Analysis) to see if the hardware itself is at fault. If EFA comes back clean, then it's most likely software.

Another redditor confirms that cosmic radiation is a real threat, but backs up the above comments that it’s highly unlikely for one single component to show failures without other surrounding equipment also being affected. The ex-TAC engineer described an incident where a solar flare caused widespread destruction at his place of work:

We lost power supplies on numerous CPE devices, one customer lost their copier, server, router and firewall - all devices completely dead and all needed to be replaced. All were in the same room. The motherboard and controller card on the server were dead, the motherboard on the copier was dead, as was the router and firewall.

And yet another reddit veteran put the onus back on Cisco, writing:

ECC memory (usually SECDED) handles the possibility of cosmically-induced bitflips. Even if this is in the TCAM you'd think Cisco's prices would cover an extra bit or two. Cisco is very proud of using their own chips. If I was given this explanation I'd ask for the part number of the replacement with ECC or I'd switch to merchant silicon. In actuality it's the fallback answer of a diagnosis of exclusion.
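
For readers wondering what SECDED (single-error correction, double-error detection) actually buys you, here is a minimal Python sketch of an extended Hamming (8,4) code protecting a 4-bit value. The function names and bit layout are our own choices for illustration; real ECC memory typically protects much wider words in hardware, and nothing here reflects how Cisco's silicon works.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 is the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (nibble, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1              # one flip: syndrome is its position (0 = parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable"     # two flips: detected, but cannot be repaired
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                             # simulate one cosmic-ray bit flip
print(decode(word))                      # (11, 'corrected')
word[2] ^= 1                             # a second flip in the same word
print(decode(word))                      # (None, 'uncorrectable')
```

A single flipped bit is silently repaired, while two flips in the same word are at least flagged rather than passed along as good data, which is why the redditors expect protected memory to shrug off the occasional stray particle.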

So yes, cosmic radiation no doubt can be the source of bit errors in modern hardware, but properly designed components in mission-critical environments should already have safeguards in place to protect against such phenomena.

And if only one machine in a data center is routinely suffering from cosmic radiation-induced issues, while the rest of the data center appliances hum along happily without a blip, you know there's going to be a whole lot of finger-pointing among IT managers, CTOs and CIOs in more than a few conference rooms.