Hide your chickens,
Facebook is pouring egg on its face like nobody's business this week, and may not be finished. It started when
60 Minutes aired an interview with a former Facebook product manager who secretly copied tens of thousands of seemingly damning documents, alleging the social media giant has a culture of
putting profits over user safety (let's all feign surprise). Then the next morning, Facebook went dark, along with several of its other properties, including WhatsApp, Messenger, and Instagram. Was Facebook hacked? Did someone trip over a power cord? Could it be aliens?
While Facebook users were scrambling to figure out an alternative place to upload pictures of their meals, engineers at the social network were scrambling to fix
whatever was broken, and get the social network and its adjacent properties back online. And to their credit, they eventually succeeded.
If we take Facebook at its word, configuration changes caused an outage that Downdetector described as "one of the largest ever tracked" on its site, in terms of both the total number of reports (over 14 million) and the duration (a literal eternity, depending on who you ask).
Facebook's Explanation For The Complete Outage Of All Its Web Properties And Services
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt," Facebook explained in a blog post.
So, that's Facebook's
official stance on the outage. Team Zuckerberg also claims there is
"no evidence of user data [being] compromised as a result of this downtime." Just the usual data tracking that knows more about you than your own mother.
For anyone who is interested, the folks at Cloudflare break down
how Facebook fell so hard, with a decent explanation on the technical side (as it relates to DNS, BGP, and other important acronyms). The plain-language version is that Facebook went dark, it wasn't Cloudflare's fault, and from the outside it looked as though Facebook had simply disconnected itself from the web entirely, as if someone had pulled all the power cables from the site's data centers. Hmm...
"Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses," Cloudflare explains.
Hilarity ensued, as memes and jokes quickly permeated the web...
There was also a funny rumor saying Facebook had to bring in someone with an angle grinder to cut through the server cage, which is one of those things that sounds so ridiculous, it just has to be true. The folks at The New York Times initially thought so too, but later clarified that Facebook engineers "had issues getting in because of physical security, but did not need to use a saw/grinder." Perhaps a Dremel, then?
HotHardware Writers Weigh In On Facebook's Outage
All of this is to say, nobody knows for sure what the hell happened, except Facebook. But do you take
Zuckerberg and company at their word? We have some thoughts...
Brandon 'Tinfoil Hat' Hill: I personally think it was an inside job, sparked by the 60 Minutes piece. The fact that Facebook went down so hard and for so long leads me to believe that it was perpetrated by someone with great knowledge of the inner workings of the company's network structure. They knew where to hit and how to do the most damage on a broad scale.
Ben 'Something Smells Funky About This' Funk: Personally, I think the timing is a little too convenient — or maybe inconvenient — to not be related. Frances Haugen proved that a Facebook employee can actually have a soul, and I wonder if some other FB employee hasn't totally sold theirs to the company either. Nobody should be able to do what was done from the outside, so I think somebody saw an opportunity to prevent Facebook from controlling the narrative on its own platform.
PS: Yeah, I think Brandon nailed it.
Dave 'Site Owner So He Gets A Nice Caption' Altavilla: There were also multiple reports of this (angle grinder) and one Twitter account was actually suspended too. It just all seems fishy to me.
It's just suspect to me that something as simple as a BGP misconfiguration could bring down a company with such highly sophisticated data center infrastructure and massive resources, both internal and external.
And then there's the timing of it, the very next day following the 60 Minutes bomb.
I'd lean toward sabotage. Did someone internally know Facebook's Achilles' heel and help stage this (inside job)?
Something isn't adding up. We've had BGP router updates fail in our previous colocation environment here at HH. The fix came a couple of hours later and only took like half of our traffic offline.
It just seems too catastrophic that all of the services—FB, WhatsApp, Instagram, and Oculus—were dark, and all day long at that.
We may never know the real story, especially if it is security related.
Marco 'You Asked For A Paragraph So Here's An Essay' Chiappetta: An inside job seems plausible to me, but if all of the moving parts prove true, there are some things that just don’t make sense. If it really was a BGP misconfiguration (whether intentional or accidental), Facebook would have known something was amiss almost instantly and would have been able to react relatively quickly. If they were inadequately prepared to make that change and it caused catastrophic failure across all of their properties, at the very least, that screams of gross incompetence.
Even still, you would think a massive operation that includes FB, Instagram, WhatsApp, Oculus (and others), and the company’s own internal work/tracking tools and VPN, wouldn’t be tied to any single data center. The NYT was reporting that some employees’ keycards were no longer working to access certain buildings. And others were reporting that the main reason for the prolonged outage was that techs couldn’t gain physical access to a particular rack of servers in a data center. Having a single point of failure that would affect every single property, and limit physical access, and bring down employees’ ability to work doesn’t seem possible without the outage being specifically designed to do so.
You don’t bring down 40M+ square feet of data centers, across 18+ campuses worldwide, in addition to executive offices, with a router misconfiguration.
Chris 'Marco Wrote So Much I'll Just Say This' Goetting: "Never attribute to malice that which is adequately explained by stupidity" - Hanlon's Razor
Brittany 'Chin Up, FB, Better Days Are Ahead' Goetting: I know that this is probably a very underwhelming response, but my personal opinion is that Facebook was having a very bad day.
Paul 'Probably Will Be Fired' Lilly: Y'all are giving Facebook too much credit by shifting blame (credit?) to hackers or a DDoS or whatever. The answer is right in front of you...a robot malfunction. Zuckerberg blew a capacitor, which prompted a killswitch among his robot minions, which then began wreaking havoc in the data center. Cloudflare basically says so if you read between the lines. It's like you folks have never watched a Boston Dynamics video or seen Terminator. Be nice to bots.
What do you think? We'd be delighted if you'd share your thoughts with us in the comments section below, and/or in our
Discord channel.