Cloudflare Apologizes For Massive Outage & Details What Went Horribly Wrong

An event at Cloudflare that crippled large portions of the internet yesterday was not caused by a cyber attack or malicious activity, either directly or indirectly, the company confirmed in a blog post. If that's the case, then what caused all the ruckus? It turns out that a change in permissions to one of Cloudflare's databases triggered an unfortunate series of events.

Following the change in permissions, the database began outputting multiple entries into a so-called 'feature file' employed by Cloudflare's Bot Management system. In doing so, the feature file unexpectedly doubled in size and was then propagated to every machine in Cloudflare's network.

"The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail," Cloudflare explains.
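To make that failure mode concrete, here is a minimal Python sketch of a loader that enforces a hard cap on a feature file. This is not Cloudflare's code: the names `MAX_FEATURES`, `FeatureFileTooLarge`, and `load_features`, the cap value, and the one-feature-per-line format are all assumptions made up for illustration.

```python
MAX_FEATURES = 200  # hypothetical cap, chosen only for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file holds more entries than the cap allows."""


def load_features(lines, limit=MAX_FEATURES):
    """Parse one feature name per line, refusing files above the limit."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > limit:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {limit}"
        )
    return features


# A normal file fits under the cap; the same file with every entry
# duplicated (the "doubled" case described above) does not.
normal = [f"feature_{i}" for i in range(120)]
doubled = normal + normal

print(len(load_features(normal)))
try:
    load_features(doubled)
except FeatureFileTooLarge as err:
    print("load failed:", err)
```

In this sketch the caller decides what to do with the error. If nothing catches it, the process stops, which is roughly the kind of hard failure the quote describes.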

Initially, Cloudflare suspected it was being overwhelmed by a hyper-scale distributed denial of service (DDoS) attack. It was a reasonable hypothesis at the time, given that DDoS attacks have been growing in scale. To wit, Microsoft Azure recently thwarted a record-breaking 15.72Tbps DDoS attack.

It didn't take long for Cloudflare to identify that it wasn't a DDoS attack at play, but basically a file error. Cloudflare was able to mitigate the error by replacing the feature file with an earlier version that wasn't above the file size limit.
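The fix described here, swapping in an earlier file that fits under the limit, can be sketched in a few lines of Python. This is only an illustration of the idea, not how Cloudflare performed the rollback; the function name `pick_feature_file`, the list-of-lists history, and the limit value are hypothetical.

```python
def pick_feature_file(versions, limit):
    """Return the newest version whose entry count is within the limit.

    `versions` is ordered oldest to newest; each version is a list of
    feature names. Returns None if no version fits.
    """
    for version in reversed(versions):
        if len(version) <= limit:
            return version
    return None


# Hypothetical history: the newest file has every entry duplicated.
history = [
    ["a", "b"],
    ["a", "b", "c"],
    ["a", "b", "c", "a", "b", "c"],
]

chosen = pick_feature_file(history, limit=4)
print(chosen)
```

With a limit of 4, the doubled file is skipped and the previous three-entry version is selected instead.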

That wasn't the end of it, though. As systems came rushing back online, Cloudflare spent the next several hours mitigating the increased load on different parts of its network. This would explain why sites would sometimes load and sometimes not, which is something we observed here at HotHardware.

"We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today," Cloudflare stated.

For anyone interested, Cloudflare's detailed blog post takes a deeper dive into how the company handles traffic and processes requests. It's an interesting inside look at how the sausage is made, so to speak.