Facebook hit by “worst outage” in four years
Facebook has apologised for its worst outage since it went mainstream, after a configuration error took the social-networking site down for several hours.
The site first failed on Wednesday, and was down again last night. “This is the worst outage we’ve had in over four years, and we wanted to first of all apologise for it,” Robert Johnson, director for software engineering, said in a note on the Facebook engineering blog.
“The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition,” he explained. “An automated system for verifying configuration values ended up causing much more damage than it fixed.”
Facebook uses an automated system to check cached configuration values against a persistent copy. However, the company had changed the persistent copy of a value in a way that the system treated as invalid. The system kept checking and rechecking the value, generating hundreds of thousands of database queries each second.
Once the databases were overwhelmed, the problem worsened, as the system saw the error messages as more invalid values, causing it to send even more queries. “We had entered a feedback loop that didn’t allow the databases to recover,” Johnson said.
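The loop Johnson describes can be illustrated with a minimal Python sketch. It is not Facebook's code: the names naive_get, guarded_get, query_db and is_valid are hypothetical, and the database is reduced to a function that either returns a value or raises an error. The first function treats an error like one more invalid value and keeps querying; the second caps its attempts and stops as soon as the database reports trouble.

```python
class DatabaseError(Exception):
    """Raised by the (hypothetical) persistent store when it is overloaded."""


def naive_get(key, cache, query_db, is_valid, max_queries=1000):
    """Re-query whenever the cached value looks invalid.

    Errors are treated like invalid values, so a failing database
    attracts more queries, not fewer. max_queries only exists so the
    simulation terminates.
    """
    queries = 0
    value = cache.get(key)
    while not is_valid(value) and queries < max_queries:
        queries += 1
        try:
            value = query_db(key)
        except DatabaseError:
            value = None  # the error looks like one more invalid value
        cache[key] = value
    return value, queries


def guarded_get(key, cache, query_db, is_valid, max_attempts=3):
    """Bound the retries and back off as soon as the database fails."""
    queries = 0
    value = cache.get(key)
    for _ in range(max_attempts):
        if is_valid(value):
            break
        queries += 1
        try:
            value = query_db(key)
        except DatabaseError:
            break  # stop adding load to a struggling database
        cache[key] = value
    return value, queries
```

With a database that always fails, naive_get issues queries until its cap is reached, while guarded_get gives up after a single attempt and returns the value it already had. A production design would likely also wait between attempts and alert an operator; both are omitted here for brevity.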
“The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site,” he said. “Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
Johnson said the site was back up and running, with the flawed system turned off. “We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes,” he added.