Sunday, September 26, 2010

Facebook Outage - not DNS-related, they say

Interested in why there was a 2.5-hour outage on Facebook last week?

Here's a bit of detail (and what it wasn't)...

Don't blame DNS for Facebook outage, experts say
Sys admin error appears to be the culprit, but DNS is vulnerable to similar mistakes

Network World September 25, 2010 05:47 PM ET

Experts in the inner workings of the Internet’s Domain Name System, which matches IP addresses with corresponding domain names, say the 27-year-old communications protocol does not appear to be the cause of Facebook’s high-profile outage last week.

Facebook’s service was unavailable to its 500 million active users for 2.5 hours on Thursday -- the company’s worst failure in more than four years. Initial news reports blamed the outage on DNS because end users received a “DNS error” message when they couldn’t reach the site.

"There's probably a lesson here that the problem at various times looked like DNS, but ultimately proved not to be," said Cricket Liu, vice president of architecture at Infoblox, which sells DNS appliances. "In my experience, users are quick to point fingers at DNS (perhaps because Web browsers like to implicate DNS when they can't get somewhere) but DNS often isn't at fault."

Facebook gave little detail about the cause of the outage except to say that it was the result of a misconfiguration in one of its databases, which prompted a flood of traffic from an automated system trying to fix the error.

"We made a change to a persistent copy of a configuration value that was interpreted as invalid," explained Robert Johnson in Facebook’s blog post about the incident. "This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second."

The feedback loop created so much traffic that Facebook was forced to turn off the database cluster, which meant turning off the Web site.
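To make the feedback loop concrete, here is a rough toy simulation in Python. It is only a sketch under my own assumptions, not Facebook's actual system: the names ConfigStore, is_valid, run_round and simulate, and the "numeric strings are valid" rule, are all hypothetical. Each client checks its copy of a config value and, if the copy looks invalid, queries the database for the persistent copy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigStore:
    """Hypothetical stand-in for the database cluster holding the persistent copy."""
    persistent_value: str
    queries: int = 0

    def fetch(self) -> str:
        # Every "fix" attempt costs one database query.
        self.queries += 1
        return self.persistent_value


def is_valid(value: str) -> bool:
    # Hypothetical validity rule: this toy accepts only numeric strings.
    return value.isdigit()


def run_round(store: ConfigStore, cached: List[str]) -> int:
    """Each client re-fetches its value if it looks invalid.

    Returns the number of database queries issued in this round.
    """
    before = store.queries
    for i, value in enumerate(cached):
        if not is_valid(value):
            cached[i] = store.fetch()
    return store.queries - before


def simulate(persistent_value: str, clients: int, rounds: int,
             cached_value: Optional[str] = None) -> List[int]:
    """Return the query count per round for a fleet of identical clients."""
    store = ConfigStore(persistent_value)
    start = persistent_value if cached_value is None else cached_value
    cached = [start] * clients
    return [run_round(store, cached) for _ in range(rounds)]


# Bad client copies, good persistent copy: one burst of queries, then quiet.
print(simulate("30", clients=1000, rounds=3, cached_value="oops"))
# Bad persistent copy: the "fix" never succeeds, so every client queries every round.
print(simulate("thirty", clients=1000, rounds=3))
```

In this toy, the self-healing step works fine when only the client copies are bad, but when the persistent copy itself is rejected, the query load never drops. That matches the shape of the problem Johnson describes, although the real system is certainly more involved.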

"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site," Johnson said. He added that "for now we’ve turned off the system that attempts to correct configuration values."

Here's the complete article from Network World.