Gmail’s web interface was down yesterday for 100 minutes, annoying many users and hurting the productivity of a number of companies that rely on Gmail.
You could chalk this up to the danger of relying on cloud apps… but Google is supposed to be the cream of the cloud providers. What happened? Well, according to the Google Blog:
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system “stop sending us traffic, we’re too slow!”. This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google’s architecture), distributed the traffic across the request routers, and the Gmail web interface came back online.
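To see why a few overloaded routers can take down nearly all of them, here is a toy simulation of the cascade Google describes. It is only a sketch under simple assumptions, not Google's actual routing logic: the function name, the capacities, and the load figures are all made up for illustration, and traffic is assumed to be split evenly across whichever routers are still accepting it.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a cascading overload (hypothetical, not Google's system).

    capacities: maximum load each request router can handle.
    total_load: total traffic, split evenly across healthy routers.
    Returns the number of healthy routers after each round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_router = total_load / len(healthy)
        # A router whose share exceeds its capacity stops accepting traffic.
        survivors = [c for c in healthy if c >= per_router]
        if len(survivors) == len(healthy):
            break  # stable: every remaining router can carry its share
        healthy = survivors
        history.append(len(healthy))
    return history


# Ten routers with slightly different (made-up) capacities.
routers = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(routers, 1000))              # [10]
print(simulate_cascade(routers, 1010))              # [10, 9, 7, 1, 0]
print(simulate_cascade(routers + [120] * 5, 1010))  # [15]
```

In this model a 1% increase in load is enough to knock out the weakest router, and each failure raises the share carried by the rest until none are left. Adding five more routers spreads the same load thinly enough that nothing tips over, which mirrors the fix Google describes.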
On one hand, Google didn’t understand their network well enough to predict the effects the change would have. On the other hand, Google did some things right: their monitoring software alerted them to the problem before users started calling, they were able to diagnose it quickly, and that led to a simple, direct fix that got the service back up and running. 100 minutes may seem like a long time, but measured from the first failure to the repair, it’s actually relatively short.


