What have we learned from Google's latest outage? That 99.9 percent uptime doesn't matter during the other one-tenth of one percent.
If your company's IT department guarantees 99.99% uptime for email and other server applications, they are not being realistic and, honestly, are risking their jobs by making the claim if you actually hold them accountable. If Google can truly achieve a 99.9% SLA, they are better than 99.99% of IT departments in the US.
David, what more could Google have done? They were immediately aware of the problem and posted timely updates on the status dashboard. They even posted a workaround stating that users can access their email with an IMAP or POP client such as Outlook or Thunderbird.
From an ITIL® perspective, this means Google's event management and incident management processes are very effective. Users don't want or care about mind-numbing technical details during the incident; they just want service restored as soon as possible, which is the primary objective of incident management.
A detailed explanation of the cause and a permanent resolution (i.e., service improvement) should be provided after performing a root cause analysis. Again from an ITIL® perspective, this is the outcome of the problem management process.
Your suggestion that Google needs to "add another nine" to their service level is ludicrous. Problems will always happen. Even the electrical grid fails from time to time, and although its service level might be better than 99.9%, it's still not perfect.
If an organization can't live with 0.1% downtime (an average of about 10 minutes per week), they should either plan for this contingency or provide the service internally. Personally, I think most organizations would be hard pressed to deliver a better service level themselves.
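For anyone who wants to check the arithmetic behind that 10-minute figure, here is a rough sketch in Python. It assumes a 7-day week of 10,080 minutes and that the percentage is measured over that whole week. The function name is made up for illustration and is not taken from any SLA tool.

```python
# Minutes in a 7-day week: 7 * 24 * 60 = 10,080.
MINUTES_PER_WEEK = 7 * 24 * 60


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_WEEK):
    """Return the downtime budget (in minutes) implied by an availability percentage.

    Hypothetical helper for illustration only: it assumes availability is
    measured over the whole period, with no excluded maintenance windows.
    """
    return period_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # Compare the service level discussed here with "another nine".
    for sla in (99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {minutes:.2f} minutes of downtime per week")
```

Under those assumptions, 99.9% works out to roughly 10 minutes a week, and adding another nine cuts that to about one minute.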
You and the other "prophets of doom" are blowing this incident way out of proportion. Instead of just focusing on failures and demanding near perfect service levels, we should be more concerned with how service providers respond to incidents and what steps they take to mitigate future occurrences.
Read this article for a different perspective on the Google outage.
It's a bit like airline disasters. They don't happen very often, and statistically air travel is still much safer than travel by car. But when they do happen, they affect many more people and gather a lot of column inches.
It also reminds me that way back in the cradle of IT, when I was a raw postgrad, I had to work hard to persuade the powers-that-were that if the University computer went down for two minutes when I'd just spent an hour on an unfinished problem, I lost 60 minutes' work, not two. The impact of an outage depends on what it interrupts and when. It's not just the uptime percentage.
Google's success depends at least partly on the assumption that their mega-scale infrastructure is more reliable than an enterprise can achieve for itself, because it's their primary business and they're better at it. It's that assumption that's being called into question.