RackSpace, a web hosting data center based in Dallas, had a major power outage yesterday. The outage affected GigaOm, Laughing Squid, 37 Signals, and many others. TechMeme, my favorite blog news aggregator, has lots of blogs following the story.
You might recall a similar massive power outage a few months ago at 365 Main, a major data center in San Francisco. That outage knocked out sites like CraigsList, Technorati, LiveJournal, TypePad, AdBrite, Second Life and Yelp.
You might ask: where is the backup power supply? Where are the redundant systems they told me about? Unfortunately, these power outages and unplanned downtime happen frequently. Most of these web hosting and data center companies are startups themselves. They don't have the "bulletproof," never-fail data centers that you read about.
Business users demand 24 X 7 X 365 uptime. This is why Google and Microsoft are building massive data centers costing $500 million each all over the country. There are only a few companies in the world that have the financial resources to build multiple data centers, and that have the technical skills to keep them running efficiently.
Free services and social networks can get by with some power outages and a few data losses. Users will complain...but life goes on. Not so with business users. There are huge impacts, both financial and legal, when service is interrupted or data is lost.
Software is increasingly moving towards hosted services. Microsoft is working hard to deliver "Software + Services" to give users the best of both worlds. Keeping data synchronized on clients and servers, and delivering a great user experience while on-line and off, is a big challenge.
The Dallas and San Francisco data center crashes may make people stop and think about who they want to trust with their data and services. That is why Microsoft and Google are spending billions on data centers.
What do you recommend for those of us who want exceptional availability, but don't want to invest $500m in a data center? (smile)
Bh.
Posted by: Brian P Halligan | November 14, 2007 at 08:50 AM
There are only so many ways to architect and operate a Tier1 hosting facility, and even the best of them have failures. Ironically, it's usually the fail-safe software or switching systems that fall apart when they are called upon. Even the best datacenters need to have full-system failure tests done monthly (not virtual tests of the system, but real-deal "pull-the-plug" tests) to ensure the systems are working.
As a SaaS site operator, the only way to control your destiny is to take matters into your own hands and run at least two sites on different networks. Even using the same hosting provider for both is risky, because in some cases it's the network routing backbone (BGP/OSPF) that melts down.
Google and MSFT will be able to control their destiny better but will still face the same issues and need to actively test the redundancies built into the architecture. Sometimes all it takes is water in the fuel line of a diesel genny to take down an entire DC. No one is immune.
Posted by: Ari Newman | November 14, 2007 at 12:47 PM
Ari,
Good comment.
The challenge for most ISVs is that multi-site deployments are 'hard' (look what happened to SalesForce when they implemented HA), and that there is little to no premium placed on redundant locations by your customers.
In other words, you can spend a significant amount of time and money (multiply by 3 if you are buying/running the gear on your own) on a multi-location deployment. Just don't expect your customers to be willing to foot the bill.
The real key here is how the industry can provide multi-site capability to ISVs for little to no premium.
Posted by: john rowell | November 15, 2007 at 10:43 AM
I don't think either of the hosting services you mentioned was a startup. I remember RackSpace advertising on TV more than 5 years ago, and 365 Main hosted services that predate the Web. The failure is more a statement about the reliability of statements about failure-proof reliability than about the maturity of hosting companies or massive capitalization.
For comparison, the redundant and protected air traffic control system at the Dallas/Fort Worth airport (as state of the art as the FAA gets) was taken down for several hours by a maintenance technician who dropped a screwdriver in a very bad place. This happened several years ago. Houston had similar problems due to flooding of basement "protected" systems during Hurricanes Katrina and Rita.
Posted by: Walter Lounsbery | November 19, 2007 at 10:07 AM