Top 10 Data Center Disasters that could ruin your business…
The Gaylord Opryland Hotel flooding as part of the recent flooding in Nashville got me thinking… What is the worst disaster you can imagine in your data center? I came up with 10 scenarios that will likely keep some data center operators and customers up at night just worrying about them (sorry…).
10. Flood
This one doesn’t need a whole lot of explanation – check out our post about the Opryland Hotel Flood and the other pictures of that flood…
9. Fire
We posted about fires recently – some tragic, others with no downtime. Either way, it shows that fires can happen anywhere, at any time, for any reason. Are you protected from a fire-based disaster?
8. Fiber Cut / Loss of Network
Fiber cuts are fairly rare in most areas, but what happens when a router goes haywire? Or when someone misconfigures a routing table (either by accident or on purpose)? Network issues cause a ton of outages every year. What are you doing to prevent or mitigate the damage from a network-based disaster?
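If you want a starting point for catching this kind of failure early, here is a minimal sketch of a path-diversity check: it pings a primary and a backup target and complains when either stops answering. The path names, addresses, and thresholds are made-up examples, not a recommendation for any particular monitoring stack.

```python
#!/usr/bin/env python3
"""Minimal sketch: probe primary and backup network paths and warn when
either one stops answering. Hostnames and thresholds are hypothetical."""

import subprocess

# Hypothetical probe targets, one per physically diverse path.
PATHS = {
    "primary-fiber": "192.0.2.1",      # documentation/example addresses
    "backup-carrier": "198.51.100.1",
}

def path_is_up(host: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if the host answers at least one ICMP echo.
    Uses Linux iputils ping flags; adjust for other platforms."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for name, host in PATHS.items():
        status = "UP" if path_is_up(host) else "DOWN -- check routing/fiber"
        print(f"{name:15s} {host:15s} {status}")
```

Run something like this from a box outside the data center as well as inside it; a check that only runs on the network it is checking tells you nothing when that network disappears.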
7. Power Failure
Power failures are often a symptom or result of another type of disaster, but not always. Faulty breakers, generator failures, rolling blackouts, and the like can all take an entire facility (or campus?!?) dark in an instant. I have tragic memories of using my cell phone as a light in a Sacramento data center several years ago after a faulty breaker tripped unexpectedly. Couple that with another outage the next day, when a UPS maintenance worker took down a section of the UPS to replace a battery without a MOP (method of procedure) or approval, and you have some very unhappy customers who experienced two power outages in less than 24 hours.
What are you doing to prevent that in your environment?
6. False Redundancy
This happens more often than it should. I have personal experience with a service that was turned up in a real hurry, with the redundant system sitting there waiting for a few clicks to finish the installation. Not a big deal, if anyone had noticed. Six months later, when a hard drive failed and the system went down, the real answer was discovered: a half-hearted, super-quick, undocumented install of a critical service was never completed properly, and it ended up with no monitoring, no redundancy, and barely any support. All of this happened in a company with some of the strictest and best-followed processes I have seen, all undone by someone trying to do someone else a favor and get the system up that day.
What would you do if this happened in your organization?
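Trust, but verify. Here is a minimal sketch of the kind of audit that would have caught this "false redundancy" before the hard drive failure did: it simply attempts a TCP connection to both halves of a primary/standby pair. The hostnames, port, and service are hypothetical placeholders, and a real check would go much further (replication lag, failover drills, monitoring coverage), but even this much would have flagged a standby that was never finished.

```python
#!/usr/bin/env python3
"""Minimal sketch of a redundancy audit: don't trust the install checklist,
actually exercise the standby. Hosts, port, and service are hypothetical."""

import socket

# Hypothetical primary/standby pair for a critical service.
NODES = {
    "db-primary.example.com": 5432,
    "db-standby.example.com": 5432,
}

def accepts_connections(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if the node is listening and completes a TCP handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in NODES.items():
        ok = accepts_connections(host, port)
        state = "listening" if ok else "NOT LISTENING -- the redundancy is a fiction"
        print(f"{host}:{port} -> {state}")
```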
5. Human Error
This is by far my favorite data center story. Now, there are about a billion different ways human error can cause a data center disaster… but this one takes the cake…
A sys admin, we’ll call him “Fred”, was sitting at his desk when a red light popped up on the monitor. He walked down to the lab to check it out, stopping by the Coke machine on his way. Now, Fred never brought liquids into the data center, but this time he was just going to plug in a console and look at the screen… he could do that with one hand, right? Except when he saw the error, he knew how to fix it, so he set his Mountain Dew on top of the horizontal PDU. When he bent down to reach the monitor plug on the back of the server, something hit the can and knocked it over… it spilled onto the PDU and shorted it out. That short caused a cascading failure that took down the entire environment.
“Fred” is now looking for a new job. If you are hiring, drop me a line and I will connect you…
4. Earthquake
While earthquakes only directly affect some parts of the world, the chain of events around an earthquake can cause outages worldwide. Imagine what would happen if the primary network connection from your seismically stable data center ran through One Wilshire in LA or 200 Paul in San Francisco. An earthquake in California could take down a good part of the western states, even if only momentarily. Companies hosted in earthquake zones are generally well aware of the risk, and HOPEFULLY have good plans in place to recover from another 6.5+ earthquake.
How would you protect your organization against earthquakes?
3. Hacking / Malicious Code
With a new virus strain or OS vulnerability emerging at least weekly, hacker attacks and malicious code plants are an ongoing threat to both internal back-office and Internet-accessible systems. How much of a threat? Try 4,000 documented Denial of Service (DoS) attacks each week; $10 billion spent annually in the US alone on virus and malicious code protection and removal; and 40 million users’ credit card numbers stolen in a single incident when TJ Maxx, Marshalls, and related companies were compromised. The Privacy Rights Clearinghouse estimates that 140 million personal records have been compromised since 2005. The statistics go on and on – this is no idle threat.
How safe is your environment really?
2. Data Corruption / Loss
Data corruption happens every minute of every day, around the clock. Modern storage technology compensates for and corrects most of it; however, if systems are not properly configured for storage fault tolerance, data corruption becomes an immediate risk. In a worst-case scenario, corrupted data or a failed drive can quickly lead to the loss of critical data and significant downtime, even if backups are available for recovery. And what if backups are not regularly tested and maintained? The loss of data can then go from inconvenient downtime to irrecoverable losses in revenue, customers, even business stability and credibility.
When was the last time you actually tested the RESTORE part of your “backup and restore” process?
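If the honest answer is "not recently", a scheduled restore test is worth automating. Here is a minimal sketch of what that can look like, assuming a PostgreSQL-style stack; the dump file path, scratch database name, and sanity-check table are hypothetical placeholders for your own backup tooling and checks.

```python
#!/usr/bin/env python3
"""Minimal sketch of a scheduled restore test: restore the latest backup
into a scratch database and run a basic sanity check. Paths, database name,
and the table queried are hypothetical examples."""

import subprocess
import sys

BACKUP_FILE = "/backups/latest/app_db.dump"   # hypothetical backup artifact
SCRATCH_DB = "restore_test"                   # throwaway database name

def run(cmd: list[str]) -> None:
    """Run a command and abort the test loudly if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    try:
        # Recreate the scratch database and restore the dump into it
        # (PostgreSQL tooling shown; swap in your own restore commands).
        run(["dropdb", "--if-exists", SCRATCH_DB])
        run(["createdb", SCRATCH_DB])
        run(["pg_restore", "--dbname", SCRATCH_DB, BACKUP_FILE])
        # Sanity check: the restored data should actually contain rows.
        run(["psql", "--dbname", SCRATCH_DB, "-c",
             "SELECT count(*) FROM customers;"])   # hypothetical table
        print("Restore test passed.")
    except subprocess.CalledProcessError as exc:
        print(f"Restore test FAILED: {exc}", file=sys.stderr)
        sys.exit(1)
```

A restore that has never been exercised is just a hope; a restore that runs on a schedule and pages someone when it fails is a plan.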
1. Cascading Systems Failure
I already mentioned “Fred” and his Mountain Dew-induced data center disaster, but that is a single, oversimplified example of a cascading failure. These types of failures are common in power and power distribution, but they carry over into cooling quite often as well. I have seen underqualified technicians and data center operations personnel set the chillers’ set point incorrectly, causing the chillers to freeze and shut down, which causes the data center to overheat, which leads to equipment failure…
On the power side, I see all too often that people think “open outlets” = “available power”. This not only makes my brain hurt, it causes power failures, pretty often, and the worst part is that overloaded circuits ALMOST ALWAYS cause cascading failures. If one circuit is improperly balanced, it’s a safe bet other circuits are too. When you overload one, it cascades. Pretty simple math…
How many of your circuits are improperly balanced?
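For the curious, here is that “pretty simple math” as a sketch: sum the draw on each circuit and flag anything over 80% of the breaker rating, the usual rule of thumb for continuous loads. Every number below (breaker size, device draws, circuit names) is a made-up example.

```python
#!/usr/bin/env python3
"""Minimal sketch of circuit-loading math: total the amps on each circuit
and flag anything over 80% of the breaker rating. All figures are examples."""

BREAKER_AMPS = 20        # hypothetical 20 A branch circuits
SAFE_FRACTION = 0.80     # don't load a breaker past 80% continuously

# Hypothetical per-circuit device draws in amps.
circuits = {
    "PDU-A circuit 1": [3.2, 2.8, 4.1, 3.9],
    "PDU-A circuit 2": [6.0, 6.5, 5.5],        # overloaded on purpose
    "PDU-B circuit 1": [3.1, 2.9],
}

limit = BREAKER_AMPS * SAFE_FRACTION
for name, loads in circuits.items():
    total = sum(loads)
    flag = "OVERLOADED" if total > limit else "ok"
    print(f"{name}: {total:.1f} A of {limit:.1f} A allowed -- {flag}")
```

An open outlet on a circuit that is already at 16 of its 20 amps is not “available power”; it is the first domino.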
Do you have any anecdotes of disasters you’ve seen or heard about? Let us know! I’ve asked a few questions above, and I would love to hear your responses and other thoughts in the comments!