A few words on uptime
[Jan. 16th, 2005|11:31 am]
It sure sounds good to have 97% availability, until you realize that means you are allowed a week and a half of downtime per year. "Critical" infrastructure is supposed to have "five 9's" reliability - a metric of 99.999%, which translates to a bit over 5 minutes of downtime per year. When AT&T designed its first widely deployed electronic switch, the development team intended less than one day of outage per 40 years. Granted, no one kept a 1ESS in service for 40 years to find out if that was true, but they hummed along nonetheless.
You maintain availability through redundancy - by having multiple physical methods of arriving at the same end result. For example, to protect against power loss to your data center, perhaps you would put in a battery-backup UPS, which would protect against modest (< 30 min) power failures, transients, and surges. If you want more reliability (i.e., you need to be able to last longer than 30 minutes without power), then add a generator and auto-transfer switch to your data center. Now when power fails, the UPS powers the load until the generator comes online (well within 5 minutes), then the generator powers the UPS just as commercial power would have. Now, however, we come to a problem.
The total availability of components in series is calculated by multiplying their individual reliability ratings together. For example, with our power model above, you would multiply the reliability ratings of the generator, transfer switch, and UPS. Say that the generator has a reliability rating of 98%, the transfer switch 99.999%, and the UPS 99.99%. It seems that the total availability should be 98%, the rating of the least reliable component. Well, let's check; we multiply these figures together and get... 97.98%! Wait a second! That's less than each of the individual ratings! How can we get higher reliability instead?
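The series arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `series_availability` is made up for illustration, and the model assumes the components fail independently of one another.

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of a chain where every component must work.

    Assumes independent failures, so the individual availabilities
    simply multiply.
    """
    return prod(components)


# Ratings from the example above.
generator = 0.98
transfer_switch = 0.99999
ups = 0.9999

generator_leg = series_availability(generator, transfer_switch, ups)

# The chain comes out slightly worse than its weakest link.
print(f"generator leg: {generator_leg:.4%}")
print("worse than the weakest link:", generator_leg < min(generator, transfer_switch, ups))
```

Every factor is below 1, so each component you add in series can only pull the total down.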
Parallel systems are the way to go. Take commercial power into consideration (also commonly known as bypass power), and say it has a reliability of 96%. Calculate the total availability of this leg the same way as above, except replace the generator availability with the commercial power availability. So, .96 * .9999 * .99999 = .9598. To calculate the total availability of two legs in parallel, add the two availabilities together (since either commercial power or generator power could be running this mythical data center), then subtract the product of the two numbers. This assumes the two legs fail independently of each other. So, (.9798 + .9598) - (.9798 * .9598) = .9992, or a total downtime of about 7 hours per year, which is pretty good.
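Here is the same parallel calculation as a small Python sketch, along with the conversion from availability to downtime per year. The helper names (`series`, `parallel`, `downtime_hours_per_year`) are invented for this example. It follows the simplified model used here: the two legs are treated as completely separate and as failing independently, which is optimistic for a real building where the UPS and transfer switch are shared.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def series(*parts: float) -> float:
    """All parts must work: multiply the availabilities."""
    return prod(parts)


def parallel(*legs: float) -> float:
    """At least one leg must work: 1 minus the chance that all legs fail.

    For two legs this equals a + b - a * b, the add-then-subtract rule.
    """
    return 1 - prod(1 - leg for leg in legs)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Ratings from the example: UPS 99.99%, transfer switch 99.999%.
generator_leg = series(0.98, 0.99999, 0.9999)
commercial_leg = series(0.96, 0.9999, 0.99999)
total = parallel(generator_leg, commercial_leg)

print(f"generator leg:  {generator_leg:.4f}")
print(f"commercial leg: {commercial_leg:.4f}")
print(f"both together:  {total:.4f}")
print(f"downtime: {downtime_hours_per_year(total):.1f} hours/year")

# The figures from the top of the post, for comparison.
print(f"97% uptime:  {downtime_hours_per_year(0.97) / 24:.1f} days/year")
print(f"five 9's:    {downtime_hours_per_year(0.99999) * 60:.1f} minutes/year")
```

With unrounded inputs the combined downtime comes out to a little over 7 hours per year. Writing `parallel` as "one minus the product of the failure odds" also extends to three or more legs, where the add-then-subtract form gets unwieldy.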
On the third hand, over-parallelization can start to make things unreliable. Say you have two connections to commercial power, one each to different substations served by different long-distance transmission lines. You also have two separate generators and UPS systems designed to run the two separate power buses in the building for a month or so. You have everything interconnected with everything else to reroute any kind of power to any outlet in the building. Then some idiot comes along and pops the EPO (emergency power off) switch on the second floor.
I could go on and on about unintended consequences and the weird ways things break, but I won't, at least not much more. Okay, so all the great plans failed and the power went out to everything at the same time. People are paged. Curses are heard. Everyone drives to the data center and stands around in the dark. Then the power comes back on. All the equipment comes up happily, but not in the correct order. The servers are up before the network. The tape array can't see the clusters. The databases are unmounted. The RAID cards may not have remembered the last things they were supposed to write to the drives. Important People are calling, demanding to know when it will be up again. Gah.