Whizistic's Lair

A few words on uptime [Jan. 16th, 2005|11:31 am]
It sure sounds good to have 97% availability, until you realize that means you are allowed a week and a half of downtime per year. "Critical" infrastructure is supposed to have "five 9's" reliability: 99.999%, which translates to a bit over 5 minutes of downtime per year. When AT&T designed the 1ESS, its first widely deployed electronic switch, the development team intended less than one day of outage per 40 years. Granted, no one kept a 1ESS in service for 40 years to find out if that was true, but they hummed along nonetheless.

You maintain availability through redundancy: by having multiple physical methods of arriving at the same end result. For example, to protect against power loss to your data center, perhaps you would put in a battery-backup UPS, which would protect against modest (< 30 min) power failures, transients, and surges. If you want more reliability (i.e., you need to be able to last longer than 30 minutes without power), then add a generator and auto-transfer switch to your data center. Now when power fails, the UPS powers the load until the generator comes online (well within 5 minutes), then the generator powers the UPS just as commercial power would have. Now, however, we come to a problem.
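Before getting to that problem, the downtime figures above are easy to check. Here is a minimal sketch; the function name and the 365-day year are my own assumptions for illustration, not anything from a standard.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a plain 365-day year, no leap days


def downtime_hours_per_year(availability):
    """Hours per year a system may be down at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


# 97% availability, expressed in days of downtime per year
days_at_97 = downtime_hours_per_year(0.97) / 24

# "five 9's" availability, expressed in minutes of downtime per year
minutes_at_five_nines = downtime_hours_per_year(0.99999) * 60

print(f"97%:     {days_at_97:.2f} days/year")
print(f"99.999%: {minutes_at_five_nines:.2f} minutes/year")
```

The first figure comes out to roughly eleven days and the second to a bit over five minutes, which matches the numbers quoted above.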

Total availability is calculated by multiplying together the reliability of each component in series. For example, with our power model above, you would multiply the reliability ratings of the generator, transfer switch, and UPS. Say that the generator has a reliability rating of 98%, the transfer switch 99.999%, and the UPS 99.99%. It seems that the total availability should be 98%, the rating of the least reliable component. Well, let's check: we multiply these figures together and get... about 97.99%! Wait a second! That's less than each of the individual ratings! How can we instead get higher reliability?
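The series calculation can be sketched in a few lines. The helper name is hypothetical, and the ratings are just the example numbers from the paragraph above.

```python
import math


def series_availability(*components):
    """Availability of components in series: every one of them must be up."""
    return math.prod(components)  # math.prod needs Python 3.8+


generator = 0.98
transfer_switch = 0.99999
ups = 0.9999

total = series_availability(generator, transfer_switch, ups)
print(f"series availability: {total:.4%}")

# The chain is always a little worse than its weakest link.
print(total < min(generator, transfer_switch, ups))
```

Every extra component in the chain can only pull the product down, which is why the total lands just under the generator's 98%.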

Parallel systems are the way to go. Take into consideration commercial power (also commonly known as bypass power), and say it has a reliability of 96%. Calculate the total availability of this leg the same way as above, except replace the generator availability with commercial power availability. So, .96 * .9999 * .99999 = .9599. To calculate parallel total availability, add the availabilities of the two legs together (since either commercial power or generator power could be running this mythical data center), then subtract the product of the two numbers. So, (.9799 + .9599) - (.9799 * .9599) = .9992, or a total downtime of about 7 hours per year, which is pretty good.
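The parallel formula above can be sketched the same way. The function and variable names are made up for illustration, and the code assumes the two legs fail independently of each other.

```python
def parallel_availability(a, b):
    """Availability when either of two independent legs is enough to carry the load."""
    return a + b - a * b  # algebraically the same as 1 - (1 - a) * (1 - b)


# Each leg is a series chain: power source, transfer switch, UPS.
generator_leg = 0.98 * 0.99999 * 0.9999
commercial_leg = 0.96 * 0.99999 * 0.9999

combined = parallel_availability(generator_leg, commercial_leg)
downtime_hours = (1.0 - combined) * 365 * 24  # assumption: 365-day year

print(f"generator leg:  {generator_leg:.4f}")
print(f"commercial leg: {commercial_leg:.4f}")
print(f"combined:       {combined:.4f}")
print(f"downtime:       {downtime_hours:.1f} hours/year")
```

Carrying full precision through gives a shade over 7 hours per year. Treating the legs as fully independent is a simplification: in the setup described, both legs run through the same transfer switch and UPS, so counting those once, in series after the two power sources, gives a slightly lower figure of about .9991.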

On the third hand, over-parallelization can start to make things unreliable. Say you have two connections to commercial power, one each to different substations served by different long distance transmission lines. You also have two separate generators and UPS systems designed to run the two separate power busses in the building for a month or so. You have everything interconnected with everything else to reroute any kind of power to any outlet in the building. Then some idiot comes along and pops the EPO (emergency power off) button on the second floor.

I could go on and on about unintended consequences and the weird ways things break, but I won't, at least not much more. Okay, so all the great plans failed and the power went out to everything at the same time. People are paged. Curses are heard. Everyone drives to the data center and stands around in the dark. Then the power comes back on. All the equipment comes up happily, but not in the correct order. The servers are up before the network. The tape array can't see the clusters. The databases are unmounted. The RAID cards may not have remembered the last things they were supposed to write to the drives. Important People are calling, demanding to know when it will be up again. Gah.


From: lawrencebacchus
2005-01-20 10:43 pm (UTC)
reading this reminded me that a few days ago, a good portion of the hilltop area was out of power because some dolt ran his truck into a main transformer in front of ben franklins, killing power to all the stoplights and stores from victor to the river. when I saw the streetlights and everything go dead, I immediately knew "oh, somebody just got juiced." I was going into winco for a late night cheesecake...and all the checkers had had to start over when the power went to backup - they had lost their transactions in progress.
From: whizistic
2005-01-23 09:32 am (UTC)
heh, I'm impressed they had backup power. I'm waiting to be stuck in an elevator on campus when the power goes out and see how that really goes. Every building has a generator of some sort, some scarier than others. Every era and manufacturer under the sun -- cat, onan, honda, kohler, volvo. Most of them are propane powered from a standard residential propane tank right next to it. They are all hidden behind hedges. weird.
From: lawrencebacchus
2005-01-24 11:22 am (UTC)
all pretty reliable brands...though I wasn't aware they made those in the lpg burning variety...I suppose it wouldn't be too much to convert from diesel to lpg.
From: whizistic
2005-01-24 03:57 pm (UTC)
http://www.onan.com/pdf/standby/S-1381.pdf is one I've seen,
http://www.cumminspower.com/Commercial1/SparkIgnited/S-1327.pdf is another. Seem to use Ford motors in them, adapted from gas to either lpg or natural gas (or both; found one that has a feed from each -- main library).