Simon says stop
[Dec. 19th, 2005|08:46 pm]
In any business, metrics are used to determine the success of each business unit relative to the previous quarter and/or previous year. The executive abstract of such a report tends to grade each unit using the same discipline code used in kindergarten: green, yellow, or red, indicating a passing grade for the metric, a warning grade, or a failing grade, respectively.
Today was a very red day for the network operations department. Murphy was with us all the way.
Perhaps an ordered list, or timeline, would work well here.
We shall see how tomorrow turns out.
- Major database work was done on our production enterprise application software (if you're in the business, think PeopleSoft or SAP equivalent) on Saturday. A complete backup exists from both before and after the work, and change controls were performed correctly. However, it effectively created a setpoint, which complicated things later. I was happily shopping in SF, and not in any way affiliated with the above.
- Early Sunday morning, the primary air conditioner for the data center fails. The temperature sensor on the secondary broke last week, and hasn't been fixed yet, so it didn't automatically turn on when the temp hit 80 degrees F. Temperatures continue to rise. No one notices.
- Sunday morning, Finance begins work on payroll, using the new database schema. They go home around 3pm, with payroll complete; just need to audit and print checks on Thursday!
- Sunday night, a drive fails on the external RAID which houses the production database. The data center has been at near 100 degrees for about 18 hours at this point.
- The alarm triggers on the RAID, sending emails to places which are only monitored during working hours. For some reason (cough, *HEAT* cough), the RAID did not automatically begin rebuilding to one of the hot spare drives. The production server attached to the RAID shits itself over a couple thousand SCSI bus resets, and goes catatonic.
- Other monitoring software notices production is down, and sends emails to the emergency pager. The duty network tech groggily looks at it and decides it can wait till morning; he'll just leave a bit earlier to get to the office before everyone.
- I sleep in, and get a late start to work.
- Duty Network Tech gets stuck in horrendously bad traffic due to an accident on I-80; figures I'll get there first, and contacts me.
- I receive his communique and begin speeding.
- Tech Who Isn't Even Supposed To Be In Today, But Religiously Checks His Email at Oh-Dark-Thirty arrives in the data center. Immediately props the secure doors open and manually activates the secondary air conditioner. Things begin to cool.
- The alarm from the RAID box continues to beep, but no fault lights are on the RAID.
- The production server is discovered to be catatonic and is restarted. Upon restart, its RAID realizes a drive is down and begins automatically rebuilding to a hot spare.
- ...insert an hour or so of realizations that the production db is hosed...
- ...insert a few hours dealing with Adaptec support, asking pointed questions akin to the final scenes of Revenge of the Sith: "How could you do this? We trusted you! arrrrgh!!!" etc...
- ...insert moment of clarity where focus is changed from recovering a corrupt db to restoring a known good db...
- ...insert another hour or three blowing away the array and recreating from backup...
- ...insert the invocation of many magical stored procedures and rights recreations that I had just finished not a month ago...woot for proactive disaster mitigation, eh?! eh.
- I draw the short straw: I get to be the non-exhausted man who comes in early tomorrow to find out how things went. I go home.
- ? ? ?
- Either a) I arrive at work, EARLY, tomorrow, and discover all is swell; or b) I find three smelly netops guys surrounding a console, squinting at the morning light, with a horde of angry Finance people coming for the door with pitchforks and torches.