The flying excrement made physical contact with an air current distribution device
[Dec. 21st, 2005|10:09 pm]
To update the unresolved drama of Monday, the pitchforks were narrowly averted:
From: [IS Director]
Subject: [production availability]
Creation Date: 12/20/2005 4:59 AM
Production will be online this morning as soon as [senior dba] finishes
hotpacks. [senior dba] is in charge of the restoration at this point;
he'll let everyone know the moment you can be online. More to come when
I'm coherent again.
Other emails later included the words "Bill, Excellent work. Thank you." and "Perfect." in regard to my... unique writing style.
Sadly, my open items list has quadrupled. Happily, I have 5k in spending authorization to make it a bit easier next time.
Here's my wrapup email:
After getting off the phone with the nicest and most competent RAID tech I've ever worked with at Adaptec, here is a story that fits the facts as they are known and is mostly believable.
The affected array contained 11 drives: 9 as part of a standard RAID-5, plus two hot spares. At first, this array was populated only with Maxtor 18.2 GB 10,000 RPM SCSI-160 drives. Over the years, drives have failed, and the hot spare has always taken over correctly. The failed drives were replaced, sometimes with drives that were not precisely identical.
Immediately before the incident, there was one drive which was not the same as the others. It was a Maxtor 36 GB 15,000 RPM SCSI-320 drive, part of the 9-drive RAID-5 set. This is the drive that "failed" during the heat spike caused by the loss of the rooftop air conditioner. The firmware of the RAID card was the first revision released, a long long time ago.
The reason this particular drive failed is threefold:
1) The 15,000 RPM drives run hotter. Under normal operating conditions (temp < 90 degrees F), the DuraStor 312R chassis is capable of cooling a full load of twelve 15,000 RPM drives; but conditions in the data center were most certainly not "normal operating conditions" on the night in question.
2) The drive has a temperature diode which spins down the platters when the temperature limit is reached. It is not known whether the earlier drives also have this feature. Attempts to access the drive whilst in this state result in a "SCSI device busy; retry later" type of response (understand that I'm paraphrasing the exact SCSI status codes).
3) When the RAID card tries to write to the array and sends data to that spun-down drive, the card will wait for the "data saved successfully" signal. If it is not received within a certain time frame, the write request is repeated. If the repeated requests are also ignored, the drive will be failed by the RAID controller.
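The three points above can be sketched as a toy model. Everything here is an illustrative assumption on my part (the timeout, the retry count, the state names); it is not Adaptec's actual firmware logic, just the shape of the behavior described:

```python
# Toy model of the retry-then-fail behavior described above.
# States and retry count are assumptions for illustration only.

BUSY = "busy"  # spun-down drive answering "device busy; retry later"
OK = "ok"      # drive acks with "data saved successfully"

def write_with_retries(drive_state, max_retries=3):
    """Return 'written' if the drive acks, or 'failed' after the
    controller gives up repeating the write request."""
    for _attempt in range(max_retries + 1):
        if drive_state == OK:
            return "written"  # ack received within the timeout window
        # no ack: the write request is repeated on the next iteration
    # repeated requests ignored: controller marks the drive failed
    return "failed"

print(write_with_retries(OK))    # a healthy drive completes the write
print(write_with_retries(BUSY))  # a spun-down drive gets failed
```

The nasty part, per the tech, is that the controller cannot tell this "soft" thermal busy state apart from a genuinely dead drive.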
Once the drive has been failed, the RAID card writes its failed status across all of the members of the array. Either due to the specific "temporary failure" status, or because of the age of the firmware (this behavior has been modified since), the RAID card did not request the hot spare to be activated. At some point in here, something bad happened: we got data corruption and a catatonic server, which may or may not be a direct result of the spun-down drive. Possible reasons for the logical corruption boil down to:
1) Drives not all identical
2) Temperature caused it on the server end, and the server wrote the corrupted data separately from the drive problem.
3) Firmware out of date
4) Data has always been corrupted. (ha!)
5) Something else.
After the system was restarted by [tech] that morning, the only information the RAID card had to go on was that it had marked that drive failed in the RAID's status table (mirrored on all drives), not whether said drive had a "hard" or "soft" error, so the RAID controller started a rebuild onto a hot spare.
We have since updated the firmware to the latest version on [server]. Adaptec feels that the aberrant behavior of the RAID array was a direct result of the heat spike in the data center, having an old firmware revision, and running drives of a different type in the same RAID set. Ergo, Adaptec further recommends only using drives of the same era in an array.
It is currently unlikely that permanent damage has occurred, since our testing has turned up no credible errors thus far. We have tested both drives which were claimed to have "failed" using the Maxtor SCSIMax software numerous times; each time, the drives pass successfully. [Tech] and I will beat on the thing some more, but we'll likely recertify it for use in a week or so. Oh, and if we reuse this box for something else, consider limiting the number of physical drives in the array: a RAID-5 logical drive fails if two physical disks fail, and the more drives you have, the more likely it is that both will fail within the rebuild window, since a single drive takes longer to rebuild the more drives there are in the set. I could prolly explain that better with graphs :)
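Lacking graphs, here's a back-of-the-envelope version of that last point. The per-drive failure rate and rebuild time below are numbers I made up purely for illustration; the only claim is the shape of the result, that the chance of a second failure during the rebuild grows on both axes (more surviving drives that can fail, and a longer rebuild window):

```python
# Toy model: chance a second drive dies during a RAID-5 rebuild.
# p_hour and rebuild_hours_per_drive are made-up illustrative values,
# not measured failure rates.

def p_second_failure(n_drives, p_hour=0.0001, rebuild_hours_per_drive=2.0):
    """After one drive in an n-drive RAID-5 fails, return the probability
    that at least one of the n-1 survivors fails during the rebuild."""
    # wider sets take longer to rebuild a single replacement drive
    rebuild_hours = rebuild_hours_per_drive * n_drives
    survivors = n_drives - 1
    # P(at least one fails) = 1 - P(every survivor lasts every hour)
    return 1.0 - (1.0 - p_hour) ** (survivors * rebuild_hours)

for n in (5, 9, 12):
    print(f"{n:>2} drives: {p_second_failure(n):.4f}")
```

Under these toy numbers the risk at 12 drives comes out several times the risk at 5, which is the whole argument for keeping arrays narrow.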