The flying excrement made physical contact with an air current distribution device - Whizistic's Lair — LiveJournal


The flying excrement made physical contact with an air current distribution device [Dec. 21st, 2005|10:09 pm]
To update the unresolved drama of Monday, the pitchforks were narrowly averted:

From:           [IS Director]
To:             [everyone]
Subject:        [production availability]
Creation Date:  12/20/2005 4:59 AM


Production will be online this morning as soon as [senior dba] finishes 
hotpacks. [senior dba] is in charge of the restoration at this point; 
he'll let everyone know the moment you can be online. More to come when 
I'm coherent again.

Other emails later included the words "Bill, Excellent work. Thank you." and "Perfect." regarding my... unique writing style.

Sadly, my open items list has quadrupled. Happily, I have 5k in spending authorization to make it a bit easier next time.

Here's my wrapup email:

After getting off the phone with the nicest and most competent RAID tech I've ever worked with at Adaptec, here is a story that fits the facts as they are known and is mostly believable.

The affected array contained 11 drives: 9 as part of a standard RAID-5 set, plus two hot spares. At first, this array was populated only with Maxtor 18.2 GB 10,000 RPM SCSI-160 drives. Over the years, drives have failed, and the hot spare has always taken over correctly. The failed drives were replaced, sometimes with drives that are not precisely identical.
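For concreteness, the geometry above works out like this — a minimal Python sketch (the function name is mine, not anything from Adaptec's tooling). Usable RAID-5 capacity is one drive's worth less than the member count, sized by the smallest member; hot spares contribute nothing until activated:

```python
def raid5_usable_gb(member_sizes_gb):
    """Usable capacity of a RAID-5 set, in GB: (n - 1) x the smallest
    member, since one drive's worth of space is consumed by parity."""
    n = len(member_sizes_gb)
    assert n >= 3, "RAID-5 needs at least three members"
    return (n - 1) * min(member_sizes_gb)

# The array described above: nine 18.2 GB members (the two spares don't count).
print(raid5_usable_gb([18.2] * 9))  # roughly 145.6 GB usable
```

Note this is also why mixing in the 36 GB drive bought nothing: the set is still sized by its smallest (18.2 GB) member.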

Immediately before the incident, there was one drive which was not the same as the others. It was a Maxtor 36 GB 15,000 RPM SCSI-320 drive, and was part of the 9-drive RAID-5 set. This is the drive that "failed" during the heat spike caused by the loss of the rooftop air conditioner. The firmware of the RAID card was the first revision released, a long long time ago.

The reason this particular drive failed is threefold:

1) The 15,000 RPM drives run hotter. Under normal operating conditions (temp < 90 degrees F), the DuraStor 312R chassis is capable of cooling a full load of twelve 15,000 RPM drives; but conditions in the data center were most certainly not "normal operating conditions" on the night in question.

2) The drive has a temperature diode which spins down the platters when the temperature limit is reached. It is not known if the earlier drives also have this feature. Attempts to access the drive whilst in this state result in a SCSI device busy; retry later type of response (understand that I'm paraphrasing the exact SCSI commands).

3) When the RAID card tries to write to the array, and sends data to that spun down drive, the card will wait for the "data saved successfully" signal. If it is not received within a certain time frame, the write request is repeated. If the repeated request is ignored repeatedly, the drive will be failed by the RAID controller.
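The retry-then-fail behavior in (2) and (3) can be sketched roughly as follows. This is a hypothetical model with assumed names and an assumed retry budget — the real Adaptec firmware logic is not public — but it shows how a merely spun-down drive ends up marked failed:

```python
# Assumed retry budget; the real firmware value is unknown.
MAX_RETRIES = 3

def attempt_write(drive_is_busy, retries=MAX_RETRIES):
    """Model of the controller's write path: return 'ok' if the drive
    acknowledges within the retry budget, else 'failed' (the controller
    drops the drive from the array).

    drive_is_busy: callable returning True while the drive reports
    SCSI BUSY (e.g. platters spun down by the thermal diode)."""
    for _ in range(retries + 1):          # first try plus `retries` repeats
        if not drive_is_busy():
            return "ok"                   # "data saved successfully"
    return "failed"                       # retry budget exhausted

# A drive that stays spun down for the whole retry window gets failed:
print(attempt_write(lambda: True))            # -> failed
# A drive that wakes up before the budget runs out survives:
wakeups = iter([True, True, False])
print(attempt_write(lambda: next(wakeups)))   # -> ok
```

The key point is that the controller can't distinguish "thermally spun down, will recover" from "dead": both just look like a drive that won't acknowledge writes.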

Once the drive has been failed, the RAID card writes its failed status across all of the members of the array. Either due to the specific "temporary failure" status, or because of the age of the firmware (this behavior has been modified since), the RAID card did not request that the hot spare be activated. At some point in here, something bad happened and we had data corruption and a catatonic server, which may or may not be a direct result of the spun-down drive. Possible reasons for the logical corruption boil down to:

1) Drives not all identical
2) Temperature caused it on the server end, and the server wrote the corrupted data separately from the drive problem.
3) Firmware out of date
4) Data has always been corrupted. (ha!)
5) Something else.

After the system was restarted by [tech] that morning, the only information the RAID card had to go on was that it had marked that drive failed in the RAID's status table (mirrored on all drives) — not whether the drive had had a "hard" or "soft" error — so the RAID controller started a rebuild onto a hot spare.

We have since updated the firmware to the latest version on [server]. Adaptec feels that the aberrant behavior of the RAID array was a direct result of the heat spike in the data center, an old firmware revision, and running drives of a different type in the same RAID set. Ergo, Adaptec further recommends using only drives of the same era in an array.

It is currently unlikely that permanent damage has occurred, since there have been no credible errors thus far in our testing. We have tested both drives which were claimed to have "failed" using the Maxtor SCSIMax software numerous times. Each time, the drives pass successfully. [Tech] and I will beat on the thing some more, but we'll likely recertify it for use in a week or so. Oh, and if we reuse this box for something else, consider limiting the number of physical drives in the array: a RAID-5 logical drive fails if two physical disks fail, and the more drives you have, the more likely it is that both will fail within the rebuild window, since rebuilding a single drive takes longer the more drives there are in the set. I could prolly explain that better with graphs :)
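In lieu of graphs, here's a toy model of that last point. The failure rate and rebuild-time figures below are made-up illustrative numbers (my own back-of-envelope, not Adaptec's math): assume a constant per-drive hourly failure rate and a rebuild time that grows with the member count, and compute the chance that a second drive dies while the first is rebuilding:

```python
import math

def p_second_failure(n_drives, per_drive_hourly_rate=1e-5,
                     rebuild_hours_per_drive=1.5):
    """Probability that at least one of the n-1 surviving drives fails
    during the rebuild window, under a simple exponential failure model.
    Both default parameters are illustrative assumptions."""
    rebuild_hours = rebuild_hours_per_drive * n_drives
    p_one_survives = math.exp(-per_drive_hourly_rate * rebuild_hours)
    return 1 - p_one_survives ** (n_drives - 1)

# Risk grows faster than linearly with set width: more survivors that can
# die, AND a longer rebuild window for each of them to die in.
for n in (5, 9, 13):
    print(n, p_second_failure(n))
```

Whatever numbers you plug in, the shape is the same: widening the set compounds the risk on both axes, which is the argument for capping the drive count per RAID-5 logical drive.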

From: jgp
2005-12-22 09:33 am (UTC)
Yeah. Please don't mention the fact that you're an intern and have a spending authorization too loudly... my intern might overhear it and want one too. =/

I almost chopped off the hand of one of my team members who was trying to swap a non-identical drive into one of my JBODs in the office, for reasons #1 and #5 above. I know it'll "work", but it still bothers me, especially when I have spares for every type of drive in the lab in a cabinet. Hell, I yelled at HP and made them re-send a replacement drive when the one they shipped wasn't the same (I'd given them the part numbers for some 146GB Ultra320 10k drives, and they sent me 15k's back).

We tend not to run hot-spare drives in anything (except the SQL clusters and my source code server), but we've also got a ready supply on hand to hot-swap drives as soon as one lights up red. Of course, that's always a joke too, as the last HP tech I talked to said "If it lights up red it's bad, send it back" and the more current tech said "That only means it's bad about 50% of the time, and with SmartStart (which we don't load) you can use xxx utility to determine what's actually bad". I didn't want to argue with him that if the cable going to the array was bad, I'd expect a.) more drives to be failing, or not working, and b.) the physical "bad" indicator on the drive itself wouldn't be lighting up.

Glad to hear that the mess is cleaned up, tho... hopefully you don't have to wear your pager over the holidays? =)