March 8, 2006

Rescuing hardware RAID (again)

I was asked to perform hardware RAID heroics again tonight. In fact, I’ve been on site about eight hours now (it’s 0630) and I’m still working at it, but I’ve basically got the problem licked.

I put a broken SCSI RAID 5 (six disks, one failed) back together. The general process is:

  1. Get twice as much storage as the size of the array you want to restore.
  2. Boot your friendly Knoppix disc, or other Linux distribution as you see fit. Make sure you can see the array you’re restoring (in JBOD, of course) and the additional storage you’re going to use for recovery.
  3. dd each disk from the broken RAID array to a file on the additional storage (there’s a sketch of this after the list). Now you’ve got a backup to work from. You’d hate to screw up the original devices.
  4. Disconnect the original devices. You can work with the “images” you made of them. You’d hate to screw up the original devices.
  5. Pick apart the disks with something like BIEW. I couldn’t get mine to work with large files, so I had to, uh, “bind” the data files to loop devices (there’s a losetup sketch after the list). You’re looking for some data that spans a block boundary so you can tell what parity algorithm the array is using. Choices are: left symmetric, left asymmetric, right symmetric, right asymmetric. You’ll also need to discover the block size (the “chunk size,” in raidtab terms). Both arrays I’ve done this with have used a 64KiB block size. Hopefully you can make a good guess at the order the controller put the drives in; for my SCSI RAID this was in order of their IDs, just like I guessed.
  6. Now you’ve likely got a problem. Your logical RAID volume has a partition table and (duh) partitions. We’re using Linux software RAID (md) to stitch your disks back together, and md only semi-recently (kernel 2.6) gained the ability to have partitions on a software RAID device. So, you’re going to either need a version of mdadm that lets you use --build with --level=raid5 (which, to my knowledge, doesn’t exist), or else you’ll apparently need to patch raidtools. I don’t have a patch for you; just go into the source and comment out any section that bitches “not an MD device!” when major(stat_buf.st_rdev) != MD_MAJOR (a grep to find the spots is below the list). Miraculously, once you stop mkraid from looking at the device major, it works just fine with partitioned MD devices.
  7. cat /proc/devices, look for the major number assigned to mdp. If you don’t find it, I gather your kernel lacks partitions-on-md-devices support. You might need to load some module; this was already done for me on Knoppix, and the number assigned was 254.
  8. mknod /dev/md0 b 254 0 substituting 254 for whatever number you got in the previous step.
  9. Make the individual partition devices by counting up minors from there. I.e., md0p1 is mknod /dev/md0p1 b 254 1. I think you can go up to 63 partitions. (Steps 7-9 boil down to a few lines of shell, sketched after the list.)
  10. Now you can make an /etc/raidtab (there’s an example below). Mine was pretty standard, and a lot like the one in the original awesome post of the guy who gave me this idea in the first place, except this particular night I had a failed-disk. (Don’t worry about what device to supply for the failed-disk; I gave it /dev/missing. mkraid bitches, but it still starts the array.)
  11. Now you can try something like mkraid --force --dangerous-no-resync /dev/md0. (Since I had a failed disk, I don’t think I actually needed --dangerous-no-resync.)
  12. Now take the other half of your additional storage (wondering why I said you need twice the size of the RAID array you’re restoring, weren’t you?), format it, mount md0p1 or whatever, and copy the data over. In my case it was a Windows server, so I made a FAT32 partition (mkdosfs -F 32 /dev/hdf2) and then copied all the data over to that. (Steps 11 and 12 are sketched below as well.)
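
A sketch of step 3’s imaging pass. Every device name and mount point here is a stand-in (I’m pretending the images live on /dev/hdf1, the first half of the spare IDE disk, and that the five surviving members show up as sda through sde); substitute your own.

    mkdir -p /mnt/recovery
    mount /dev/hdf1 /mnt/recovery          # scratch half of the IDE disk (already formatted)
    for d in sda sdb sdc sdd sde; do       # the five surviving members
        # noerror,sync keeps dd going past read errors and pads so offsets stay aligned
        dd if=/dev/$d of=/mnt/recovery/$d.img bs=64k conv=noerror,sync
    done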
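
The loop-device trick from step 5, spelled out. The loop numbers and image paths just follow the made-up names from the previous sketch, nothing canonical about them.

    i=0
    for d in sda sdb sdc sdd sde; do
        losetup /dev/loop$i /mnt/recovery/$d.img
        i=$((i+1))
    done
    biew /dev/loop0        # go hunting for text that runs across a 64KiB boundary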
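
Finding the checks to gut in step 6 is mostly a grep job. The source layout and file names differ between raidtools versions, so this just points you at the strings from the error message and the condition above:

    # Locate the checks mkraid trips over (paths depend on your raidtools tarball)
    grep -rn 'not an MD device' raidtools-*/
    grep -rn 'major(stat_buf.st_rdev)' raidtools-*/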
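
Steps 7 through 9 really are just a few lines of shell. The 254 is whatever my Knoppix session handed out; use the number your own /proc/devices reports for mdp.

    grep mdp /proc/devices          # prints something like "254 mdp"
    rm -f /dev/md0                  # in case your distribution already made an md0 node with the plain md major
    mknod /dev/md0 b 254 0          # the whole (partitioned) array
    for p in 1 2 3 4; do            # one node per partition you expect to need
        mknod /dev/md0p$p b 254 $p
    done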
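
Step 10’s raidtab, roughly. This is a from-memory sketch rather than a copy of the real file: the loop devices, the parity algorithm, and which slot gets the failed-disk are all stand-ins for whatever your detective work in step 5 turned up.

    # /etc/raidtab -- six-member RAID 5 with one dead member
    raiddev                   /dev/md0
        raid-level            5
        nr-raid-disks         6
        nr-spare-disks        0
        persistent-superblock 1
        parity-algorithm      left-symmetric
        chunk-size            64
        device                /dev/loop0
        raid-disk             0
        device                /dev/loop1
        raid-disk             1
        device                /dev/loop2
        raid-disk             2
        device                /dev/loop3
        raid-disk             3
        device                /dev/loop4
        raid-disk             4
        device                /dev/missing
        failed-disk           5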
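
And steps 11 and 12, with the same disclaimer about device and mount point names. Mounting the rebuilt NTFS volume read-only costs nothing and keeps you from accidentally scribbling on it.

    mkraid --force --dangerous-no-resync /dev/md0
    mkdosfs -F 32 /dev/hdf2                  # the other half of the scratch disk
    mkdir -p /mnt/array /mnt/out
    mount -t ntfs -o ro /dev/md0p1 /mnt/array
    mount /dev/hdf2 /mnt/out
    cp -r /mnt/array/. /mnt/out/             # FAT32 won't keep NTFS ownership/ACLs anyway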

When I was done, I had an IDE drive I could jam in a Windows computer to read the data back onto the newly rebuilt server. The trickiest parts are reading the drives by hand to determine how they need to be reassembled, and then making raidtools work with partitioned md devices.

As far as “reading the drives by hand” goes, that could be automated if I ever took the time to learn more about NTFS. Reading it tonight, and working with a little Python I wrote, I noticed things about it, like two bytes that seem to hang out around the end of every 4KiB (I think) block, and the fact that the second byte before the start of a long file name seems to indicate that file name’s length. Still, since there are only four parity types, and probably only about 5-10 block sizes I can even remotely call “sane,” someone with a little knowledge of the filesystem could automate the process of finding out the parameters without a tremendous amount of effort. (A much dumber brute-force take on the same idea is sketched below.)
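
Here’s roughly what that dumb version looks like: skip the NTFS byte-spelunking entirely and just brute-force the raidtab parameters, using a read-only NTFS mount as the sanity check. It leans on the patched mkraid from step 6, the loop devices from step 5, and a hypothetical raidtab.template with @PARITY@ and @CHUNK@ placeholders, so treat it as a sketch of the idea rather than something to paste in blind; a successful mount is only a first pass, and you should eyeball the files before trusting the geometry.

    #!/bin/sh
    # Brute-force the RAID 5 geometry: try each parity algorithm against a
    # handful of plausible chunk sizes, assemble the array from the image
    # copies (never the originals), and see if the result mounts as NTFS.
    mkdir -p /mnt/test
    for parity in left-symmetric left-asymmetric right-symmetric right-asymmetric; do
        for chunk in 16 32 64 128 256; do
            sed -e "s/@PARITY@/$parity/" -e "s/@CHUNK@/$chunk/" \
                raidtab.template > /etc/raidtab
            mkraid --force --dangerous-no-resync /dev/md0 || continue
            # if md0p1 doesn't show up, try: blockdev --rereadpt /dev/md0
            if mount -t ntfs -o ro /dev/md0p1 /mnt/test 2>/dev/null; then
                echo "looks sane: parity=$parity chunk=${chunk}KiB"
                umount /mnt/test
            fi
            raidstop /dev/md0
        done
    done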