Adventures with Linux RAID

RAID rules. If you’re unfamiliar with it, RAID combines multiple disks into a single array in order to improve on the speed or reliability of a single hard disk. There are several ways of arranging the disks, the simplest and most common of which are:

RAID-0
also known as striping, where data is spread across two (or more) disks to improve read and write performance. For a given data transaction, each drive only needs to do a share of the work (half, with two drives), so in the ideal case the array runs up to twice as fast as a single drive. The array size is equal to the sum of the sizes of the drives. The downside is a decrease in reliability: if any drive fails, all data on the array is lost.
RAID-1
aka mirroring, where data is duplicated across two or more drives. Read performance is improved, as multiple reads can be shared across drives; write performance is degraded as all drives have to write the data. Reliability is improved: all but one of the drives can fail and the data survives. The kicker is the capacity, and hence the cost per unit storage: the array size is equal to the size of a single drive.
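
To make the capacity and reliability trade-off concrete, here is a rough Python sketch. The function names are my own invention for illustration, and the failure figures assume that drives fail independently of one another, which real drives sharing a case and a power supply do not strictly do.

```python
def array_capacity(level, drive_sizes_gb):
    """Usable capacity in GB for RAID-0 or RAID-1, as described above."""
    if len(drive_sizes_gb) < 2:
        raise ValueError("an array needs at least two drives")
    if level == 0:
        # Striping: the array is as big as all the drives put together.
        return sum(drive_sizes_gb)
    if level == 1:
        # Mirroring: every drive holds the same data, so the array is
        # only as big as one drive (the smallest, if they differ).
        return min(drive_sizes_gb)
    raise ValueError("only RAID-0 and RAID-1 are covered here")


def array_loss_probability(level, p_drive, n_drives):
    """Chance of losing the array, given each drive fails with probability
    p_drive over some period. Assumes independent failures."""
    if level == 0:
        # Lost if ANY drive fails.
        return 1 - (1 - p_drive) ** n_drives
    if level == 1:
        # Lost only if ALL drives fail.
        return p_drive ** n_drives
    raise ValueError("only RAID-0 and RAID-1 are covered here")


if __name__ == "__main__":
    for level in (0, 1):
        print("RAID-%d:" % level,
              array_capacity(level, [40, 40]), "GB,",
              "loss probability", array_loss_probability(level, 0.05, 2))
```

With two 40GB drives, striping gives 80GB and mirroring gives 40GB. Taking a purely hypothetical 5% chance of any one drive failing, the striped pair is lost about 9.75% of the time and the mirrored pair about 0.25% of the time.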

The array can be controlled either in hardware (a RAID adapter card) or in software. If you want to know more, see the RAID entry at Wikipedia.

The server on which this web site (and several others) resided at the time of this tale was a 1995 Pentium-100 machine (since upgraded to a 1999 PII/450) running Linux, with its data spread across two small (about 1GB) old drives running in a RAID-1 configuration, controlled via software.

A couple of days ago, I noticed this in the system log:

Oct 19 15:04:43 hda: multwrite_intr: status=0x61 { DriveReady DeviceFault Error }
Oct 19 15:04:43 hda: multwrite_intr: error=0x04 { DriveStatusError }
Oct 19 15:05:21 hda: write_intr error1: nr_sectors=8, stat=0x61
Oct 19 15:05:21 hda: write_intr: status=0x61 { DriveReady DeviceFault Error }
Oct 19 15:05:21 hda: write_intr: error=0x04 { DriveStatusError }

…followed by similar messages a couple of hours later, then total drive failure later in the evening:

Oct 19 22:46:49 hda: read_intr: error=0x10 { SectorIdNotFound }, LBAsect=44823, sector=44823
Oct 19 22:46:49 end_request: I/O error, dev hda, sector 44823
Oct 19 22:46:49 raid1: Disk failure on hda1, disabling device.
Oct 19 22:46:49 raid1: hda1: rescheduling sector 44760

Here we see the RAID subsystem taking the affected hard drive hda out of the array, leaving the surviving drive hdc to hold the fort on its own.
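
I only spotted the failure by reading the log by hand. A small script could watch for it instead; the following is a minimal sketch that matches only the exact wording of the raid1 message quoted above. Other kernel versions may phrase the message differently, so treat the pattern as an assumption, and the function name is my own.

```python
import re

# Matches the failure line exactly as it appears in the log excerpt above.
# The wording may differ between kernel versions.
FAILURE_RE = re.compile(
    r"raid1: Disk failure on (?P<device>\w+), disabling device")


def failed_raid1_members(log_lines):
    """Return the devices that raid1 reported as failed, in order seen."""
    failed = []
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match and match.group("device") not in failed:
            failed.append(match.group("device"))
    return failed


# The four lines from the log excerpt above.
SAMPLE = [
    "Oct 19 22:46:49 hda: read_intr: error=0x10 { SectorIdNotFound }, "
    "LBAsect=44823, sector=44823",
    "Oct 19 22:46:49 end_request: I/O error, dev hda, sector 44823",
    "Oct 19 22:46:49 raid1: Disk failure on hda1, disabling device.",
    "Oct 19 22:46:49 raid1: hda1: rescheduling sector 44760",
]

if __name__ == "__main__":
    print(failed_raid1_members(SAMPLE))
```

Run against the excerpt, this picks out hda1 as the failed member. Pointing it at the real system log and mailing yourself the result from cron would be one way to find out sooner than I did.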

I had also arranged the system’s swap partitions as RAID-1, so despite the drive going down, the machine continued to run for a couple of days whilst I got a pair of new 40GB disks from dabs.com and prayed that hdc wouldn’t also go down in the meantime.

I successfully installed the new drives and copied the data across. A warning, though: setting up large multi-gigabyte arrays on old hardware takes a long time. My old server took two hours to sync up an empty 38GB array, which meant a later bedtime than expected for me…
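
For anyone planning a similar job, the figures above give a rough idea of the sync rate. This is back-of-envelope arithmetic only, assuming decimal gigabytes (1GB = 1000MB) and a constant rate throughout; the function names are made up for the example.

```python
def sync_throughput_mb_s(array_gb, hours):
    """Average sync rate in MB/s, assuming 1 GB = 1000 MB."""
    return array_gb * 1000 / (hours * 3600)


def estimated_sync_hours(array_gb, throughput_mb_s):
    """Hours needed to sync an array at a constant rate in MB/s."""
    return array_gb * 1000 / throughput_mb_s / 3600


if __name__ == "__main__":
    rate = sync_throughput_mb_s(38, 2)
    print("My old server managed roughly %.1f MB/s" % rate)
    print("An 80GB array at that rate: about %.1f hours"
          % estimated_sync_hours(80, rate))
```

My 38GB in two hours works out at a little over 5MB/s, so a bigger array on similar hardware would take proportionally longer.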

As the failed drive had contained the LILO boot sector, the system had to be booted from an emergency floppy, but using mount root=/dev/md0 at the boot prompt booted up the degraded single-disk array with no problems.

If you use Linux, and are tempted to try out RAID, go for it. It’s free if you already have the drives, and can increase performance and/or reliability. Just make sure you back up important data before you create the arrays, and remember: RAID is not an alternative to keeping backups. Something like a fire or a strong power surge could easily knock out both drives together. Also, if you accidentally delete or corrupt a file, the deletion or corruption is immediately duplicated on the other drive!
