Tuesday, June 3, 2014

Checking and repairing a RAID in Linux

Recently I've been having a weird issue: I sit down at my computer after it has been idle for a while and find a black screen with a blinking "_" in the top left corner.  The only way I have found to recover is the Alt+SysRq (Print Screen)+REISUB key sequence, which forces an emergency file system sync and a reboot (click the link for details on all the inputs).  Once the machine was back up, I started researching the cause by going through dmesg and kern.log.  I also ran some smartctl tests and noticed bad blocks on the drives in my RAID 5 (4x2TB).

I started down the rabbit hole of repairing bad blocks, only to find out I could be doing more harm than good.  I vaguely remember attempting this before on a non-RAID disk and ending up with more unusable blocks than when I started.  Before doing too much damage to my RAID, I decided to do some more research.  It turns out that with a Linux software RAID (mdadm), I can find and repair these problems with one simple command:

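# "sudo echo ... > file" fails because the redirect runs as your own user; pipe through sudo tee instead
echo check | sudo tee /sys/block/md0/md/sync_action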

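Of course, my RAID is on md0, so change this to the name of your own array device if it is different.  You can unmount the volume first (sudo umount /dev/md0) if you want to be cautious, but it is not required: md is designed to run this check on a live, mounted array, and distributions such as Debian schedule it that way.  Note that this is a consistency check of the array itself, not a filesystem check, and the command returns immediately without reporting any progress.

To check up on the progress, issue: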

watch cat /proc/mdstat

This will take a long time, depending on the size of your drives; mine started out with an estimate of ~290 minutes to finish.  To quit watching, press Ctrl+C.
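
If you would rather log the progress than stare at watch, the figures can be pulled out of /proc/mdstat with a little awk.  This is only a sketch: mdstat_progress is a name I made up (it is not part of mdadm), and it assumes the usual mdstat layout where the progress line sits inside the array's stanza and contains a "finish=" field.

```shell
#!/bin/sh
# mdstat_progress ARRAY
# Hypothetical helper: reads mdstat-formatted text on stdin and prints
# "<percent> <time remaining>" for ARRAY, or nothing if no check, repair
# or resync is running on it.
mdstat_progress() {
    awk -v dev="$1" '
        # The stanza for an array starts with a line such as "md0 : active raid5 ..."
        $1 == dev && $2 == ":" { in_stanza = 1; next }
        # A blank line ends the stanza
        in_stanza && NF == 0   { in_stanza = 0 }
        # The progress line is the one carrying a finish= estimate
        in_stanza && /finish=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       { pct = $i; sub(/^=/, "", pct) }
                if ($i ~ /^finish=/) { eta = substr($i, 8) }
            }
            print pct, eta
        }
    '
}
```

Call it as "mdstat_progress md0 < /proc/mdstat"; it prints the percentage done and the estimated time remaining, which is easy to append to a log file from a loop or cron job.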

To stop a running check (checkarray's -x is its cancel option):

sudo /usr/share/mdadm/checkarray -x /dev/md0

To start the check again:

sudo /usr/share/mdadm/checkarray -a /dev/md0
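
Before cancelling or restarting anything, it helps to know what the array is doing right now.  The same sync_action file that accepts 'check' can also be read, and it reports the current state (for example idle, check or repair).  A minimal sketch, where sync_state and is_idle are hypothetical helper names of my own and the optional second argument is only there so the sysfs directory can be overridden:

```shell
#!/bin/sh
# sync_state ARRAY [MD_DIR]
# Prints the current md action for ARRAY, e.g. "idle" or "check".
# MD_DIR defaults to /sys/block/ARRAY/md.
sync_state() {
    state_dir="${2:-/sys/block/$1/md}"
    cat "$state_dir/sync_action"
}

# is_idle ARRAY [MD_DIR]
# Succeeds (exit status 0) only when no check, repair or resync is running.
is_idle() {
    [ "$(sync_state "$@")" = "idle" ]
}
```

For example, "is_idle md0 || echo 'md0 is busy'" avoids kicking off a second check on top of one that is already running.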

Once it has completed, check the mismatch count:

cat /sys/block/md0/md/mismatch_cnt

If it prints 0, then you're all set and your array is as consistent as it can be.  If it prints something other than 0, you can resynchronize the mismatched blocks by issuing:

echo repair | sudo tee /sys/block/md0/md/sync_action
watch cat /proc/mdstat

And, once the repair is complete, check it again:

echo check | sudo tee /sys/block/md0/md/sync_action
watch cat /proc/mdstat
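
The whole check, repair, re-check cycle above can be strung together so it runs unattended.  Treat this as a rough sketch rather than a finished tool: the function names are hypothetical (they are not part of mdadm), it assumes sync_action reads "idle" once an operation has finished, and scrub_array has to run as root because it writes to sysfs directly.

```shell
#!/bin/sh
# needs_repair ARRAY [MD_DIR]
# Succeeds when mismatch_cnt is non-zero. MD_DIR defaults to /sys/block/ARRAY/md.
needs_repair() {
    md_dir="${2:-/sys/block/$1/md}"
    [ "$(cat "$md_dir/mismatch_cnt")" -ne 0 ]
}

# wait_until_idle ARRAY [MD_DIR] [SECONDS]
# Polls sync_action (every 60 seconds by default) until it reads "idle".
wait_until_idle() {
    md_dir="${2:-/sys/block/$1/md}"
    poll_seconds="${3:-60}"
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$poll_seconds"
    done
}

# scrub_array ARRAY
# Runs a check; if mismatches were counted, runs a repair and then a second
# check. Run as root, since it writes to /sys/block/ARRAY/md/sync_action.
scrub_array() {
    md_dir="/sys/block/$1/md"
    echo check > "$md_dir/sync_action"
    wait_until_idle "$1"
    if needs_repair "$1"; then
        echo repair > "$md_dir/sync_action"
        wait_until_idle "$1"
        echo check > "$md_dir/sync_action"
        wait_until_idle "$1"
    fi
    echo "mismatch_cnt: $(cat "$md_dir/mismatch_cnt")"
}
```

After sourcing the file as root, "scrub_array md0" does the rest.  With drives this size expect it to run for many hours, so start it inside screen or tmux.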

For more info, check out the Thomas Krenn Wiki.