Wednesday, June 11, 2014

Replace a RAID 5 disk that has failed (Linux / Ubuntu)

If you take a look at my last couple of blog entries, you'd know that I had a hard drive approaching imminent failure.
I got the new drive in the mail from Amazon: a Western Digital Green (WD20EZRX), a different model than the failing one but the same size (2.0 TB).  Once I had satisfied myself that a SATA 3 drive was going to work on a SATA 2 bus, I went for the purchase.

On to the replacement:
Using mdadm, mark the disk as failed so the array stops using it:
sudo mdadm --fail /dev/md0 /dev/sda1
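
One extra housekeeping step that may apply: a failed member stays listed in the array until it is removed.  If sudo mdadm --detail /dev/md0 still shows the disk after the fail step, remove it before powering down:

sudo mdadm --remove /dev/md0 /dev/sda1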

If you are like me and you don't know which physical drive is which, use the Disk Manager tool and write down the serial number of the failing drive.  It will match the number on the printed label of the physical drive.  Note: it is handy to tape a piece of paper inside your computer case listing all of your drive serial numbers and their associated partitions for future reference.  I had actually forgotten that I did this the last time a drive failed, wrote down the serial number of my drive, and then realized the paper was already in the computer.
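
If you'd rather grab the serial numbers from the command line, a quick loop over smartctl works too.  This is just a sketch: it assumes smartmontools is installed and that your four drives are /dev/sda through /dev/sdd.

# print each drive's serial number as reported by SMART
for d in /dev/sd[a-d]; do
  echo -n "$d: "
  sudo smartctl -i "$d" | grep -i 'serial number'
done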

Power down the machine, remove the faulty drive, and replace with the new one.

Once the drive is replaced, power on your computer.  You should see a /dev/md0 fail event during startup.  Mine said something to the effect of "3 out of 4 devices available, 1 removed," etc.
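
You can confirm the degraded state from the command line as well; either of these standard commands will show the array running with a missing member:

cat /proc/mdstat
sudo mdadm --detail /dev/md0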

Next, partition the new drive with fdisk:
sudo fdisk /dev/sda

This will bring you into the fdisk program.  Type m for the help menu and available input options.  Perform these in order:
p - print the current configuration and verify there is no partition already.  This is a quick idiot check to make sure you are configuring the correct drive.
n - new partition
p - make it a primary partition
<enter> - accept the default start sector (should be 2048)
<enter> - accept the default end sector (should be the end of the hard drive)
t - change the type of the partition
fd - set the partition type to Linux RAID autodetect
p - print the partition table one more time and verify all of your settings are correct
w - write your changes to disk and exit

This will write the new partition table, exit fdisk, and return you to the command line.  Run sudo partprobe to make sure your system recognizes the new partition.
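
If you ever need to script this step instead of answering fdisk's prompts, sfdisk can do the same thing in one line.  This is only a sketch assuming a reasonably recent util-linux, and sfdisk overwrites the partition table without asking, so triple-check the device name first:

# one whole-disk primary partition, type fd (Linux RAID autodetect)
echo 'type=fd' | sudo sfdisk /dev/sda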

Tell mdadm that the drive is now available:
sudo mdadm --add /dev/md0 /dev/sda1

The array will now rebuild, reconstructing the contents of the new sda1 partition from the other three drives.  This will take some time, but can be monitored:
watch cat /proc/mdstat

It is important to leave your machine on and uninterrupted until the rebuilding process is complete.
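
If watching mdstat isn't detailed enough, mdadm will also report the rebuild state and per-device roles directly.  This only queries the array; it doesn't change anything:

sudo mdadm --detail /dev/md0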

Isn't RAID 5 a beauty?  I love having automatic protection against drive failure... assuming no more than one drive fails at a time.  I hope you found this useful.  If you have any questions or comments, feel free to post them below.

Next up: creating a RAID 1 from my existing system drive and a spare unused drive I've had sitting around... without losing any data.  Should be fun!

Tuesday, June 3, 2014

Checking and repairing a RAID in Linux

Recently I've been having a weird issue where I sit down at my computer after it has been idle for a while and find a black screen with a blinking "_" in the top left corner.  The only way I can recover from this is to issue the Alt+PrtSc+REISUB magic SysRq sequence to force an emergency filesystem sync and reboot.

Once the machine was back up, I started researching what caused the issue by checking dmesg and kern.log.  I also ran some smartctl tests and noticed there were bad blocks on my RAID 5 (4x2TB).  I started down the rabbit hole of repairing bad blocks manually, only to find out I could be causing more harm than good; I vaguely remember attempting this before on a non-RAID disk and ending up with more unusable blocks than when I started.  Before doing too much damage to my RAID, I decided to do some more research.  It turns out that with a Linux software RAID (mdadm), I can easily find and repair these issues using one simple command:

echo 'check' | sudo tee /sys/block/md0/md/sync_action

Of course, my RAID is on md0, so change this if your array lives on a different md device.  Also note the echo ... | sudo tee form: the more obvious sudo echo 'check' > ... fails, because the redirection is performed by your unprivileged shell rather than by sudo.  It doesn't hurt to run this while the volume is unmounted (sudo umount /dev/md0), but be aware that this is an array consistency check at the md level, comparing data and parity across the member disks, not a filesystem check.  The command starts the check but will not keep you up to date on its progress.
To check up on the progress, issue:

watch cat /proc/mdstat

This will take a long time, depending on the size of your drives; mine started out with an estimated ~290 minutes to finish.  To quit watching, press Ctrl+C.
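
The kernel also exposes raw progress counters if you want them; this sysfs file reports sectors completed versus total while a check is running:

cat /sys/block/md0/md/sync_completed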

To pause the check:

sudo /usr/share/mdadm/checkarray -x /dev/md0

Resume:

sudo /usr/share/mdadm/checkarray -a /dev/md0
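
If you don't have the checkarray helper (it's a script shipped with the Debian/Ubuntu mdadm package), you can cancel a running check through sysfs instead; writing idle aborts the current action:

echo idle | sudo tee /sys/block/md0/md/sync_action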

Once it has completed, check the mismatch count:

cat /sys/block/md0/md/mismatch_cnt

If it outputs 0, then you're all set and your RAID array is as repaired as it can be.  If it returns something other than 0, you can resynchronize the blocks by issuing:

echo 'repair' | sudo tee /sys/block/md0/md/sync_action
watch cat /proc/mdstat

And, once the repair is complete, run the check again:

echo 'check' | sudo tee /sys/block/md0/md/sync_action
watch cat /proc/mdstat
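
If you'd rather not babysit the repair, a small loop can wait for the array to go idle between steps.  A sketch, again assuming your array is md0:

echo 'repair' | sudo tee /sys/block/md0/md/sync_action
sleep 5  # give the kernel a moment to start the action
# wait until the md layer reports no action in progress
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do sleep 60; done
# run the verification pass and wait for it too
echo 'check' | sudo tee /sys/block/md0/md/sync_action
sleep 5
while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do sleep 60; done
cat /sys/block/md0/md/mismatch_cnt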

For more info, check out the Thomas Krenn Wiki.