First off, let me say that no one in their right mind should ever use software RAID. Never! At least not in a production environment. You can do it at home all you want, or if you really hate yourself and want to support this stuff yourself and deal with the headaches. In a real environment, if you need RAID, pony up the money and get a real hardware RAID controller. Like I like to say, if you want to play with the big boys you need big-boy toys.
In any case, I ran across a DIY setup of Openfiler used as an iSCSI target. While Openfiler looks like a great system, I would never use it in a production environment unless the company has purchased support from Openfiler or the system is non-critical. I never want someone breathing down my neck because their systems are down and there is something wrong inside our DIY SAN… Oh, and this Openfiler system also had not been updated in quite a few years.
Inside the system there were 4 SATA disks, each 1TB in size. On each disk there were 4 partitions, and each set of partitions was one MD device (more on this later). One disk had failed (as was visible from smartctl) and my md3 had gone bonkers and “forgot” about all the other disks in its RAID5 array…
kernel: Buffer I/O error on device md3, logical block 0
The above error was followed by a lot of errors complaining that /dev/sdc was resetting and acting funky. Unfortunately I didn’t save the logs about /dev/sdc. After seeing the errors in dmesg and looking at the output of mdadm --misc --detail /dev/mdX for each array, I noted that the 4 drives were /dev/sda, sdb, sdc and sdd. The dead one was sdc. Just for kicks I looked at the SMART output of all the disks like so:
smartctl -a /dev/sdX |less
And when I got to sdc I found:
smartctl -a /dev/sdc |less
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED
If you look at the very top of smartctl’s output you will see the model and serial number of the drive, which tells you which physical drive to pull. Now you can remove the offending drive and replace it with a spare.
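If paging through the full report for every disk gets tedious, the same check can be boiled down to one line per drive. This is only a minimal sketch: it assumes the four disks are /dev/sda through /dev/sdd as on this NAS, and `smart_verdict` is a helper name I made up for illustration. It relies on `smartctl -H`, which prints just the health section.

```shell
#!/bin/sh
# Sketch: print the overall SMART health verdict for each disk.
# Assumption: the disks are /dev/sda through /dev/sdd, as on this NAS.

# Hypothetical helper: pull the verdict (e.g. PASSED or FAILED)
# out of smartctl output read from stdin.
smart_verdict() {
    awk -F': *' '/overall-health self-assessment test result/ { print $2; exit }'
}

# Only touch real hardware when the script is called with "scan".
if [ "${1:-}" = "scan" ]; then
    for disk in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        verdict=$(smartctl -H "$disk" | smart_verdict)
        # Fall back to "unknown" if smartctl printed no verdict line.
        printf '%s: %s\n' "$disk" "${verdict:-unknown}"
    done
fi
```

Run it as root with `scan` as the only argument. Any disk that does not report PASSED deserves a closer look with the full `smartctl -a` output.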
After replacing the drive and, of course, rebooting (because this software RAID setup is not hot-swappable), let’s do some diagnostics:
[root@NAS ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid10] [raid1]
md2 : active raid5 sdd2[3] sdb2[2] sda2[0]
2047488 blocks level 5, 256k chunk, algorithm 2 [4/3] [U_UU]
md1 : active raid5 sdd3[3] sdb3[2] sda3[0]
2047488 blocks level 5, 256k chunk, algorithm 2 [4/3] [U_UU]
md3 : inactive sdc4[0]
975289984 blocks
md0 : active raid1 sdd1[3] sdb1[2] sda1[0]
104320 blocks [4/3] [U_UU]
unused devices: <none>
[root@NAS ~]# mdadm --misc --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Mon Jun 22 13:02:20 2009
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Used Dev Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Wed Jul 25 04:02:41 2012
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
UUID : e0068cb2:4b28ece0:acb7027f:d0e9fe04
Events : 0.300
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 0 0 1 removed
2 8 17 2 active sync /dev/sdb1
3 8 49 3 active sync /dev/sdd1
[root@NAS ~]# mdadm --misc --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Mon Jun 22 13:02:17 2009
Raid Level : raid5
Array Size : 2047488 (1999.84 MiB 2096.63 MB)
Used Dev Size : 682496 (666.61 MiB 698.88 MB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Tue Jul 24 17:49:28 2012
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 256K
UUID : 2aa930c4:81a86e80:779bb96e:10f00225
Events : 0.12
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 0 0 1 removed
2 8 19 2 active sync /dev/sdb3
3 8 51 3 active sync /dev/sdd3
[root@NAS ~]# mdadm --misc --detail /dev/md2
/dev/md2:
Version : 00.90.03
Creation Time : Mon Jun 22 13:02:17 2009
Raid Level : raid5
Array Size : 2047488 (1999.84 MiB 2096.63 MB)
Used Dev Size : 682496 (666.61 MiB 698.88 MB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Wed Jul 25 09:02:51 2012
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 256K
UUID : e2155b4d:3c6de8f8:1daf70df:e11f16e4
Events : 0.397292
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 0 0 1 removed
2 8 18 2 active sync /dev/sdb2
3 8 50 3 active sync /dev/sdd2
[root@NAS ~]# mdadm --misc --detail /dev/md3
/dev/md3:
Version : 00.90.03
Creation Time : Fri Apr 10 14:13:18 2009
Raid Level : raid5
Used Dev Size : 975289984 (930.11 GiB 998.70 GB)
Raid Devices : 4
Total Devices : 1
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Sun Jun 24 06:22:15 2012
State : active, degraded, Not Started
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
UUID : 3b78b4b1:737b20b7:bde21cba:08717914
Events : 0.93039
Number Major Minor RaidDevice State
0 8 36 0 active sync /dev/sdc4
1 0 0 1 removed
2 0 0 2 removed
3 0 0 3 removed
As you can see, all the other arrays lost just their /dev/sdcX members, while /dev/md3 lost all its other drives and thinks that /dev/sdc4 is its only member. Let me just add here that the new sdc came from another Openfiler NAS that had a motherboard failure. This other NAS had the exact same partition layout as our NAS with the dead drive. My guess is that what mdadm shows for md3 is the old superblock that came along on the donor disk: note that its creation and update times are older than those of the other arrays. So we need to fix up our md3 RAID5… but first we need to add the new disk to the healthy(ish) arrays.
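The exact commands for this step did not make it into the post, so what follows is a sketch under assumptions, not a record of what was typed. It assumes the replacement disk is /dev/sdc and reuses the array-to-partition mapping visible in the `mdadm --detail` output (md0 uses partition 1, md1 uses partition 3, md2 uses partition 2). Wiping the leftover superblock first is my own precaution, since the disk came out of another array. The script only prints the commands so they can be reviewed before anything is run. `plan_readd` and `NEW_DISK` are names invented for this example, and md3 is deliberately left out.

```shell
#!/bin/sh
# Sketch: print (not run) the mdadm commands that would put the new
# disk's partitions back into the three degraded arrays.
# Assumption: the replacement disk is /dev/sdc and is already
# partitioned like the other three disks.

NEW_DISK=${NEW_DISK:-/dev/sdc}

# Array-to-partition mapping read off the mdadm --detail output:
# md0 <- partition 1, md1 <- partition 3, md2 <- partition 2.
plan_readd() {
    for pair in md0:1 md1:3 md2:2; do
        array=${pair%%:*}   # text before the colon, e.g. md0
        part=${pair##*:}    # text after the colon, e.g. 1
        # Assumed precaution: clear the md superblock left over from
        # the donor NAS before adding the partition here.
        echo "mdadm --zero-superblock ${NEW_DISK}${part}"
        echo "mdadm --manage /dev/${array} --add ${NEW_DISK}${part}"
    done
}

plan_readd
```

Once the printed commands look right, they can be pasted into a root shell one at a time, checking /proc/mdstat after each add to confirm that a recovery has started.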
That’s all. Now we are just waiting for the rebuild to complete to see if all the data is there. I will update this when the rebuild is done.
UPDATE: We still lost data but were able to recover some stuff. You may have better luck than I did.
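For the waiting part, the kernel reports rebuild progress in /proc/mdstat on a line containing `recovery = NN.N%` (or `resync = NN.N%`) under the array being rebuilt. A minimal sketch that pulls out just the array name and the percentage is below. `rebuild_progress` is a name made up for this example, and it takes an optional file argument so it can also be pointed at a saved copy of the output.

```shell
#!/bin/sh
# Sketch: show rebuild progress per array from /proc/mdstat.
# Usage: rebuild_progress [file]   (file defaults to /proc/mdstat)

rebuild_progress() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { array = $1 }
        # The progress line carries the percentage as its own field.
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print array ": " $i
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `rebuild_progress` every minute or so from a loop is enough to keep an eye on things. It prints nothing once no array is rebuilding any more.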