[opensuse] Software RAID dying, help/suggestions please

I have a software Raid5 with 4 1TB drives. I first found that the system was hung and couldn't get anything on the screen; I rebooted and did some poking around before it went down again. I do have a spare drive but am not sure which drive it is. I think I read a drive number giving some errors. Right now that PC is off till I get some information and a plan of action. Each time the server is on, it's up for about 30 min, then is unresponsive. Below is some of the output. I'd appreciate any help; I'm not used to software raid issues.

/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Nov  3 01:30:34 2008
     Raid Level : raid5
     Array Size : 2930285184 (2794.54 GiB 3000.61 GB)
  Used Dev Size : 976761728 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Jan 13 18:28:23 2012
          State : active, resyncing
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 1% complete

           UUID : 9327adce:9242bb1a:5ddee0bc:b536709a
         Events : 0.11

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1

Oh yeah: openSUSE 11.1 on an Intel Celeron 600MHz with 512MB RAM, 3ware 8-port SATA in JBOD mode.

Can I turn off the auto fixing to see if the system becomes stable enough to work on?

Thanks, Cody
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org

On 2/4/2012 3:28 PM, Cody Nelson wrote:
I have a software Raid5 with 4 1TB drives. [snip]
Can I turn off the auto fixing to see if the system becomes stable enough to work on?
No, I wouldn't turn off auto fixing. It's corrupt, and who knows what you will end up with if you don't let it rebuild. It's in a rebuild state; leave it alone and see if it can recover, would be my advice.

Put up a top display to make sure it is still rebuilding, leave it alone, and keep users off till it digs itself out. It's a Celeron, for Christ's sake!!! Maybe go buy some RAM?

You might want to add the spare to the array now, so that if mdadm decides a drive is toast it can swap in the new one, and maybe tail /var/log/warn for a while.

(This is why I never put root on raid5: you can always re-install the OS on a fresh drive, but rebuilding a raid 5 can be really painful when it includes / and the system itself.)
--
_____________________________________
---This space for rent---
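John's suggestion (add the spare, watch the rebuild, tail the warning log) can be sketched as a short script. The spare's device name /dev/sde1 is a placeholder, not something confirmed in the thread; with the default DRY_RUN=1 it only prints each command instead of running it:

```shell
#!/bin/sh
# Sketch only: /dev/sde1 stands in for the spare's RAID partition --
# confirm the real device name first. DRY_RUN=1 prints instead of runs.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

# Register the spare so md can rebuild onto it if a member fails.
run mdadm /dev/md0 --add /dev/sde1

# Check that the rebuild is progressing, and watch for drive errors.
run cat /proc/mdstat
run tail -f /var/log/warn
```

Set DRY_RUN=0 only after verifying the spare's device name against its serial number.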

I spent a day and watched it rebuild, killed off most services, and made backups of family photos and such during this time. (I also pulled the log files.) It took ~20 hrs to rebuild, and it didn't crash during this time (the previous 2 times it died in about an hour). The system appeared good and stable, but went down the next day. The box is currently off. Hopefully this Saturday I will be able to follow the directions on here to add the hot spare.

I just don't understand how something under /exports/array1/ with no system files could take down the box. I'm going to comb through the logs again to see if there is anything else, but so far only the 1 drive reported errors.

Thanks again,
-Cody

On Sat, Feb 4, 2012 at 5:39 PM, John Andersen <jsamyth@gmail.com> wrote:
On 2/4/2012 3:28 PM, Cody Nelson wrote:
[snip]
No, I wouldn't turn off auto fixing. It's corrupt, and who knows what you will end up with if you don't let it rebuild.
It's in a rebuild state. Leave it alone, and see if it can recover, would be my advice.
Put up a top display to make sure it is still rebuilding, leave it alone, and keep users off till it digs itself out. It's a Celeron, for Christ's sake!!!
Maybe go buy some ram?
You might want to add the spare to the array now, so that if mdadm decides a drive is toast it can swap in the new one, and maybe want to tail /var/log/warn for a while.
(This is why I never put root on raid5, you can always re-install the OS on a fresh drive, but rebuilding a raid 5 can be really painful when it includes / and the system itself.)

Cody Nelson wrote:
I spent a day and watched it rebuild, killed off most services, and made backups of family photos and such during this time. (I also pulled the log files.) It took ~20 hrs to rebuild, and it didn't crash during this time (the previous 2 times it died in about an hour). The system appeared good and stable, but went down the next day. [snip] I just don't understand how something under /exports/array1/ with no system files could take down the box. I'm going to comb through the logs again to see if there is anything else, but so far only the 1 drive reported errors.
That does seem strange. It might suggest a problem at a more general level. Please post the output of 'df -h' and 'lspci' so we can see at least the basics of your system layout. Are you running smartd on the drives? If so, what does it say? If not, please do run it :) And do you have an mdadm monitor running? (Docs are at https://raid.wiki.kernel.org/) [snip]
On Sat, Feb 4, 2012 at 5:39 PM, John Andersen <jsamyth@gmail.com> wrote:
On 2/4/2012 3:28 PM, Cody Nelson wrote:
I have a software Raid5 with 4 1TB drives. I first found that the [snip] State : active, resyncing Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 [snip] Number Major Minor RaidDevice State 0 8 1 0 active sync /dev/sda1 1 8 17 1 active sync /dev/sdb1 2 8 33 2 active sync /dev/sdc1
One thing that puzzles me is that you say you have four drives and it says you have four drives, but only three are listed.
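Dave's checklist can be run roughly as below. This is a sketch, assuming the four members are /dev/sda through /dev/sdd as in the mdadm output, and that smartctl (from the smartmontools package) is installed; DRY_RUN=1 prints the commands instead of executing them:

```shell
#!/bin/sh
# Sketch of the suggested diagnostics. Device names are assumptions;
# adjust to your system. DRY_RUN=1 prints instead of runs.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

# Basic system layout: mounted filesystems and PCI devices.
run df -h
run lspci

# SMART health summary for each array member.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    run smartctl -H "$d"
done

# One-shot mdadm monitor pass; run as a daemon it can mail root on problems.
run mdadm --monitor --oneshot /dev/md0
```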

Cody Nelson wrote:
[snip]
/dev/md0: [snip] State : active, resyncing Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 [snip] Number Major Minor RaidDevice State 0 8 1 0 active sync /dev/sda1 1 8 17 1 active sync /dev/sdb1 2 8 33 2 active sync /dev/sdc1
What does "cat /proc/mdstat" say? It seems a little odd that you have 4 active and 4 working devices, but that only 3 are shown.
--
Per Jessen, Zürich (-11.2°C)
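A quick way to answer Per's question is to compare the kernel's and mdadm's views of the array (a sketch, assuming /dev/md0; DRY_RUN=1 prints rather than runs):

```shell
#!/bin/sh
# Compare kernel and mdadm views of the array. DRY_RUN=1 prints only.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

# Kernel's view: member list, [UUUU]-style health flags, resync progress.
run cat /proc/mdstat

# mdadm's view: per-device table. A member missing from the table while
# "Total Devices : 4" is still reported is worth investigating.
run mdadm -D /dev/md0
```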

Hi Cody,

Do not despair. You have a 4-disk raid-5 array and one of your hard drives seems to be dying. That is OK: one drive may fail on a raid 5 array without losing any data. From the information you provided you have no spare devices, so soft-raid is trying to re-sync on your four drives, and one of those is causing you problems. You did well by turning off the PC while you get some advice. Here are the steps that I think you should take:

1 - Map your physical to logical drive assignments. You need to know which physical drive maps to which logical one. With a standard kernel and standard hardware, SATA connector 1 on your motherboard should be /dev/sda, SATA connector 2 should be /dev/sdb, SATA 3 should be /dev/sdc and SATA 4 should be /dev/sdd (check your motherboard documentation on this). This may however not be the case. If you want to be sure, you need to issue a few commands for each of your hard drives and look for yourself.
- For this, boot your machine with the openSUSE recovery DVD, preferably adding the boot option 'raid=noautodetect'.
- After it boots, you should have all four drives. Issue 'cat /proc/partitions' and see if all drives are present.
- Then issue the following command line for each of the drives:
    hdparm -i /dev/sd{a,b,c,d} | grep SerialNo
  This will tell you, in logical order from /dev/sda to /dev/sdd, which physical model/serial number maps to each logical device. After you have written the results down, shut down your computer and find out "with your own eyes" which drive has each model/serial number. This part is now done.

2 - Identify which drive is failing. From the output you provided, it seems that /dev/sda, /dev/sdb and /dev/sdc are present and that /dev/sdd was, for some reason, kicked out of the array; so /dev/sdd is probably the one that is causing you problems. Once again, it may not be, and you should make sure which one it is. To do that:
- Boot your system as you did in step (1). However, you do need to make sure that the array is not rebuilt, preferably not even loaded. After you boot into recovery mode, issue the command 'mdadm -S /dev/md0'. The array is now stopped.
- Now, for each drive, try to read the entire content of the disk with the following command:
    dd if=/dev/sdX of=/dev/null bs=4k
  (replace the X with the drive you want: a, b, c, or d). This is a long process. To speed things up, you may issue four of these command lines simultaneously, but you will need to use four different consoles, one for each hard drive. If on any of these command lines you notice errors popping up, like a DMA error or an "error reading" something, you may cancel that command line and note that that drive has a problem.
- When the previous commands complete, look at the "records in" / "records out" output; they need to be exactly the same for each drive. If any drive fails to give the same records, that drive has a problem. If only one drive gives incorrect records, that is (probably) the drive that is causing your issue. If no drive reports any problem, your failing drive is probably /dev/sdd.
- After you have identified which logical drive is failing, map it to your physical drive using the information from step (1).
- Power off your machine, and disconnect THAT hard drive from your motherboard.

3 - Restart your array in degraded mode.
- Boot your system as you did in step (1).
- When it boots, issue 'cat /proc/mdstat'. If the array is present, it must be in a degraded state. You can also use 'mdadm -D /dev/md0' to check any array state.
- If you already have the array in a degraded state, go to step (4). If you do not have any array present, you must assemble it manually:
    mdadm -S /dev/md0   (just in case)
    mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
- Issue 'cat /proc/mdstat' or 'mdadm -D /dev/md0' to check your current array state. Your array should now be up and running in a degraded state.

4 - Perform a filesystem check on your array, something like (if ext3):
    fsck.ext3 -p /dev/md0
  (use -y instead of -p if you want every repair answered yes automatically; the two options cannot be combined). If you get an error on this command, your array may not have been started; issue 'mdadm -R /dev/md0'. After this, you should be able to boot your machine and run it in a degraded state. From this moment on, if any drive of the array fails, you will lose all data.

5 - Turn your array into a "full featured" raid5 array again.
- Get a new drive the same size or larger than the smallest of your four drives, and plug it into your machine.
- Partition it with fdisk, and give the raid partition type 0xfd.
- Issue the command:
    mdadm --manage /dev/md0 --add /dev/sdX1
  (replace X with the letter for the new drive). The array is now being re-synced. When it finishes, you will once again have a raid5 array that is "resistant" to ONE drive failure.

NOTE1: Hardware problems may be tricky to find and/or isolate. Although less probable, you may have an issue with your RAM, CPU, or a motherboard component instead of any of the hard drives. To rule those out, you may transfer your entire array to another machine. If the problem does not go away, then the hard drives are the cause; if it does, then you have a problem in one of the other components that I have mentioned.

NOTE2: With soft-raid devices, some kind of data-scrubbing script/procedure is mandatory. Hardware raid based solutions all use data scrubbing procedures. Use it to prevent failures in more than one hard drive at once.

Hope this helps you in some way.

Regards,
Rui

On 02/04/2012 11:28 PM, Cody Nelson wrote:
I have a software Raid5 with 4 1TB drives. [snip]
Can I turn off the auto fixing to see if the system becomes stable enough to work on?
Thanks, Cody
--
Rui Santos
Veni, Vidi, Linux
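Rui's steps can be consolidated into a single script for reference. This is only a sketch under his assumptions (four members /dev/sda../dev/sdd, array /dev/md0, ext3, the replacement partition becoming /dev/sdd1, with /dev/sdd as the suspect drive); DRY_RUN=1 prints each command instead of executing it:

```shell
#!/bin/sh
# Sketch of Rui's recovery procedure. Device names are assumptions;
# verify against your own hardware first. DRY_RUN=1 prints instead of runs.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

# Step 1: map logical names to physical drives via serial numbers.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    run hdparm -i "$d"              # note each drive's SerialNo field
done

# Step 2: stop the array, then read every sector of each member.
# A failing drive shows I/O errors or short "records in/out" counts.
run mdadm -S /dev/md0
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    run dd if="$d" of=/dev/null bs=4k
done

# Steps 3-4 (after physically removing the bad drive): assemble the
# remaining members degraded and check the filesystem.
run mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
run fsck.ext3 -p /dev/md0

# Step 5: partition the replacement (type 0xfd) with fdisk, then add it.
run mdadm --manage /dev/md0 --add /dev/sdd1
```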
participants (5)
- Cody Nelson
- Dave Howorth
- John Andersen
- Per Jessen
- Rui Santos