raid1: how to check for filesystem errors?

Leen de Braal

2 Sep 2005

10:31

I have a RAID1 system with 2x 400GB SATA drives. The system has frozen several times in the last 2 weeks. I tested the memory overnight and found no problems, and I checked the hard disks with the manufacturer's tools, which found no problems either. So I think that maybe something is wrong somewhere in the filesystem, but how do I check? Using reiserfsck from the rescue disk on the partitions does not work. Running the rescue disk does not freeze the system, by the way. So I do suspect (one of) the hard drives. What can I use to check?

--
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444

Sandy Drobic

2 Sep 2005

15:44

New subject: [SLE] raid1: how to check for filesystem errors?

Leen de Braal wrote:

...

I have a raid1 system with 2x 400GB SATA drives. System has been frozen several times in the last 2 weeks. I have been checking memory overnight, no problems, checked harddisks with manufacturertools, no problems found. So I think, that maybe there is something wrong somewhere in the filesystem, but how do I check? Using reiserfsck from rescuedisk on the partitions does not work. Running the rescuedisk does not freeze the system btw. So I do suspect (one of) the harddrives. What can I use to check??

Telepathic guess: you might have a driver module problem. Unfortunately you do not mention which version of SUSE you use, what hardware is involved (mainboard, LAN adapter, SATA controller), or even whether it is hardware RAID or software RAID.

I would check the logs from shortly before each system freeze and try to find a common cause for those freezes. Also try to provoke a system freeze (if it's not a production system in use).

You could check the SMART parameters of the disks with smartmontools.

Sandy
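As a rough illustration of the log check suggested above, here is a minimal Python sketch that counts libata status/error lines per port. It assumes the kernel messages look like the `kernel: ataN: status=0x..` lines quoted later in this thread and that syslog is written to /var/log/messages; both the path and the function names (`count_ata_errors`, `scan_log`) are assumptions made for this example, not something the posters used.

```python
import re
from collections import Counter

# Matches libata lines of the form quoted in this thread, e.g.
#   kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
ATA_LINE = re.compile(r"kernel: (ata\d+): (status|error)=(0x[0-9a-fA-F]+)")


def count_ata_errors(lines):
    """Count how often each (port, register, value) triple appears."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            port, register, value = match.groups()
            counts[(port, register, value.lower())] += 1
    return counts


def scan_log(path="/var/log/messages"):
    """Scan a syslog file; the default path is an assumption."""
    with open(path, errors="replace") as handle:
        return count_ata_errors(handle)
```

Calling `scan_log()` and printing the sorted items shows at a glance whether the errors always come from the same port, which would point at one drive or one cable rather than at the array as a whole.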

Leen de Braal

17:29

...

Leen de Braal wrote:

...
I have a raid1 system with 2x 400GB SATA drives. System has been frozen several times in the last 2 weeks. I have been checking memory overnight, no problems, checked harddisks with manufacturertools, no problems found. So I think, that maybe there is something wrong somewhere in the filesystem, but how do I check? Using reiserfsck from rescuedisk on the partitions does not work. Running the rescuedisk does not freeze the system btw. So I do suspect (one of) the harddrives. What can I use to check??

Telepathic guess: you might have a driver modules problem. Unfortunately you do not mention what version of Suse you use or what hardware is involved (mainboard, lan adapter, sata controller) or even if it is a hardware raid or software raid.

9.2, kernel is 2.6.8-24.17-default
Intel D865PERL mobo with P4-2.4, 512MB RAM
software raid, 2x 400GB Seagate SATA
Machine has been running fine for over 6 weeks

...

I would try to check the logs shortly before the time you experienced a system freeze, try to find common causes for those freezes. Also try to provoke a system freeze (if it's not a production system in use).

Log gives errors:

kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
kernel: ata2: error=0x40 { UncorrectableError }

This was the first freeze. It hung hard, and I pulled the power to reboot the machine. After that, it ran for 3 days and gave the same errors, this time five pairs of those lines in about 10 minutes, but it kept on logging (cron jobs and MARKs) until I rebooted again. I think it had already hung, because I could not log in: I could type my username, but the machine did not respond with the Password: prompt.

Today I have been testing this machine further. At some point I was running a rescue system and saw that the RAID was being resynced (cat /proc/mdstat). I let it run until it was finished, and now it seems to run fine (at least for the last 2 hours). I will keep it monitored, though.

This machine is my first with RAID, so bear with me if I seem stupid about this. Still have some questions:
- are those errors due to hard drive problems?
- or is it due to misconfiguration in software raid? (done it with yast)
- is it something software raid should correct? (I expected that raid was for redundancy and corrected errors)
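For readers who want to watch the resync mentioned above without rereading `cat /proc/mdstat` by hand, the following is a small Python sketch that pulls the member status and resync percentage out of mdstat-style text. The `SAMPLE` text is an illustrative example of the usual layout, not output from the machine discussed here, and the names `parse_mdstat` and `is_degraded` are made up for this sketch.

```python
import re

# Illustrative sample only; NOT captured from the machine in this thread.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      390708736 blocks [2/2] [UU]
      [==>..................]  resync = 12.6% (49251584/390708736) finish=95.3min speed=59654K/sec
unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"state", "members", "up", "sync_percent"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"(md\d+) : (\S+)\s*(.*)", line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "state": header.group(2),
                # Member devices are the tokens carrying a [slot] suffix.
                "members": [t for t in header.group(3).split() if "[" in t],
                "up": None,
                "sync_percent": None,
            }
            continue
        if current is None:
            continue
        up = re.search(r"\[([U_]+)\]", line)  # e.g. [UU] or [U_]
        if up:
            arrays[current]["up"] = up.group(1)
        sync = re.search(r"(?:resync|recovery)\s*=\s*([\d.]+)%", line)
        if sync:
            arrays[current]["sync_percent"] = float(sync.group(1))
    return arrays


def is_degraded(array):
    """A '_' in the status string marks a missing or failed member."""
    return array["up"] is not None and "_" in array["up"]
```

On a live system the text would come from `open("/proc/mdstat").read()`. A status of `[UU]` with no resync line means both mirror halves are present and in sync.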

...

You could check the SMART parameters of the disks with smartmon tools.

These drives seem not to have SMART, at least the manufacturer's tools say so.

...

Sandy

-- Check the headers for your unsubscription address For additional commands send e-mail to suse-linux-e-help@suse.com Also check the archives at http://lists.suse.com Please read the FAQs: suse-linux-e-faq@suse.com

-- L. de Braal BraHa Systems NL - Terneuzen T +31 115 649333 F +31 115 649444

Sandy Drobic

18:32

Leen de Braal wrote:

...

...
Telepathic guess: you might have a driver modules problem. Unfortunately you do not mention what version of Suse you use or what hardware is involved (mainboard, lan adapter, sata controller) or even if it is a hardware raid or software raid.

9.2, kernel is 2.6.8-24.17-default Intel D865PERL mobo with P4-2.4, 512MB RAM software raid, 2x 400GB Seagate SATA Machine has been running fine for over 6 weeks

Usually onboard chips of Intel are well supported by Linux. No special sata controller either. So, I don't think it's a driver problem.

...

...
I would try to check the logs shortly before the time you experienced a system freeze, try to find common causes for those freezes. Also try to provoke a system freeze (if it's not a production system in use).

Log gives errors: kernel: ata2: status=0x51 { DriveReady SeekComplete Error } kernel: ata2: error=0x40 { UncorrectableError }
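The two hex values in the quoted log lines are bit fields, and the names in braces are simply the bits that are set. A minimal Python sketch of that decoding follows. The four names that appear in the log (DriveReady, SeekComplete, Error, UncorrectableError) are taken from the thread; the remaining labels are the classic ATA register mnemonics, used here as an assumption for readability (newer ATA revisions redefine some of these bits).

```python
# Classic ATA status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DriveReady"),
    (0x20, "DF"),
    (0x10, "SeekComplete"),
    (0x08, "DRQ"),
    (0x04, "CORR"),
    (0x02, "IDX"),
    (0x01, "Error"),
]

# Classic ATA error register bits, highest bit first.
ERROR_BITS = [
    (0x80, "BBK"),
    (0x40, "UncorrectableError"),
    (0x20, "MC"),
    (0x10, "IDNF"),
    (0x08, "MCR"),
    (0x04, "ABRT"),
    (0x02, "TK0NF"),
    (0x01, "AMNF"),
]


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))
print(decode(0x40, ERROR_BITS))
```

So status=0x51 means the drive reported itself ready and flagged an error, and error=0x40 says the error was an uncorrectable data error, i.e. the drive could not read a sector back correctly.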

I have seen errors like that when the mainboard BIOS couldn't support the disk size. The system would write data to a block that the BIOS remapped to a wrong block (48-bit LBA not supported by the BIOS). The result was hard-to-reproduce system freezes and data loss.

In any case, I would check the Intel site to see whether an update for your mainboard BIOS is available.
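To put numbers on the disk-size limit mentioned above: without 48-bit LBA, sector addresses are 28 bits wide. The short Python sketch below works out what that means for a 400GB drive, assuming 512-byte sectors and a nominal capacity of exactly 400 * 10^9 bytes (both assumptions; the real drive will differ slightly). Whether this limit is actually behind the freezes is not established in the thread.

```python
SECTOR_BYTES = 512          # assumed sector size
LBA28_SECTORS = 2 ** 28     # addressable sectors with 28-bit LBA
LBA48_SECTORS = 2 ** 48     # addressable sectors with 48-bit LBA


def max_bytes(sectors):
    """Largest capacity addressable with the given sector count."""
    return sectors * SECTOR_BYTES


def needs_lba48(disk_bytes):
    """True if the disk is too large for 28-bit sector addressing."""
    return disk_bytes > max_bytes(LBA28_SECTORS)


disk = 400 * 10 ** 9  # nominal 400GB drive (assumption)
print(max_bytes(LBA28_SECTORS))   # 28-bit limit in bytes
print(disk // SECTOR_BYTES)       # sectors on the nominal drive
print(needs_lba48(disk))
```

The 28-bit limit comes out at 137,438,953,472 bytes (about 137GB), so roughly two thirds of a 400GB drive lies beyond what 28-bit addressing can reach.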

...

This was the first freeze. Hung hard, I pulled power to reboot the machine. After that, it ran for 3 days and gave same errors, this time five pairs of those lines in about 10 minutes, but it kept on logging (cronjobs and MARKs) until I rebooted again. I think it already hung, because I could not log in, I could just type in my username, but machine did not respond with Password:-prompt. Today I have been testing this machine further, and at some point I was running a rescue system, and saw that raid was being resyncd (cat /proc/mdstat). I let it go until it was ready, and now it seems to run fine (at least for the last 2 hours). Will keep it monitored though. This machine is my first with raid, so bear with me seeming stupid about this.

For what it's worth, I don't think you did anything wrong while setting up the RAID. I rather suspect the disks are unreliable, or maybe the combination of motherboard + SATA controller / SATA disks is not well balanced. There is a remote possibility that the system might require a more stable power supply unit, but I haven't seen anything in your description to suggest such a high power demand.

The only software/configuration options are to set the mainboard BIOS options more conservatively or to play with hdparm.

It might be worth the time to check the technical documentation of the disks for MTBF.

...

Still have some questions: - are those errors due to harddrive problems?

Very likely. Either entirely hardware or a driver/bios problem.

...

- or is it due to misconfiguration in software raid? (done it with yast)

Unlikely.

...

- is it something software raid should correct (I expected that raid was for redundancy and corrected errors?)

Provided the raid controller doesn't crap out and trash the raid structure, yes. In this case, though, the software raid can't do anything if the controller can't access the drive correctly.

...

...
You could check the SMART parameters of the disks with smartmon tools.

These drives seem not to have smart, at least manufacturertools say so

Whoa! Very strange. Almost all recent disks support SMART.

Sandy

Leen de Braal

20:12

...

Leen de Braal wrote:

...
...
Telepathic guess: you might have a driver modules problem. Unfortunately you do not mention what version of Suse you use or what hardware is involved (mainboard, lan adapter, sata controller) or even if it is a hardware raid or software raid.

9.2, kernel is 2.6.8-24.17-default Intel D865PERL mobo with P4-2.4, 512MB RAM software raid, 2x 400GB Seagate SATA Machine has been running fine for over 6 weeks

Usually onboard chips of Intel are well supported by Linux. No special sata controller either. So, I don't think it's a driver problem.

...
...
I would try to check the logs shortly before the time you experienced a system freeze, try to find common causes for those freezes. Also try to provoke a system freeze (if it's not a production system in use).

Log gives errors: kernel: ata2: status=0x51 { DriveReady SeekComplete Error } kernel: ata2: error=0x40 { UncorrectableError }

I have seen errors like that when the mainboard bios couldn't support the disk size. The system would write data to a block that the bios remapped to a wrong block (48Bit LBA not supported by bios). The result was difficult to reproduce system freezes and data loss.

In any case I would check the Intel site if an update for your mainboard bios is available.

Good point, I will check this. I had not thought of it yet. Intel has version P21 now; P15 was running. Updated. There is nothing in the release notes about SATA or RAID, though.

...

For what it's worth, I don't think you did anything wrong while setting up the raid. I rather suspect the disks are unreliable or maybe the combination of Motherboard+ sata controller / sata disks is not well balanced. There is a remote possibility that the system might require a more stable power supply unit, but I haven't seen anything in your description to suggest such a high power demand.

The only software/configuration options are to set mainboard bios options more conservatively or play with hdparm.

It might be worth the time to check the technical documentation of the discs for MTBF.

...
Still have some questions: - are those errors due to harddrive problems?

Very likely. Either entirely hardware or a driver/bios problem.

...
- or is it due to misconfiguration in software raid? (done it with yast)

Unlikely.

...
- is it something software raid should correct (I expected that raid was for redundancy and corrected errors?)

Provided the raid controller doesn't crap out and trashes the raid structure. Though in this case the software raid can't do anything if the controller can't access the drive correctly.

...
...
You could check the SMART parameters of the disks with smartmon tools.

These drives seem not to have smart, at least manufacturertools say so

Whoa! Very strange. Almost all recent disks support smart.

Yes, they do. I just downloaded the latest version of SeaTools, and its SMART tests report no problems. I will try smartmontools to see more details. Thanks for the help so far.

...

Sandy

-- L. de Braal BraHa Systems NL - Terneuzen T +31 115 649333 F +31 115 649444

Carlos E. R.

20:36

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Friday 2005-09-02 at 19:29 +0200, Leen de Braal wrote:

...

9.2, kernel is 2.6.8-24.17-default

Your system is not up to date. The current kernel for 9.2 is "kernel-default-2.6.8-24.18". The security announcement mentions some work done on reiserfs; I don't know whether it applies to your case.

...

Intel D865PERL mobo with P4-2.4, 512MB RAM software raid, 2x 400GB Seagate SATA Machine has been running fine for over 6 weeks

SATA is so new that you might benefit from updating to 9.3 :-?

...

...
You could check the SMART parameters of the disks with smartmon tools.

These drives seem not to have smart, at least manufacturertools say so

Impossible. I can't believe a new drive would not have SMART support. Try the smartmontools from the distro. You can launch tests while online and read the results later. But first see the log:

smartctl -a /dev/hda |less

If there were HD errors, they should show there.

What the "manufacturertools" say, perhaps, is that it is not activated (in the BIOS, I suppose).

- --
Cheers,
Carlos Robinson
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFDGLfStTMYHG2NR9URAkZmAKCDnOIKs2xF0tjXu38gZh6RfgdpSQCeJnzw
2AiqSgHIX0WdcKrpLevGK2w=
=XOyL
-----END PGP SIGNATURE-----
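The smartctl steps described above (read everything, start a test while online, read the results later) can be wrapped in a small Python helper. This is only a sketch: the helper names `smartctl_cmd` and `run_smartctl` are made up here, and the device name is an assumption. The command above uses /dev/hda, but SATA disks handled by libata (the `ata2:` prefix in the log) usually show up as /dev/sdX and may need `-d ata`; whether the kernel in question passes SMART commands through at all is not confirmed in this thread.

```python
import subprocess

# Action name -> smartctl options (all are standard smartctl flags).
ACTIONS = {
    "all": ["-a"],                   # full report, as in the command above
    "health": ["-H"],                # overall health verdict only
    "short-test": ["-t", "short"],   # start a short self-test
    "selftest-log": ["-l", "selftest"],  # read self-test results later
}


def smartctl_cmd(device, action="all", device_type=None):
    """Build the smartctl argument list for one action on one device."""
    cmd = ["smartctl"]
    if device_type:
        cmd += ["-d", device_type]
    return cmd + ACTIONS[action] + [device]


def run_smartctl(device, action="all", device_type=None):
    """Run smartctl (needs root) and return its text output.

    smartctl uses non-zero exit codes to flag findings, so the
    return code is deliberately not treated as a failure here.
    """
    result = subprocess.run(
        smartctl_cmd(device, action, device_type),
        capture_output=True, text=True,
    )
    return result.stdout
```

For example, `run_smartctl("/dev/sda", "short-test", "ata")` would start a short self-test, and `run_smartctl("/dev/sda", "selftest-log", "ata")` a few minutes later would show whether it found read errors.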
