Re: [SLE] raid1: how to check for filesystem errors?

2 Sep 2005

      Leen de Braal wrote:
...
...
Telepathic guess: you might have a driver modules problem. Unfortunately
you do not mention what version of Suse you use or what hardware is
involved (mainboard, lan adapter, sata controller) or even if it is a
hardware raid or software raid.
9.2, kernel is 2.6.8-24.17-default
Intel D865PERL mobo with P4-2.4, 512MB RAM
software raid, 2x 400GB Seagate SATA
Machine has been running fine for over 6 weeks
Intel's onboard chips are usually well supported by Linux, and there is no
special SATA controller either. So I don't think it's a driver problem.
...
...
I would try to check the logs shortly before the time you experienced a
system freeze, try to find common causes for those freezes. Also try to
provoke a system freeze (if it's not a production system in use).
Log gives errors:
kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
kernel: ata2: error=0x40 { UncorrectableError }
I have seen errors like that when the mainboard BIOS couldn't support the 
disk size. The system would write data to a block that the BIOS remapped 
to a wrong block (48-bit LBA not supported by the BIOS). The result was 
hard-to-reproduce system freezes and data loss.

In any case I would check the Intel site if an update for your mainboard 
bios is available.
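
To find out when and how often the ata2 errors quoted above occur, the log can be scanned mechanically. The following is only a minimal Python sketch: the function name find_ata_errors and the regular expression are my own, written against the two log lines shown in this thread, so other kernel versions may format these messages differently.

```python
import re

# Matches libata lines such as:
#   kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
# The pattern is an assumption derived from the two lines quoted in this thread.
ATA_LINE = re.compile(
    r"kernel: (?P<port>ata\d+): (?P<kind>status|error)=(?P<code>0x[0-9a-fA-F]+) "
    r"\{ (?P<flags>[^}]*)\}"
)


def find_ata_errors(lines):
    """Return (port, kind, code, flags) tuples for each ATA status/error line."""
    hits = []
    for line in lines:
        m = ATA_LINE.search(line)
        if m:
            hits.append(
                (m["port"], m["kind"], int(m["code"], 16), m["flags"].split())
            )
    return hits


# The two log lines from this thread, plus one unrelated line.
sample = [
    "kernel: ata2: status=0x51 { DriveReady SeekComplete Error }",
    "kernel: ata2: error=0x40 { UncorrectableError }",
    "-- MARK --",
]

for port, kind, code, flags in find_ata_errors(sample):
    print(port, kind, hex(code), flags)
```

On a real system you would pass an open log file instead of the sample list, for example open("/var/log/messages"); that path is an assumption about where syslog writes on your installation.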
...
This was the first freeze. Hung hard, I pulled power to reboot the
machine. After that, it ran for 3 days and gave same errors, this time
five pairs of those lines in about 10 minutes, but it kept on logging
(cronjobs and MARKs) until I rebooted again. I think it already hung,
because I could not log in, I could just type in my username, but machine
did not respond with Password:-prompt.
Today I have been testing this machine further, and at some point I was
running a rescue system, and saw that the raid was being resynced (cat
/proc/mdstat). I let it run until it finished, and now it seems to run
fine (at least for the last 2 hours).
Will keep it monitored though. This machine is my first with raid, so bear
with me seeming stupid about this.
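
Since the machine will be monitored, the /proc/mdstat check mentioned above could be automated roughly like this. This is a sketch under assumptions: parse_mdstat is a hypothetical helper, and the sample text is an illustrative mdstat layout, not output from the machine discussed here.

```python
import re


def parse_mdstat(text):
    """Map each md array to its degraded flag and any resync/recovery progress."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+) : ", line)
        if head:
            current = head.group(1)
            arrays[current] = {"degraded": False, "activity": None, "percent": None}
            continue
        if current is None:
            continue
        # Member status such as [UU] (all members up) or [U_] (one missing).
        status = re.search(r"\[([U_]+)\]", line)
        if status:
            arrays[current]["degraded"] = "_" in status.group(1)
        # Progress line such as "resync =  8.5%" or "recovery = 12.0%".
        progress = re.search(r"(resync|recovery)\s*=\s*([\d.]+)%", line)
        if progress:
            arrays[current]["activity"] = progress.group(1)
            arrays[current]["percent"] = float(progress.group(2))
    return arrays


# Illustrative content only; device names and sizes are made up.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      390708736 blocks [2/2] [UU]
      [=>...................]  resync =  8.5% (33333333/390708736) finish=120.0min speed=50000K/sec
md1 : active raid1 sda2[0]
      1048512 blocks [2/1] [U_]
unused devices: <none>
"""

for name, state in parse_mdstat(SAMPLE).items():
    print(name, state)
```

To use it on a live system, read the real file with open("/proc/mdstat").read() and pass the result to parse_mdstat; a cron job could then mail a warning whenever an array is degraded or resyncing.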
For what it's worth, I don't think you did anything wrong while setting up 
the raid. I rather suspect the disks are unreliable, or maybe the 
combination of motherboard, SATA controller and SATA disks is not well 
matched.
There is a remote possibility that the system might require a more stable 
power supply unit, but I haven't seen anything in your description to 
suggest such a high power demand.

The only software/configuration options are to set mainboard bios options 
more conservatively or play with hdparm.

It might be worth the time to check the technical documentation of the 
disks for their MTBF figures.
...
Still have some questions:
- are those errors due to harddrive problems?
Very likely. Either entirely hardware or a driver/bios problem.
...
- or is it due to misconfiguration in software raid? (done it with yast)
Unlikely.
...
- is it something software raid should correct (I expected that raid was
for redundancy and corrected errors?)
Provided the raid controller doesn't crap out and trash the raid 
structure. In this case, though, the software raid can't do anything if the 
controller can't access the drive correctly.
...
...
You could check the SMART parameters of the disks with smartmontools.
These drives seem not to have SMART, at least the manufacturer's tools say so
Whoa! Very strange. Almost all recent disks support SMART.
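
To double-check what smartmontools itself reports about SMART support, something like the following could be used. It is a sketch only: smart_support and query_drive are hypothetical helper names, and the sample text assumes the usual "SMART support is:" wording of smartctl -i output, which may differ between versions.

```python
import subprocess


def smart_support(info_text):
    """Collect the values of all 'SMART support is:' lines from smartctl -i output."""
    values = []
    for line in info_text.splitlines():
        if line.startswith("SMART support is:"):
            values.append(line.split(":", 1)[1].strip())
    return values


def query_drive(device):
    """Run smartctl -i on a device; needs root and smartmontools installed."""
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    )
    return smart_support(result.stdout)


# Illustrative output only, not taken from the drives in this thread.
SAMPLE_INFO = """\
Device Model:     EXAMPLE DISK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

print(smart_support(SAMPLE_INFO))
```

An empty list from query_drive("/dev/sda") would mean smartctl printed no SMART support line at all for that device. The device name is an assumption, and SATA disks behind some drivers may need smartctl's -d ata device-type option before they answer.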

Sandy

Sandy Drobic