RAID degrade info?


Simon Roberts

10 Nov 2005

14:57

Hi All,

I suspect I have a heat problem or something similar, as one of the drives in my RAID-1 array drops out every few days. But right now, what I'd really like to know is what the md drivers noticed that caused them to drop the drive.

Is it possible to obtain information about what caused a RAID element to be dropped, or can I only find out the simple fact that this has happened? Should there be something in /var/log/messages, and if so, what do I look for and how do I interpret it?

TIA

Cheers, Simon

"You can tell whether a man is clever by his answers. You can tell whether a man is wise by his questions." Naguib Mahfouz


Carlos E. R.

12 Nov

12:24

New subject: [SLE] RAID degrade info?

The Thursday 2005-11-10 at 06:57 -0800, Simon Roberts wrote:

> Is it possible to obtain information about what caused a RAID element to be dropped, or can I only find out the simple fact that this has happened? Should there be something in /var/log/messages, and if so, what do I look for and how do I interpret it?

Assuming you have configured kernel messages to go to /var/log/messages or elsewhere, look for messages containing the string "md0", or whatever device your RAID is named. The word "raid" is also worth searching for:

Oct 10 20:26:36 nimrodel kernel: raid1: Disk failure on hdb11, disabling device.
Oct 10 20:26:36 nimrodel kernel: Operation continuing on 1 devices

(a simulated failure)

Also, you should look at the SMART log of the failed device.

--
Cheers, Carlos Robinson
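The log search described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_raid_events` and the default array name "md0" are assumptions, and the sample input is just the two kernel lines quoted in the message.

```python
import re


def find_raid_events(lines, array="md0"):
    """Return log lines that mention the array name or the word "raid".

    `array` defaults to "md0" only as an example; use your own device name.
    """
    pattern = re.compile(r"\b(?:%s|raid\w*)\b" % re.escape(array), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# The two kernel lines quoted in the message above, used as sample input.
SAMPLE = [
    "Oct 10 20:26:36 nimrodel kernel: raid1: Disk failure on hdb11, disabling device.",
    "Oct 10 20:26:36 nimrodel kernel: Operation continuing on 1 devices",
]

for hit in find_raid_events(SAMPLE):
    print(hit)
```

To scan a real log, pass an open file instead of `SAMPLE`, for example `open("/var/log/messages", errors="replace")` (reading it usually requires root). Only the first sample line is printed, because the follow-up "Operation continuing" line contains neither keyword, so it is worth reading the lines around each hit as well.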

Simon Roberts

15:08

Hmm, yes, that does show the drive being dropped. Thanks. Unfortunately, it still doesn't tell me why :(

Ah well, I'll work on the supposition that it's heat related for now and see if extra cooling or repositioning the drives helps.

Thanks for your help,

Cheers, Simon

--
Check the headers for your unsubscription address
For additional commands send e-mail to suse-linux-e-help@suse.com
Also check the archives at http://lists.suse.com
Please read the FAQs: suse-linux-e-faq@suse.com

Carlos E. R.

13 Nov

15:14

The Saturday 2005-11-12 at 07:08 -0800, Simon Roberts wrote:

> Hmm, yes, that does show the drive being dropped. Thanks. Unfortunately, it still doesn't tell me why :(
>
> Ah well, I'll work on the supposition that it's heat related for now and see if extra cooling or repositioning the drives helps.

But that would show in the S.M.A.R.T. log.

--
Cheers, Carlos Robinson

Simon Roberts

22:20

Hmm, well, maybe it should, but it doesn't. That was actually my first thought. But the SMART log is clean, and the drive tests out OK (long and short) and shows nothing that suggests any failure of any sort in its history.

And yet, it's been dropped, not once, but twice, by md...

#doo doo doo dooh!

Any other thoughts? (At least I know I'm not crazy any more :)

Cheers, Simon

Simon Roberts

23:11

Well, poking around a little more, I noticed something about the most recent failure (this is the third time this has happened in recent weeks; it happened 36 hours ago). The failure was reported by email at 19:10, at the "exact" same time as smartd sent test messages about both the RAID drives, and was in fact during system startup. There are _no_ messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or /dev/hde1 (the supposedly failed device). So now I'm wondering if the test message sent by smartd is somehow causing the device to be dropped. Makes no sense, but sometimes nonsense is what you have.

Actually, smartd doesn't appear to work: I wanted it to run short and long tests at intervals, but it utterly refuses to run these tests even though it doesn't complain about the format of the file. Given that, I think I'll kill smartd for now and just run tests manually for a while, unless, of course, the md array drops my drive again despite it testing out OK.

Any thoughts?

Cheers, Simon
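Checking what else was logged around the 19:10 report can be done by filtering the log on a time window rather than on keywords. The sketch below is one way to do that, assuming the traditional syslog prefix "Mon DD HH:MM:SS host ..." seen in the kernel lines quoted earlier in the thread; the function name `lines_near` and the five-minute default are illustrative choices.

```python
from datetime import datetime, timedelta


def lines_near(lines, when, minutes=5):
    """Yield syslog lines stamped within +/- `minutes` of `when`.

    Assumes the traditional "Mon DD HH:MM:SS host ..." prefix, which carries
    no year, so the year is borrowed from `when`.
    """
    window = timedelta(minutes=minutes)
    for line in lines:
        try:
            stamp = datetime.strptime(
                "%d %s" % (when.year, line[:15]), "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        if abs(stamp - when) <= window:
            yield line.rstrip("\n")
```

Called with an open log file and the time of the failure email, this lists every message in the window, whether or not it mentions /dev/md0 or /dev/hde1, which may reveal IDE or controller errors that a keyword search would miss. Month names are parsed with `%b`, so it assumes an English (C) locale.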

Carlos E. R.

14 Nov

11:04

The Sunday 2005-11-13 at 15:11 -0800, Simon Roberts wrote:

> Well, poking around a little more, I noticed something about the most recent failure (this is the third time this has happened in recent weeks; it happened 36 hours ago). The failure was reported by email at 19:10, at the "exact" same time as smartd sent test messages about both the RAID drives, and was in fact during system startup. There are _no_ messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or /dev/hde1 (the supposedly failed device). So now I'm wondering if the test message sent by smartd is somehow causing the device to be dropped. Makes no sense, but sometimes nonsense is what you have.

Weird. Have a look at /var/log/boot.msg, perhaps there is something there.

> Actually, smartd doesn't appear to work: I wanted it to run short and long tests at intervals, but it utterly refuses to run these tests even though it doesn't complain about the format of the file. Given that, I think I'll kill smartd for now and just run tests manually for a while, unless, of course, the md array drops my drive again despite it testing out OK.

What line are you using in /etc/smartd.conf? I have:

/dev/hda -H -f -l selftest -l error -C 197 -U 198 -m cer -s (S/../../2|4|6|7/21|L/../../5/22)

--
Cheers, Carlos Robinson

Simon Roberts

13:09

Hmm, on closer inspection, I'm wondering if the "failed at boot" idea is misguided. /var/log/boot.msg seems to suggest that it failed prior to that reboot, not during it. I'll look into this more. Thanks for pointing out the boot log file, I'd missed that one. Sigh!

My config for smartd is:

/dev/hde -H -l error -l selftest -s (S/../../2|4|6|7/02|L/../../1/01) -m simon,root -M test
/dev/hdg -H -l error -l selftest -s (S/../../2|4|6|7/02|L/../../1/01) -m simon,root -M test

I see some differences, but have to get my kids to school early so I'll be looking into them later.

Many thanks again for your continued help; it's much appreciated :)

Cheers, Simon
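One thing worth checking in the `-s` schedules posted in this thread is how the alternation groups. As far as I read the smartd.conf man page, a test runs only when all 12 characters of the string T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour) match the regular expression. The sketch below mimics that rule with Python's `re.fullmatch`, which for these simple patterns behaves like the POSIX extended syntax smartd uses; the helper names and the `GROUPED` variant are hypothetical, and this may or may not be the reason the short tests never show up in the drive's log.

```python
import re
from datetime import datetime


def schedule_string(test_type, when):
    """Build the 12-character T/MM/DD/d/HH string (d: 1 = Monday ... 7 = Sunday)."""
    return "%s/%02d/%02d/%d/%02d" % (
        test_type, when.month, when.day, when.isoweekday(), when.hour)


def would_run(regexp, test_type, when):
    """True if the whole schedule string matches, as the man page describes."""
    return re.fullmatch(regexp, schedule_string(test_type, when)) is not None


AS_POSTED = r"(S/../../2|4|6|7/02|L/../../1/01)"  # from the config posted above
GROUPED = r"(S/../../[2467]/02|L/../../1/01)"     # hypothetical variant

tuesday_2am = datetime(2005, 11, 15, 2)  # a Tuesday
monday_1am = datetime(2005, 11, 14, 1)   # a Monday

print(would_run(AS_POSTED, "S", tuesday_2am))  # False
print(would_run(GROUPED, "S", tuesday_2am))    # True
print(would_run(AS_POSTED, "L", monday_1am))   # True
```

Under that reading, `2|4|6|7` is not confined to the day-of-week field: the pattern splits into the alternatives `S/../../2`, `4`, `6`, `7/02` and `L/../../1/01`, and only the last can match a full 12-character string. A bracket expression such as `[2467]` keeps the choice inside the one field. The machine also has to be running at the scheduled hour for any test to start.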

Michael W Cocke

12:54

On Sun, 13 Nov 2005 15:11:20 -0800 (PST), you wrote:

> Well, poking around a little more, I noticed something about the most recent failure (this is the third time this has happened in recent weeks; it happened 36 hours ago). The failure was reported by email at 19:10, at the "exact" same time as smartd sent test messages about both the RAID drives, and was in fact during system startup. There are _no_ messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or /dev/hde1 (the supposedly failed device). So now I'm wondering if the test message sent by smartd is somehow causing the device to be dropped. Makes no sense, but sometimes nonsense is what you have.
>
> Actually, smartd doesn't appear to work: I wanted it to run short and long tests at intervals, but it utterly refuses to run these tests even though it doesn't complain about the format of the file. Given that, I think I'll kill smartd for now and just run tests manually for a while, unless, of course, the md array drops my drive again despite it testing out OK.

For what it's worth (not much, I fear), smartd is working here on a RAID-1. I don't have it start the self-tests (that machine is far too busy for me to lose a second), but I have it set to monitor everything but temperature. I was notified the day before I lost a drive, and I'm warned regularly that my seek time on one unit is a bit erratic.

Mike-

--
Mornings: Evolution in action. Only the grumpy will survive.

Please note - Due to the intense volume of spam, we have installed site-wide spam filters at catherders.com. If email from you bounces, try non-HTML, non-encoded, non-attachments.

Simon Roberts

13:13

That's good to know, thanks. It might well be monitoring the general stuff (I wouldn't know, as there have been no visible changes in the general info); however, I do know it's not running the tests. Well, even that might not be true. Here's my logic: when I run a test manually and subsequently use smartctl -a to look at the drive, I see the test logged, and the record is kept for, well, a long time :) However, I see no records of tests that could have been started by smartd. From that I infer that no tests are being started.

Anyway, thanks for the reassurance that it's doing something good. Maybe I won't kill it just yet! It'll probably turn out to be a config error that doesn't get caught as a syntax problem :( These failures are usually my fault in the end!

Cheers, Simon

Carlos E. R.

01:12

The Sunday 2005-11-13 at 14:20 -0800, Simon Roberts wrote:

> Any other thoughts? (At least I know I'm not crazy any more :)

No... sorry.

What about the "Reallocated_Sector_Ct" count in SMART? If it is not zero, it could explain it. Also, "Temperature_Celsius" could show if it is too warm.

--
Cheers, Carlos Robinson
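The two attributes named above can be pulled out of the attribute table that `smartctl -A` prints. The following is a rough sketch, assuming the usual ten-column layout (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value); the function names and the 50 C limit are illustrative assumptions, not values taken from the thread or from any vendor.

```python
def parse_smart_attributes(report):
    """Map attribute name -> raw value string from `smartctl -A` style text.

    Assumes the usual ten-column attribute table, where the first column is
    the numeric ID, the second the name, and the last the raw value.
    """
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attributes[fields[1]] = " ".join(fields[9:])
    return attributes


def leading_int(raw):
    """Return the first token of a raw value as an int, or None if it is not one."""
    token = raw.split()[0] if raw.split() else ""
    return int(token) if token.isdigit() else None


def worth_a_look(attributes, max_celsius=50):
    """Return notes on the two attributes mentioned above.

    The 50 C limit is an arbitrary illustrative threshold, not a vendor figure.
    """
    notes = []
    realloc = attributes.get("Reallocated_Sector_Ct")
    if realloc is not None and (leading_int(realloc) or 0) > 0:
        notes.append("reallocated sectors: " + realloc)
    temp = attributes.get("Temperature_Celsius")
    if temp is not None and (leading_int(temp) or 0) > max_celsius:
        notes.append("temperature: " + temp)
    return notes
```

Feed it the captured text of `smartctl -A /dev/hde` (the device name from this thread). Raw values are vendor-specific, so the temperature check in particular should be treated as a hint rather than a verdict.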
