Hi All,
I suspect I have a heat problem or something, as one of the drives in my RAID-1 array drops out every few days. But right now, what I'd really like to know is what the md drivers noticed that caused them to drop the drive.
Is it possible to obtain information about what caused a RAID element to be dropped, or can I only find out the simple fact that this has happened? Should there be something in /var/log/messages, and if so, what do I look for and how do I interpret it?
TIA
Cheers,
Simon
"You can tell whether a man is clever by his answers. You can tell whether a man is wise by his questions." Naguib Mahfouz
On Thursday 2005-11-10 at 06:57 -0800, Simon Roberts wrote:
> Is it possible to obtain information about what caused a RAID element to be dropped, or can I only find out the simple fact that this has happened? Should there be something in /var/log/messages, and if so, what do I look for and how do I interpret it?
Assuming you have set up kernel messages to go to /var/log/messages or elsewhere, look for messages containing the string "md0", or whatever your RAID device is named. The word "raid" is also worth searching for:
Oct 10 20:26:36 nimrodel kernel: raid1: Disk failure on hdb11, disabling device.
Oct 10 20:26:36 nimrodel kernel: Operation continuing on 1 devices
(a simulated failure)
Also, you should look at the SMART log of the failed device.
--
Cheers,
Carlos Robinson
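A minimal sketch of that search, in Python rather than grep, assuming the log is plain text with one message per line. The function name `raid_lines` and the keyword list are made up for illustration; the first two sample lines are the ones quoted above, and the third is an invented unrelated entry included only to show the filter rejecting something.

```python
import re

def raid_lines(lines, keywords=("md0", "raid", "hdb11")):
    """Return the log lines that mention any of the given keywords.

    `keywords` is a hypothetical default; substitute your own array
    and member device names.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

sample = [
    "Oct 10 20:26:36 nimrodel kernel: raid1: Disk failure on hdb11, disabling device.",
    "Oct 10 20:26:36 nimrodel kernel: Operation continuing on 1 devices",
    "Oct 10 20:27:01 nimrodel cron: (invented unrelated entry)",
]

for line in raid_lines(sample):
    print(line)
```

To run it against a real log, pass an open file instead of the list, for example `raid_lines(open("/var/log/messages"))`, and add the member device names (hde, hdg, or whatever yours are) to the keywords so that anything logged about those devices just before the drop shows up too.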
Hmm, yes, that does show the drive being dropped. Thanks.
Unfortunately, it still doesn't tell me why :(
Ah well, I'll work on the supposition that it's heat related for now
and see if extra cooling or repositioning the drives helps.
Thanks for your help,
Cheers,
Simon
On Saturday 2005-11-12 at 07:08 -0800, Simon Roberts wrote:
> Hmm, yes, that does show the drive being dropped. Thanks. Unfortunately, it still doesn't tell me why :(
> Ah well, I'll work on the supposition that it's heat related for now and see if extra cooling or repositioning the drives helps.
But that will show in the S.M.A.R.T. log.
--
Cheers,
Carlos Robinson
Hmm, well, maybe it should, but it doesn't. That was actually my first
thought. But the smart log is clean, the drive tests out ok (long and
short) and shows nothing that suggests any failure of any sort in its
history.
And yet, it's been dropped, not once, but twice, by md...
#doo doo doo dooh!
Any other thoughts? (At least I know I'm not crazy any more :)
Cheers,
Simon
Well, poking around a little more I noticed something about the most
recent failure (this is the third time this has happened in recent
weeks, and happened 36 hours ago). The failure was reported by email at
19:10, at the "exact" same time as smartd sent test messages about both
the raid drives and was, in fact, during system startup. There are _no_
messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or
/dev/hde1 (the supposedly failed device). So, now I'm wondering if the
test message sent by smartd is somehow causing the device to be
dropped. Makes no sense but sometimes nonsense is what you have.
Actually, smartd doesn't appear to actually work--I wanted it to run
short and long tests at intervals, but it utterly refuses to actually
run these tests even though it doesn't complain about the format of the
file. Given that, I think I'll kill smartd for now and just run tests
manually for a while, unless, of course, the md array drops my drive
again despite it testing out ok.
Any thoughts?
Cheers,
Simon
On Sunday 2005-11-13 at 15:11 -0800, Simon Roberts wrote:
> Well, poking around a little more I noticed something about the most recent failure (this is the third time this has happened in recent weeks, and happened 36 hours ago). The failure was reported by email at 19:10, at the "exact" same time as smartd sent test messages about both the raid drives and was, in fact, during system startup. There are _no_ messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or /dev/hde1 (the supposedly failed device). So, now I'm wondering if the test message sent by smartd is somehow causing the device to be dropped. Makes no sense but sometimes nonsense is what you have.
Weird. Have a look at /var/log/boot.msg; perhaps there is something there.
> Actually, smartd doesn't appear to actually work--I wanted it to run short and long tests at intervals, but it utterly refuses to actually run these tests even though it doesn't complain about the format of the file. Given that, I think I'll kill smartd for now and just run tests manually for a while, unless, of course, the md array drops my drive again despite it testing out ok.
What line are you using in /etc/smartd.conf? I have:
/dev/hda -H -f -l selftest -l error -C 197 -U 198 -m cer -s (S/../../2|4|6|7/21|L/../../5/22)
--
Cheers,
Carlos Robinson
Hmm, on closer inspection, I'm wondering if the "failed at boot" idea
is misguided. /var/log/boot.msg seems to suggest that it failed prior
to that reboot, not during it. I'll look into this more. Thanks for
pointing out the boot log file; I'd missed that one. Sigh!
My config for smartd is:
/dev/hde -H -l error -l selftest -s (S/../../2|4|6|7/02|L/../../1/01) -m simon,root -M test
/dev/hdg -H -l error -l selftest -s (S/../../2|4|6|7/02|L/../../1/01) -m simon,root -M test
I see some differences, but have to get my kids to school early so I'll
be looking into them later.
Many thanks again for your continued help; it's much appreciated :)
Cheers,
Simon
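One thing that may be worth checking in both -s patterns posted in this thread: as far as I understand the smartd.conf man page, a test is started only when the whole 12-character string T/MM/DD/d/HH (test type, month, day, weekday with Monday = 1, hour) matches the regular expression. In an extended regular expression, an ungrouped `2|4|6|7` splits the entire pattern into separate alternatives rather than listing weekdays. The sketch below uses Python's `re.fullmatch` as a stand-in for smartd's matching, which is an assumption, and the helper names `schedule_string` and `would_run` are hypothetical.

```python
import re
from datetime import datetime

def schedule_string(test_type, when):
    """Build the T/MM/DD/d/HH string, with d = 1 (Monday) .. 7 (Sunday)."""
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"

def would_run(pattern, test_type, when):
    # Assumption: the whole string must match, so fullmatch stands in
    # for whatever smartd does internally with its POSIX regex.
    return re.fullmatch(pattern, schedule_string(test_type, when)) is not None

as_posted = "(S/../../2|4|6|7/02|L/../../1/01)"      # pattern from the thread
grouped = "(S/../../(2|4|6|7)/02|L/../../1/01)"      # weekdays grouped

tuesday_2am = datetime(2005, 11, 15, 2, 30)          # a Tuesday, hour 02
monday_1am = datetime(2005, 11, 14, 1, 0)            # a Monday, hour 01

print(schedule_string("S", tuesday_2am))
print("as posted, short test:", would_run(as_posted, "S", tuesday_2am))
print("grouped, short test:  ", would_run(grouped, "S", tuesday_2am))
print("as posted, long test: ", would_run(as_posted, "L", monday_1am))
```

If smartd really behaves like the stand-in, the posted pattern would still start the long test on Mondays at 01:00 but never the short ones, and grouping the weekdays as `(2|4|6|7)` or `[2467]` would fix that. Separately, `-M test` asks smartd to send a test mail each time it starts, which would fit mails arriving during boot.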
On Sun, 13 Nov 2005 15:11:20 -0800 (PST), you wrote:
> Well, poking around a little more I noticed something about the most recent failure (this is the third time this has happened in recent weeks, and happened 36 hours ago). The failure was reported by email at 19:10, at the "exact" same time as smartd sent test messages about both the raid drives and was, in fact, during system startup. There are _no_ messages in the 19:10 timeframe in /var/log/messages about /dev/md0 or /dev/hde1 (the supposedly failed device). So, now I'm wondering if the test message sent by smartd is somehow causing the device to be dropped. Makes no sense but sometimes nonsense is what you have.
> Actually, smartd doesn't appear to actually work--I wanted it to run short and long tests at intervals, but it utterly refuses to actually run these tests even though it doesn't complain about the format of the file. Given that, I think I'll kill smartd for now and just run tests manually for a while, unless, of course, the md array drops my drive again despite it testing out ok.
For what it's worth (not much, I fear), smartd is working here on a RAID-1. I don't have it start the self-tests - that machine is far too busy for me to lose a second - but I have it set to monitor everything but temperature. I was notified the day before I lost a drive, and I'm warned regularly that my seek time on one unit is a bit erratic.
Mike
--
Mornings: Evolution in action. Only the grumpy will survive.
That's good to know, thanks. It might well be monitoring the general
stuff--I wouldn't know about that as there have been no visible changes
in the general info--however, I do know it's not running the tests.
Well, even that might not be true: here's my logic. When I run a test
manually, and subsequently use smartctl -a to look at the drive, I see
the test logged and the record is kept for, well, a long time :)
However, I see no records of tests that could have been started with
smartd. From that I infer that no tests are being started.
Anyway, thanks for the reassurance that it's doing something good.
Maybe I won't kill it just yet! It'll probably turn out to be a config
error that doesn't get caught as a syntax problem :( -- these failures
are usually my fault in the end!
Cheers,
Simon
On Sunday 2005-11-13 at 14:20 -0800, Simon Roberts wrote:
> Any other thoughts? (At least I know I'm not crazy any more :)
No... sorry.
What about the "Reallocated_Sector_Ct" count in SMART? If it is not zero, that could explain it. Also, "Temperature_Celsius" would show whether the drive is running too warm.
--
Cheers,
Carlos Robinson
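A rough sketch of pulling those two attributes out of the vendor attribute table that `smartctl -A` prints. The sample rows are invented for illustration and are not from Simon's drives, the column layout is assumed to be the usual ten-column one, and `parse_attributes` is a hypothetical helper.

```python
def parse_attributes(text):
    """Map attribute name -> first token of the raw value column."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit check.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs

# Invented sample values, laid out like a `smartctl -A` attribute table.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38
"""

attrs = parse_attributes(sample)
reallocated = int(attrs["Reallocated_Sector_Ct"])
temperature = int(attrs["Temperature_Celsius"])
print("reallocated sectors:", reallocated)
print("temperature (raw):  ", temperature)
```

On a real system the text would come from running `smartctl -A /dev/hde` (as root) instead of the sample string. Not every drive reports a temperature attribute, so a missing key is possible.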
participants (3): Carlos E. R., Michael W Cocke, Simon Roberts