[opensuse] SMART daemon question

Roger Oberholtzer

30 Oct 2008

10:36

I have to ask, even though I know the probable answer: I am getting the following log messages on a machine running 11.0. The disk is an IDE disk. Could there be any software issues that cause this message to be given when it should not? Could I have something configured wrong? Or should I trust that a disk is becoming an ex-disk?

Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 50 to 49
Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50
Device: /dev/sda, SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 91 to 1
Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 59 to 56
Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 42
Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 59 to 56
Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 49 to 48
Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 49 to 48

I'm not totally familiar with the SMART tools more than these types of messages. Is there a command that could provide more information on what is going on?

Thanks for any help.

--
Roger Oberholtzer
OPQ Systems / Ramböll RST
Ramböll Sverige AB
Kapellgränd 7
P.O. Box 4205
SE-102 65 Stockholm, Sweden
Office: Int +46 8-615 60 20
Mobile: Int +46 70-815 1696

And remember: It is RSofT and there is always something under construction. It is like talking about a large city with all constructions finished. Not impossible, but very unlikely.

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org

Josef Reidinger

30 Oct

11:05

Roger Oberholtzer wrote:

> I'm not totally familiar with the SMART tools more than these types of messages. Is there a command that could provide more information on what is going on?

The command is smartctl -a `disk you want`. You can find what every line means on the internet. (If something changes, it needn't mean that the disk is going to heaven.)

JR

John Andersen

18:52

Josef Reidinger wrote:

> command is smartctl -a `disk you want`
> (if something changes, it needn't mean that the disk is going to heaven)

It's been my experience that all disks go to hell, and sometimes in a handbasket.

Per Jessen

13:34

Roger Oberholtzer wrote:

> I'm not totally familiar with the SMART tools more than these types of messages. Is there a command that could provide more information on what is going on?

Try running a self-test on the drive: smartctl -t short /dev/hdx. Then display the results with: smartctl -a /dev/hdx

--
/Per Jessen, Zürich

Roger Oberholtzer

14:34

On Thu, 2008-10-30 at 14:34 +0100, Per Jessen wrote:

> Try running a selftest on the drive: smartctl -t short /dev/hdx. Then display the results with: smartctl -a /dev/hdx

Is this test non-destructive in the event of a problem? Can it be done when the disk is being used? The man page implies the 'short' test can be done during normal operation. Anyone done that and lived to tell?

It is the root disk on a system. It would have to be, no? As such, I am exercising extreme caution.

--
Roger Oberholtzer

Carlos E. R.

20:34

On Thursday, 2008-10-30 at 15:34 +0100, Roger Oberholtzer wrote:

> On Thu, 2008-10-30 at 14:34 +0100, Per Jessen wrote:
>> Try running a selftest on the drive: smartctl -t short /dev/hdx. Then display the results with: smartctl -a /dev/hdx
>
> Is this test non-destructive in the event of a problem?

Yes. It is read only, AFAIK.

> Can it be done when the disk is being used?

Yes.

> The man page implies the 'short' test can be done during normal operation. Anyone done that and lived to tell?

Me. The short and the long. It is designed for such use.

However... Recent disks (the test is run by the disk itself, not the computer) run a surface test as part of the long test; older disks, I think, did not. This means the disk will be very busy examining itself, and thus very unresponsive, during that phase. Mine looks as if it has hung, but it has not. Don't power off or reset the machine until the test ends. It is better to stop some tasks and daemons during that test (mail, for instance).

> It is the root disk on a system. It would have to be, no? As such, I am exercising extreme caution.

Running those tests periodically is generally considered a good thing.

The messages you saw I don't think are important. Notice that if the temperature changes just one degree from one check to the next, it is reported. This is absurd, IMO. Also the disk is continuously having minute read errors when reading, and correcting them. This is expected and normal. Only when this error rate goes up consistently you have to worry.

How do you know when the values are important? I don't know! :-) But simply use --health and the program should tell you.

--
Cheers,
Carlos E. R.

Don Raboud

20:48

On Thursday 30 October 2008 02:34:37 pm Carlos E. R. wrote:

> The messages you saw I don't think are important.

I agree, with the exception of this one:

> Device: /dev/sda, SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 91 to 1

The others were small changes, but this one is quite large, and it appears something might have really changed. I would follow Per's advice here.
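The distinction between small and large changes can be checked mechanically. What follows is only a rough sketch, not part of smartmontools: it parses "changed from X to Y" lines in the format quoted in this thread and flags those whose normalized value moved by at least a given amount. The names parse_changes and large_changes and the limit of 10 are illustrative assumptions.

```python
import re

# Matches smartd messages of the form quoted in this thread.
LINE_RE = re.compile(
    r"Device: (?P<device>[^,\s]+), SMART (?:Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_changes(log_text):
    """Return (device, attribute id, name, old, new, delta) for each message."""
    changes = []
    for m in LINE_RE.finditer(log_text):
        old, new = int(m["old"]), int(m["new"])
        changes.append((m["device"], int(m["id"]), m["name"], old, new, new - old))
    return changes


def large_changes(changes, limit=10):
    """Keep changes whose absolute delta is at least `limit` (hypothetical cutoff)."""
    return [c for c in changes if abs(c[5]) >= limit]


# Four of the messages from the original post.
LOG = """\
Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 50 to 49
Device: /dev/sda, SMART Usage Attribute: 199 UDMA_CRC_Error_Count changed from 91 to 1
Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 59 to 56
Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 42
"""

for device, attr_id, name, old, new, delta in large_changes(parse_changes(LOG)):
    print(f"{device} {attr_id} {name}: {old} -> {new} ({delta:+d})")
# prints: /dev/sda 199 UDMA_CRC_Error_Count: 91 -> 1 (-90)
```

With these sample lines, only the UDMA_CRC_Error_Count message exceeds the cutoff; the other three moved by 1 or 3.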

> Notice that if the temperature changes just one degree from one check to the next, it is reported. This is absurd, IMO.

That is the default, but you can change that behavior if you wish.

-- Don

Andrew Joakimsen

23:58

On Thu, Oct 30, 2008 at 4:34 PM, Carlos E. R. wrote:

> Running those tests periodically is generally considered a good thing.

I agree, but I don't fully trust them. Sometimes they will say a drive is OK but you can hear the motor is on its last legs.

> Also the disk is continuously having minute read errors when reading, and correcting them. This is expected and normal. Only when this error rate goes up consistently you have to worry.

Read or write errors are *NOT* normal. If the OP is seeing messages such as Raw_Read_Error_Rate or Hardware_ECC_Recovered every few minutes, or even once an hour, I would suspect the disk is going bad.

Here is my test for replacing a drive that is seeing any errors, making odd noises, etc. I look at the cost of the disk (usually < 100 USD). If the data on the drive is worth more than the drive itself, or if the consequences of the drive failing (the system being down until a replacement can be sourced and the OS reinstalled) are worse than making an image during off-peak times and replacing the drive, I replace it.

I figure the total replacement cost might be USD 150 (probably a little less), including the drive, imaging the data and sending a technician to replace it. The cost of replacing the drive one morning when the user cannot boot their system (and is not able to do their job) is at least triple that, probably even ten times that.
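The arithmetic behind that rule of thumb can be written out. The sketch below uses only the figures given above (USD 150 for a planned replacement, three to ten times that for an unplanned one). The break-even probability is an added assumption, a simple expected-cost framing that is not part of the original argument.

```python
PLANNED_USD = 150  # planned replacement: drive, imaging, technician visit


def unplanned_cost_range(planned=PLANNED_USD, low_factor=3, high_factor=10):
    """Cost of an unplanned failure, using the 'triple' and 'ten times' factors."""
    return planned * low_factor, planned * high_factor


def break_even_probability(planned, unplanned):
    """Failure probability above which replacing early is cheaper on average.

    Assumption (not from the thread): compare the planned cost with the
    expected cost of waiting, probability * unplanned cost.
    """
    return planned / unplanned


low, high = unplanned_cost_range()
print(low, high)                                          # prints: 450 1500
print(f"{break_even_probability(PLANNED_USD, low):.2f}")   # prints: 0.33
print(f"{break_even_probability(PLANNED_USD, high):.2f}")  # prints: 0.10
```

Under that framing, early replacement pays off once the chance of failure is judged to be above roughly one in three at the low estimate, or one in ten at the high estimate.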

Carlos E. R.

31 Oct

00:27

On Thursday, 2008-10-30 at 19:58 -0400, Andrew Joakimsen wrote:

>> Also the disk is continuously having minute read errors when reading, and correcting them. This is expected and normal. Only when this error rate goes up consistently you have to worry.
>
> Read or write errors are *NOT* normal.

Yes, they are. The magnetization is so small that signals are near the noise level; the drives use error correction codes to yield clean, safe data. This is different from a sector error.

--
Cheers,
Carlos E. R.

Matthias Bach

09:51

Hi!

On Thursday, 30 October 2008 21:34, Carlos E. R. wrote:

> Notice that if the temperature changes just one degree from one check to the next, it is reported. This is absurd, IMO.

AFAIK it reports a change of 1 in the normalized Temperature value; that does not have to mean a change of one degree.

Regards,
Matthias

--
Matthias Bach
www.marix.org

Carlos E. R.

11:29

On Friday, 2008-10-31 at 10:51 +0100, Matthias Bach wrote:

> On Thursday, 30 October 2008 21:34, Carlos E. R. wrote:
>> Notice that if the temperature changes just one degree from one check to the next, it is reported. This is absurd, IMO.
>
> AFAIK it reports a change of 1 in the normalized Temperature value; that does not have to mean a change of one degree.

]> Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 42

Huh? "Celsius" are temperature degrees in the Celsius scale.

--
Cheers,
Carlos E. R.

Matthias Bach

13:03

Hi!

On Friday, 31 October 2008 12:29, Carlos E. R. wrote:

> On Friday, 2008-10-31 at 10:51 +0100, Matthias Bach wrote:
>> AFAIK it reports a change of 1 in the normalized Temperature value; that does not have to mean a change of one degree.
>
> ]> Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 43 to 42
>
> Huh? "Celsius" are temperature degrees in the Celsius scale.

Ah, I see. I have only seen it report changes of values in the VALUE field, but not in the RAW_VALUE field, and VALUE is normalized.

Oct 30 21:35:22 tesla00 smartd[3843]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 114 to 115

versus

ID# ATTRIBUTE_NAME       FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius  0x0022  114   104   000    Old_age  Always   -           33

Regards,
Matthias
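To make the VALUE versus RAW_VALUE distinction concrete, here is a minimal sketch that splits one attribute row, as printed in the table above, into named fields. The Attribute tuple and the function names are illustrative assumptions. The failure rule used (normalized value less than or equal to the threshold) is the one described in the smartctl man page.

```python
from collections import namedtuple

# Field names follow the column headers of the attribute table.
Attribute = namedtuple(
    "Attribute",
    "id name flag value worst thresh type updated when_failed raw_value",
)


def parse_attribute_row(row):
    """Split one attribute row into typed fields."""
    # The raw value may itself contain spaces, so split at most 9 times.
    f = row.split(None, 9)
    return Attribute(
        id=int(f[0]), name=f[1], flag=f[2],
        value=int(f[3]), worst=int(f[4]), thresh=int(f[5]),
        type=f[6], updated=f[7], when_failed=f[8], raw_value=f[9],
    )


def has_failed(attr):
    """An attribute counts as failed when VALUE <= THRESH."""
    return attr.value <= attr.thresh


# The row quoted above.
ROW = "194 Temperature_Celsius 0x0022 114 104 000 Old_age Always - 33"
attr = parse_attribute_row(ROW)
print(attr.value, attr.raw_value, has_failed(attr))  # prints: 114 33 False
```

For this drive the log message tracks the normalized 114, while the raw value is 33.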

Per Jessen

08:29

Roger Oberholtzer wrote:

> On Thu, 2008-10-30 at 14:34 +0100, Per Jessen wrote:
>> Try running a selftest on the drive: smartctl -t short /dev/hdx. Then display the results with: smartctl -a /dev/hdx
>
> Is this test non-destructive in the event of a problem? Can it be done when the disk is being used? The man page implies the 'short' test can be done during normal operation. Anyone done that and lived to tell?

Certainly - I run a short test every day, and a long test on Sundays.

/Per
