Re: [opensuse] SMART Prefailure

18 Mar 2009

      Khawar Nehal wrote:
...
Your harddisk is about to fail. Make a backup and replace it ASAP.
Carlos E. R. wrote:
...
On Saturday, 2009-03-07 at 20:35 +0100, Sascha 'saigkill' Manns wrote:
...
Hello mates,
today I found the following in my logdigest:
Messages matching keywords in the "alarming" list:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
(1 lines)
Mar  7 12:01:52 linux-eh47 smartd[3228]: Device: /dev/sdb [SAT], SMART Prefailure Attribute: 10 Spin_Retry_Count changed from 231 to 232
What does that tell me?
A priori, only that a value has changed.
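If you want to pull the pieces out of such a log line (device, attribute number, name, old and new value), a small script can do it. This is only a rough sketch: the regular expression is my own guess based on the one message quoted above, the function name `parse_change` is made up, and it only handles the "Prefailure" and "Usage" wordings.

```python
import re

# Pattern modelled on the single smartd message quoted in this thread.
# Assumption: other smartd message variants are not covered.
CHANGE_RE = re.compile(
    r"Device: (?P<device>\S+).*?"
    r"SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<attr_id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_change(line):
    """Return the fields of an 'attribute changed' message, or None."""
    match = CHANGE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared directly.
    for key in ("attr_id", "old", "new"):
        fields[key] = int(fields[key])
    return fields


line = (
    "Mar  7 12:01:52 linux-eh47 smartd[3228]: Device: /dev/sdb [SAT], SMART "
    "Prefailure Attribute: 10 Spin_Retry_Count changed from 231 to 232"
)
change = parse_change(line)
print(change["device"], change["name"], change["old"], "->", change["new"])
```

As said, this only tells you that the value moved by one step, nothing more.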
Make sure that smartd is configured to send you email on alarms.
The configuration file is "/etc/smartd.conf". I'm not sure it is
configured to do so by default. Then you can heed those warnings and
ignore logdigest.
Now, to see the full SMART information for the disk, run as root:
smartctl -a /dev/sdb
I think you have to examine the column "WHEN_FAILED". You will see
that one of the lines is similar to the one you saw in the logdigest.
Some of the attributes are "Pre-fail", which means they warn about
impending failure, and others are "Old_age". There was a thread recently
about this with links to more info.
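To check the "WHEN_FAILED" column without reading the table by eye, something like the following could work. It is a sketch only: it assumes the usual ten-column attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), the names `SmartAttribute`, `parse_attribute_row` and `has_failed` are invented for the example, and the sample row is the Power_On_Hours line quoted later in this thread.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flag: str
    value: int
    worst: int
    thresh: int
    attr_type: str    # "Pre-fail" or "Old_age"
    updated: str
    when_failed: str  # "-" when the attribute has not failed
    raw_value: str


def parse_attribute_row(row):
    """Split one attribute table row into its ten columns."""
    # maxsplit=9 keeps a RAW_VALUE that itself contains spaces in one piece.
    parts = row.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"expected 10 columns, got {len(parts)}")
    return SmartAttribute(
        int(parts[0]), parts[1], parts[2],
        int(parts[3]), int(parts[4]), int(parts[5]),
        parts[6], parts[7], parts[8], parts[9].strip(),
    )


def has_failed(attr):
    """True if the WHEN_FAILED column holds anything other than '-'."""
    return attr.when_failed != "-"


SAMPLE_ROW = (
    "  9 Power_On_Hours          0x0032   200   200   000"
    "    Old_age   Always       -       58133"
)
attr = parse_attribute_row(SAMPLE_ROW)
print(attr.name, attr.attr_type, "failed" if has_failed(attr) else "ok")
```

Keeping the attribute type next to the WHEN_FAILED flag lets you treat a failed "Pre-fail" attribute differently from a failed "Old_age" one.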
The command "smartctl -H /dev/sdb" will quickly tell you how things are.
You could also run the short disk self test:
smartctl --test=short /dev/sdb
and when it finishes, the long one:
smartctl --test=long /dev/sdb
You can look at the test results with "smartctl -a /dev/sdb" (and
other combinations, see the manual - it's good, for a change :-) )
-- Cheers,
       Carlos E. R.
Well,

	You can take some of what smartd says with a grain of salt, because many of the
reported errors do not signal impending doom. I have an old Dell box with a 100G
drive that SMART believes *has been* dying for the past 6 years.

	For example, if you read part of the output, it looks bad:

SMART Error Log Version: 1
ATA Error Count: 79 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

	Looking at the 79th error, I see:

Error 79 occurred at disk power-on lifetime: 14727 hours (613 days + 15 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  57 4a 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a0 00 00 00 00 00 a0 00      05:00:29.671  PACKET
  a0 00 04 42 ed e0 ea 00      05:00:24.914  PACKET
  a0 00 00 00 00 00 e0 00      05:00:24.912  PACKET
  a0 00 00 00 00 00 a0 00      05:00:24.904  PACKET
  a0 03 08 00 00 00 a0 00      05:00:24.898  PACKET

	The next logical question is, "Is this a failure of recent onset that I should be
concerned about?" We know it happened at 14727 hours (613 days + 15 hours) into
the drive's lifetime, so where are we now? Looking at the current output gives us
the answer:

  9 Power_On_Hours          0x0032   200   200   000    Old_age   Always       -       58133

	We are now at 58133 hours. So the last SMART error occurred 43406 power-on hours
ago (roughly 1809 days, or 4 years 348.6 days). Regardless, the smartd error keeps
appearing in the logs every day. I think I'm safe with this one. What does the
time correlation show on your drive?
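	If you want to do the same time correlation for your own drive, the arithmetic is easy to script. This is just a sketch using my numbers from above; the function names are made up, the year is taken as 365 days, and remember these are powered-on hours, not calendar time.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def hours_since_last_error(power_on_hours, error_at_hours):
    """Power-on hours elapsed between the logged error and now."""
    if error_at_hours > power_on_hours:
        raise ValueError("error timestamp is later than current power-on hours")
    return power_on_hours - error_at_hours


def as_years_and_days(hours):
    """Convert a number of hours into whole years plus leftover days."""
    days = hours / HOURS_PER_DAY
    years = int(days // DAYS_PER_YEAR)
    return years, days - years * DAYS_PER_YEAR


# Values from this post: current Power_On_Hours and the lifetime of error 79.
elapsed = hours_since_last_error(58133, 14727)
years, days = as_years_and_days(elapsed)
print(f"{elapsed} power-on hours ago (about {years} years {days:.1f} days)")
```

	Plug in the raw Power_On_Hours value and the "power-on lifetime" of the newest entry in your error log; a small difference would be more worrying than a large one.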

-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com