Mailinglist Archive: opensuse (4020 mails)
Re: [SLE] Re: {SLE} SMART HDD technology was Re: [SLE] e2fsck command
- From: "Carlos E. R." <robin1.listas@xxxxxxxxxx>
- Date: Mon, 18 Oct 2004 13:07:25 +0200 (CEST)
- Message-id: <Pine.LNX.4.58.0410181231450.9196@xxxxxxxxxxxxxxxx>
Warning: some email lines way longer than 72 chars.
On Friday 2004-10-15 at 16:44 +0200, Hylton Conacher (ZR1HPC) wrote:
> Tnx for the info and sorry for the belated reply. I have enabled smartctl to
> perform checks by issuing the command, as root:
>
> smartctl -s on /dev/hdb and smartctl -o /dev/hdb
>
> and hope that will be partly sufficient. The next step is not to worry about
> the HDD physical structure but to concentrate on making sure the data
> structure is error ie no bad blocks etc ie the e2fsck cmd.
I see you are still confused. OK, I'll try to explain a bit more. The
commands you quote above are not needed; I have never used them. What I do is:
Manual testing:
-----------------------------
Launching a short test:
nimrodel:~ # smartctl --test=short /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in
off-line mode".
Drive command "Execute SMART Short self-test routine immediately in
off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Use smartctl -X to abort test.
After about one minute, I can see the results, using this command:
nimrodel:~ # smartctl --log=selftest /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed  00%        5069             -
# 2  Short off-line      Completed  00%        4424             -
# 3  Short off-line      Completed  00%        1868             -
# 4  Short off-line      Completed  00%        345              -
Entry #1 is the most recent test (notice the LifeTime column). To launch
the complete test, I do:
nimrodel:~ # smartctl --test=long /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in
off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in
off-line mode" successful.
Testing has begun.
Please wait 62 minutes for test to complete.
Use smartctl -X to abort test.
Notice that I can test all my drives at the same time, and that I can keep
using my computer meanwhile, albeit more slowly at times, when file
requests collide with the tests. It doesn't matter, it works.
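The "all drives at once" idea can be sketched as a small shell function.
This is only an illustration: the name start_selftests and the SMARTCTL
override variable are made up here (the override just allows a dry run
with "echo"), and the device names are an assumption about your machine.

```shell
# Hypothetical helper: start the same kind of self-test on several drives.
# SMARTCTL is an assumed override so the loop can be dry-run (SMARTCTL=echo).
start_selftests() {
    local kind=$1   # "short" or "long", passed on to smartctl --test=
    shift
    local dev
    for dev in "$@"; do
        # smartctl only sends the command; the drive runs the test itself.
        "${SMARTCTL:-smartctl}" --test="$kind" "$dev"
    done
}
```

For example, "start_selftests long /dev/hda /dev/hdb" (as root) would ask
both drives to begin the extended test, one command after the other.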
I can check the progress of the tests with this command:
nimrodel:~ # smartctl --log=selftest /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status            Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed         00%        5069             -
# 2  Short off-line      Completed         00%        4424             -
# 3  Short off-line      Completed         00%        1868             -
# 4  Short off-line      Completed         00%        345              -
#21  Extended off-line   Test in progress  90%        5069             -
Notice that entry #21 is for the current test, of which only 10% has been
done (90% remaining). Of course, the exact text depends on your HD maker.
Finally, I can see the result. I will print here those of my other drive,
which had some problems some time ago:
nimrodel:~ # smartctl --log=selftest /dev/hdb
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed                00%        4915             -
# 2  Short off-line      Completed: read failure  90%        4272             0x0170169c
# 3  Short off-line      Completed: read failure  90%        4272             0x0170169c
# 4  Short off-line      Completed                00%        1909             -
# 5  Extended off-line   Completed: read failure  90%        1902             0x0060da4e
# 6  Short off-line      Completed                00%        1902             -
# 7  Short off-line      Completed                00%        400              -
#21  Extended off-line   Test in progress         90%        4918             -
Notice that it did a read test of the surface and it failed. Since the HD
has space reserved for this eventuality, it relocated the bad sectors, and
I have continued using the same disk for over a year since then. No problem.
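Picking the failed runs out of such a log can be automated. A minimal
sketch, assuming the log layout shown above, where the last column
(LBA_of_first_error) is "-" for a clean run; the function name
failed_selftests is made up for illustration:

```shell
# Hypothetical helper: read "smartctl --log=selftest" output on stdin and
# print one line per self-test entry that recorded an error LBA.
failed_selftests() {
    awk '
        # Log entries start with "#"; the last three columns are
        # Remaining, LifeTime(hours) and LBA_of_first_error.
        /^[[:space:]]*#/ && $NF != "-" {
            printf "failed at %s hours, first bad LBA %s\n", $(NF-1), $NF
        }
    '
}
```

Usage would be something like
"smartctl --log=selftest /dev/hdb | failed_selftests"; no output means no
logged failures.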
To see the full information, use the option "--all" (the man page says:
"This is equivalent to '-H -i -c -A -l error -l selftest' (for SCSI, '-H
-i -A -l error -l selftest')"). A brief alternative is "--health".
An interesting part of the data is the "Vendor Specific SMART Attributes
with Thresholds" table. It has to be read with care; it is confusing:
nimrodel:~ # smartctl -A /dev/hda
smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID#  ATTRIBUTE_NAME          FLAG    VALUE  WORST  THRESH  TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
  1  Raw_Read_Error_Rate     0x000f  060    053    025     Pre-fail  Always   -            243624074
  3  Spin_Up_Time            0x0003  097    096    000     Pre-fail  Always   -            0
  4  Start_Stop_Count        0x0032  099    099    020     Old_age   Always   -            1189
  5  Reallocated_Sector_Ct   0x0033  097    097    036     Pre-fail  Always   -            39
  7  Seek_Error_Rate         0x000f  084    060    030     Pre-fail  Always   -            278640814
  9  Power_On_Hours          0x0032  092    092    000     Old_age   Always   -            7650
 10  Spin_Retry_Count        0x0013  100    100    097     Pre-fail  Always   -            0
 12  Power_Cycle_Count       0x0032  099    099    020     Old_age   Always   -            1846
194  Temperature_Celsius     0x0022  031    049    000     Old_age   Always   -            31
195  Hardware_ECC_Recovered  0x001a  100    253    000     Old_age   Always   -            0
197  Current_Pending_Sector  0x0012  100    100    000     Old_age   Always   -            0
198  Offline_Uncorrectable   0x0010  100    100    000     Old_age   Offline  -            0
199  UDMA_CRC_Error_Count    0x003e  200    200    000     Old_age   Always   -            0
200  Multi_Zone_Error_Rate   0x0000  100    253    000     Old_age   Offline  -            0
202  TA_Increase_Count       0x0032  100    253    000     Old_age   Always   -            0
Pick one entry, for example "Raw_Read_Error_Rate". Notice the "TYPE"
column; it says "Pre-fail". This does NOT mean that my disk is about to
fail. It means that if this attribute ever fails, that is a pre-failure
warning. The attribute counts as failed when its normalized VALUE (060
here) drops to or below THRESH (025 here). That is the confusing part:
higher is better, and the big RAW_VALUE number is vendor-specific. As
"--health" says the disk is correct, then it is correct.
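The comparison can be done mechanically. As far as I know from the
smartctl man page, an attribute counts as failed when its normalized
VALUE is less than or equal to THRESH. A minimal sketch, assuming the
column order of the "smartctl -A" table above; the function name
failed_attributes is made up for illustration:

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print every
# attribute whose normalized VALUE is at or below its THRESH.
failed_attributes() {
    awk '
        # Data rows start with a numeric ID and carry a hex FLAG in column 3.
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value  = $4 + 0    # "060" is read as decimal 60
            thresh = $6 + 0
            if (value <= thresh)
                printf "%s %s value=%d thresh=%d\n", $2, $7, value, thresh
        }
    '
}
```

Running "smartctl -A /dev/hda | failed_attributes" on the table above
would print nothing, because every VALUE there is above its THRESH (the
closest is Spin_Retry_Count, 100 against 097).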
Automated testing:
-----------------------------
Start the SuSE service with "rcsmartd start". Configuration is done in
/etc/smartd.conf; SuSE puts a sample file there. Comment out the
"DEVICESCAN" line, and manually list your devices with appropriate lines.
For example:
# First (primary) ATA/IDE hard disk. Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# do a short self-test every day at 2am, and a long self test
# Saturdays at 3am.
/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
/dev/hdb -a -o on -S on -s (S/../.././02|L/../../6/03)
(the above is too verbose for my liking)
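To check what a "-s" expression like the one above will match: according
to the smartd.conf man page, smartd builds a string of the form
T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday,
hour) and runs the test when the regular expression matches it. A rough
sketch that uses grep as a stand-in for smartd's own matching (an
approximation, not smartd's code); smartd_matches is a made-up name:

```shell
# Hypothetical helper: does a smartd "-s" regex select a given test slot?
# Arguments: regex, test type (S or L), month, day, weekday (1-7), hour.
smartd_matches() {
    local regex=$1 type=$2 month=$3 day=$4 weekday=$5 hour=$6
    local stamp="$type/$month/$day/$weekday/$hour"
    # -E: extended regex, -x: the whole string must match, -q: quiet.
    printf '%s\n' "$stamp" | grep -Eqx -- "($regex)"
}
```

For example, "smartd_matches '(S/../.././02|L/../../6/03)' L 10 16 6 03"
succeeds because weekday 6 is Saturday and the hour is 03, while the same
call with weekday 1 fails.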
That's all :-)
--
Cheers,
Carlos Robinson