Re: [SLE] Re: {SLE} SMART HDD technology was Re: [SLE] e2fsck command

18 Oct 2004

      Warning: some email lines way longer than 72 chars.

On Friday 2004-10-15 at 16:44 +0200, Hylton Conacher (ZR1HPC) wrote:
...
Tnx for the info and sorry for the belated reply. I have enabled smartctl to
perform checks by issuing the command, as root:
smartctl -s on /dev/hdb and smartctl -o /dev/hdb
and hope that will be partly sufficient. The next step is not to worry about
the HDD physical structure but to concentrate on making sure the data
structure is error free, ie no bad blocks etc, ie the e2fsck cmd.

I see you are still confused. Ok, I'll try to explain a bit more. The
command line above is not needed; I have never used it. What I do is:

    Manual testing:
-----------------------------

Launching the short test:

	nimrodel:~ # smartctl --test=short /dev/hda     
	smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
	Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
	Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
	Testing has begun.
	Please wait 1 minutes for test to complete.
	Use smartctl -X to abort test.

After about one minute, I can see the results, using this command:

	nimrodel:~ # smartctl --log=selftest /dev/hda     
	smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF READ SMART DATA SECTION ===
	SMART Self-test log, version number 1
	Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
	# 1  Short off-line      Completed                     00%      5069         -
	# 2  Short off-line      Completed                     00%      4424         -
	# 3  Short off-line      Completed                     00%      1868         -
	# 4  Short off-line      Completed                     00%       345         -

Entry #1 is the most recent test performed (notice the LifeTime
column). To launch the complete (extended) test, I do:

	nimrodel:~ # smartctl --test=long /dev/hda
	smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
	Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
	Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
	Testing has begun.
	Please wait 62 minutes for test to complete.
	Use smartctl -X to abort test.

Notice that I can test all my drives simultaneously, and that I can
keep using my computer meanwhile, albeit slower at times, when
file requests collide with the tests. It doesn't matter, it works.
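
A minimal sketch of starting the long test on several drives in one go.
It relies only on the "--test=long" option shown above. The function
name and the SMARTCTL variable are hypothetical, added here so that the
loop can be dry-run without touching a disk:

```shell
# Start an extended self-test on every drive given as an argument.
# SMARTCTL is a hypothetical override for dry runs, for example
# SMARTCTL="echo smartctl"; by default the real smartctl is called.
start_long_tests() {
    for dev in "$@"; do
        # smartctl returns as soon as the drive has accepted the test,
        # so the next drive can be started straight away.
        ${SMARTCTL:-smartctl} --test=long "$dev"
    done
}
```

As root, it would be called as: start_long_tests /dev/hda /dev/hdb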

I can check the progress of the tests with this command:

	nimrodel:~ # smartctl --log=selftest /dev/hda
	smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF READ SMART DATA SECTION ===
	SMART Self-test log, version number 1
	Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
	# 1  Short off-line      Completed                     00%      5069         -
	# 2  Short off-line      Completed                     00%      4424         -
	# 3  Short off-line      Completed                     00%      1868         -
	# 4  Short off-line      Completed                     00%       345         -
	#21  Extended off-line   Test in progress              90%      5069         -
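
The "Remaining" column counts down, so the figure has to be subtracted
from 100 to get the part already done. A small sketch that does this
for the log text fed to it on standard input; it assumes the log layout
shown in this message (other drives or smartctl versions may word the
status differently), and the function name is hypothetical:

```shell
# Read "smartctl --log=selftest" output on stdin and print how much of
# the running self-test is done. Exits with status 1 if no test is
# in progress.
selftest_progress() {
    awk '
        /Test in progress/ {
            # Find the field that looks like "90%" (the Remaining column).
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    remaining = $i
                    sub(/%/, "", remaining)
                    printf "%d%% done\n", 100 - remaining
                    found = 1
                }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Typical use would be: smartctl --log=selftest /dev/hda | selftest_progress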

Notice that entry #21 is the current test, of which only 10% has been
done (90% remaining). Of course, the exact text depends on your HD maker.
Finally, I can see the result. I will print here those of my other drive,
which had some problems some time ago:

	nimrodel:~ # smartctl --log=selftest /dev/hdb
	smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF READ SMART DATA SECTION ===
	SMART Self-test log, version number 1
	Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
	# 1  Short off-line      Completed                     00%      4915         -
	# 2  Short off-line      Completed: read failure       90%      4272         0x0170169c
	# 3  Short off-line      Completed: read failure       90%      4272         0x0170169c
	# 4  Short off-line      Completed                     00%      1909         -
	# 5  Extended off-line   Completed: read failure       90%      1902         0x0060da4e
	# 6  Short off-line      Completed                     00%      1902         -
	# 7  Short off-line      Completed                     00%       400         -
	#21  Extended off-line   Test in progress              90%      4918         -

Notice that the read test of the surface failed. The HD has spare space
reserved for this eventuality, so it relocated the bad sectors, and I
have continued using the same disk for over a year since then. No problem.
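
To pick the failures out of a long log, something like the following
sketch may help. It reads the log text on standard input, keeps the
lines marked "read failure", and prints each distinct failing LBA in
hex and in decimal. It assumes the log layout shown above, with the LBA
as the last field; the function name is hypothetical:

```shell
# Read "smartctl --log=selftest" output on stdin and list each distinct
# LBA at which a self-test reported a read failure.
selftest_failures() {
    # Keep only failure lines whose last field is a hex LBA.
    awk '/read failure/ && $NF ~ /^0x/ { print $NF }' |
        sort -u |
        while read -r lba; do
            # printf accepts the 0x... form and converts it for %d.
            printf 'read failure at LBA %s (decimal %d)\n' "$lba" "$lba"
        done
}
```

Typical use would be: smartctl --log=selftest /dev/hdb | selftest_failures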

To see the full information, use the option "--all" (the man page says:
"This is equivalent to ´-H -i -c -A -l error -l selftest´ (for SCSI, ´-H
-i -A -l error -l selftest´)"). A brief check is "--health".

An interesting part of the data is the "Vendor Specific SMART Attributes
with Thresholds" table. It has to be read with care, as it is confusing:

	nimrodel:~ # smartctl -A /dev/hda
	smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
	Home page is http://smartmontools.sourceforge.net/

	=== START OF READ SMART DATA SECTION ===
	SMART Attributes Data Structure revision number: 16
	Vendor Specific SMART Attributes with Thresholds:
	ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
	  1 Raw_Read_Error_Rate     0x000f   060   053   025    Pre-fail  Always       -       243624074
	  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
	  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1189
	  5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       39
	  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       278640814
	  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7650
	 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
	 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1846
	194 Temperature_Celsius     0x0022   031   049   000    Old_age   Always       -       31
	195 Hardware_ECC_Recovered  0x001a   100   253   000    Old_age   Always       -       0
	197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
	198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
	199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
	200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
	202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Pick one entry, for example "Raw_Read_Error_Rate". Notice the column
"TYPE": it says "Pre-fail". This does NOT mean that my disk is about to
fail. It means that if the value goes wrong, that indicates a pre-failure
condition. A wrong value would be a VALUE at or below THRESH (025 here,
while VALUE is 060): that is the confusing part. As "--health" says the
disk is correct, it is correct.
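
The comparison can be sketched as a small filter. It reads the table on
standard input and prints only the attributes whose normalized VALUE is
at or below THRESH, which is the condition under which smartctl treats
an attribute as failed. It assumes the column order shown above; the
function name is hypothetical:

```shell
# Read "smartctl -A" output on stdin and print every attribute whose
# normalized VALUE has dropped to or below its THRESH.
# Columns assumed: ID# NAME FLAG VALUE WORST THRESH TYPE ...
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
        value  = $4 + 0      # "060" becomes 60
        thresh = $6 + 0      # "025" becomes 25
        if (value <= thresh)
            printf "%s (%s): VALUE %d <= THRESH %d\n", $2, $7, value, thresh
    }'
}
```

Typical use would be: smartctl -A /dev/hda | failing_attributes
With the table above it should print nothing, since every VALUE is
above its THRESH.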

    Automated testing:
-----------------------------

Start the SuSE service with "rcsmartd start". Configuration is done in
/etc/smartd.conf; SuSE puts a sample file there. Comment out the
"DEVICESCAN" line, and list your devices manually with appropriate lines.
For example:

# First (primary) ATA/IDE hard disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# do a short self-test every day at 2am, and a long self test
# Saturdays at 3am.
/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
/dev/hdb -a -o on -S on -s (S/../.././02|L/../../6/03)

(the above is too verbose for my liking)
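
The "-s" argument is a regular expression that smartd matches against a
string of the form T/MM/DD/d/HH (test type, month, day of month, day of
week with 1 = Monday to 7 = Sunday, hour). A sketch for checking by hand
when a schedule would fire, using grep as a stand-in for smartd's own
matching; the function name and its arguments are hypothetical:

```shell
# Return success if a smartd "-s" schedule regex matches the given moment.
#   $1  the regex, e.g. '(S/../.././02|L/../../6/03)'
#   $2  test type: S (short) or L (long)
#   $3  moment as MM/DD/d/HH, e.g. from: date '+%m/%d/%u/%H'
schedule_fires() {
    printf '%s\n' "$2/$3" | grep -Eq "$1"
}
```

For example, this asks whether a long test is due in the current hour:
schedule_fires '(S/../.././02|L/../../6/03)' L "$(date '+%m/%d/%u/%H')"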

That's all :-)

-- 
Cheers,
       Carlos Robinson

Carlos E. R.