Warning: some lines in this email are much longer than 72 chars.

On Friday 2004-10-15 at 16:44 +0200, Hylton Conacher (ZR1HPC) wrote:
Tnx for the info and sorry for the belated reply. I have enabled smartctl to perform checks by issuing the command, as root:
smartctl -s on /dev/hdb and smartctl -o /dev/hdb
and hope that will be partly sufficient. The next step is not to worry about the HDD physical structure but to concentrate on making sure the data structure is error-free, i.e. no bad blocks etc., i.e. the e2fsck cmd.
I see you are still confused. OK, I'll try to explain a bit more.

The above line is not needed; I have never used it. What I do is:

Manual testing:
-----------------------------

Launching the short test:

nimrodel:~ # smartctl --test=short /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Use smartctl -X to abort test.

After about one minute, I can see the results using this command:

nimrodel:~ # smartctl --log=selftest /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed                      00%       5069        -
# 2  Short off-line      Completed                      00%       4424        -
# 3  Short off-line      Completed                      00%       1868        -
# 4  Short off-line      Completed                      00%        345        -

Entry #1 holds the results of the last test performed (notice the LifeTime column).

To launch the complete test, I do:

nimrodel:~ # smartctl --test=long /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 62 minutes for test to complete.
Use smartctl -X to abort test.

Notice that I can test all my drives simultaneously, and that I can keep using my computer while the tests run, albeit sometimes slower, when file requests collide with the tests.
Doesn't matter, it works.

I can check the progress of the tests with this command:

nimrodel:~ # smartctl --log=selftest /dev/hda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed                      00%       5069        -
# 2  Short off-line      Completed                      00%       4424        -
# 3  Short off-line      Completed                      00%       1868        -
# 4  Short off-line      Completed                      00%        345        -
#21  Extended off-line   Test in progress               90%       5069        -

Notice that entry #21 is for the current test, of which only 10% has been done. Of course, the exact text depends on your HD maker.

Finally, I can see the result. I will print here those of my other drive, which had some problems some time ago:

nimrodel:~ # smartctl --log=selftest /dev/hdb
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed                      00%       4915        -
# 2  Short off-line      Completed: read failure        90%       4272        0x0170169c
# 3  Short off-line      Completed: read failure        90%       4272        0x0170169c
# 4  Short off-line      Completed                      00%       1909        -
# 5  Extended off-line   Completed: read failure        90%       1902        0x0060da4e
# 6  Short off-line      Completed                      00%       1902        -
# 7  Short off-line      Completed                      00%        400        -
#21  Extended off-line   Test in progress               90%       4918        -

Notice that it did a read test of the surface and it failed. The HD has space reserved for this eventuality, so it relocated the bad records, and I have continued using the same disk for over a year since then. No problem.

To see the long information, you should use the option "--all" (the man page says: "This is equivalent to ´-H -i -c -A -l error -l selftest´ (for SCSI, ´-H -i -A -l error -l selftest´)"). A brief one would be "--health".
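If you want to pick the failed entries out of a self-test log automatically, something along these lines might do. It is only a minimal Python sketch: the names (SelfTestEntry, parse_selftest_log, failed_entries) are made up for illustration, and the pattern assumes the column layout printed by the smartctl 5.1-4 output above. Other versions or drives may word the lines differently.

```python
import re
from typing import NamedTuple, Optional


class SelfTestEntry(NamedTuple):
    """One row of the SMART self-test log (hypothetical helper type)."""
    num: int
    description: str
    status: str
    remaining_percent: int
    lifetime_hours: int
    first_error_lba: Optional[int]  # None when the log shows "-"


# Assumes rows shaped like the smartctl 5.1-4 output quoted in this message.
ENTRY_RE = re.compile(
    r"^#\s*(\d+)\s+"                   # entry number, e.g. "# 1" or "#21"
    r"((?:Short|Extended) off-line)\s+"  # test description
    r"(.+?)\s+"                        # status text
    r"(\d+)%\s+"                       # percentage remaining
    r"(\d+)\s+"                        # drive lifetime in hours
    r"(\S+)\s*$"                       # LBA of first error, or "-"
)


def parse_selftest_log(text):
    """Return a list of SelfTestEntry for every row that matches ENTRY_RE."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if not m:
            continue  # headers, banners and blank lines are skipped
        num, desc, status, remaining, hours, lba = m.groups()
        entries.append(SelfTestEntry(
            int(num), desc, status, int(remaining), int(hours),
            None if lba == "-" else int(lba, 16),
        ))
    return entries


def failed_entries(entries):
    """Entries whose status text mentions a failure."""
    return [e for e in entries if "failure" in e.status]


# The /dev/hdb log from this message, used as sample input.
SAMPLE_LOG = """\
SMART Self-test log, version number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line      Completed                      00%       4915        -
# 2  Short off-line      Completed: read failure        90%       4272        0x0170169c
# 3  Short off-line      Completed: read failure        90%       4272        0x0170169c
# 4  Short off-line      Completed                      00%       1909        -
# 5  Extended off-line   Completed: read failure        90%       1902        0x0060da4e
# 6  Short off-line      Completed                      00%       1902        -
# 7  Short off-line      Completed                      00%        400        -
#21  Extended off-line   Test in progress               90%       4918        -
"""

if __name__ == "__main__":
    for entry in failed_entries(parse_selftest_log(SAMPLE_LOG)):
        print(entry.num, entry.description, hex(entry.first_error_lba))
```

In practice you would feed it the text captured from "smartctl --log=selftest /dev/hdb" instead of the sample string.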
An interesting part of the data is the "Vendor Specific SMART Attributes with Thresholds" table. It has to be read with care; it is confusing:

nimrodel:~ # smartctl -A /dev/hda
smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   060   053   025    Pre-fail  Always       -       243624074
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1189
  5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       39
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       278640814
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7650
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1846
194 Temperature_Celsius     0x0022   031   049   000    Old_age   Always       -       31
195 Hardware_ECC_Recovered  0x001a   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Pick one entry, for example "Raw_Read_Error_Rate". Notice the column "TYPE": it says "Pre-fail". This does NOT mean that my disk is about to fail. It means that if the value is wrong, it indicates a pre-failure notice. A wrong value would be one that drops to 025 (the THRESH column) or below, I think: that is the confusing part. As "--health" says it is correct, then it is correct.

Automated testing:
-----------------------------

Enable the SuSE service with "rcsmartd start". Configuration is done in /etc/smartd.conf. SuSE puts a sample file there.
Comment out the "DEVICESCAN" line, and manually list your devices with appropriate lines. For example:

# First (primary) ATA/IDE hard disk. Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# do a short self-test every day at 2am, and a long self test
# Saturdays at 3am.
/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
/dev/hdb -a -o on -S on -s (S/../.././02|L/../../6/03)

(the above is too verbose for my liking)

That's all :-)

--
Cheers,
Carlos Robinson
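P.S. The "-s" pattern in the smartd.conf lines above may look cryptic, so here is a rough Python sketch of how I understand it to work. The assumptions are mine: smartd builds a string of the form T/MM/DD/d/HH (test type, month, day, day of week with 1 = Monday through 7 = Sunday, hour) and runs a test when the regular expression matches the whole string. smartd itself uses POSIX extended regular expressions; Python's re module is only a stand-in here, and due_tests is a made-up name.

```python
import re
from datetime import datetime

# The schedule from the smartd.conf example above.
SCHEDULE = r"(S/../.././02|L/../../6/03)"


def due_tests(pattern, now):
    """Return the test type letters ("L", "S") whose T/MM/DD/d/HH string
    for the given datetime matches the pattern in full."""
    due = []
    for test_type in ("L", "S"):
        # isoweekday(): Monday is 1 and Sunday is 7, which is the
        # numbering assumed for the "d" field.
        stamp = "{}/{:%m}/{:%d}/{}/{:%H}".format(
            test_type, now, now, now.isoweekday(), now
        )
        if re.fullmatch(pattern, stamp):
            due.append(test_type)
    return due


if __name__ == "__main__":
    # Friday 2004-10-15 at 02:00: only the daily short test is due.
    print(due_tests(SCHEDULE, datetime(2004, 10, 15, 2, 0)))
    # Saturday 2004-10-16 at 03:00: only the weekly long test is due.
    print(due_tests(SCHEDULE, datetime(2004, 10, 16, 3, 0)))
```

Each "." in the pattern stands for "any character", so "S/../.././02" reads as: short test, any month, any day, any weekday, at hour 02.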