[opensuse] Hard Drives Are Such Interesting Devices!
Hi,

This is a follow-up to the thread "Hard Disk Failing."

To recap, SMART reported drive errors of the "...XYZ..." variety on a young and lightly used Western Digital Raptor drive.

It turned out (see below) that any attempt to access any of sectors 261200 through 261343 (a 144-sector range) would trigger retries that ultimately failed. SMART self-tests likewise failed upon reaching the first of these sectors.

Reading some articles on SMART by Bruce Allen (the author of the smartmontools package) suggested that these errors can sometimes be caused by mere discrepancy between the ECC data and the 512 bytes of actual recorded content of a given sector and that there could be many causes for this, including power failures while writing.

I decided to try a simple experiment: I would determine all the sectors that elicited an error when they were read and then rewrite them. I did this by using the "dd_rescue" utility. One of its options (-o) records a list of blocks for which unrecoverable errors were reported by the OS. This is how I obtained the list of 144 sectors that showed read errors.

Note: dd_rescue is apparently not designed to write to /dev/null, and every write operation it attempts to /dev/null yields an error message.

Once I had the list of (supposedly) bad blocks, I simply used an invocation of "dd" (the stock dd, not dd_rescue) to copy zero bytes (supplied by /dev/zero, of course) over the failing sectors.

Voila! After this, the bad sectors could be read without eliciting any error indication at all, requiring no retries nor producing any kernel messages.

The moral: Don't give up easily if you have a young, expensive drive that starts to give you SMART errors!

An interesting aside: The actual capacity of this drive appears to be nearly 7 GB (out of just under 140 GB) _larger_ than specified.

Randall Schulz
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
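The message above does not show the exact "dd" invocation, so the following is only a sketch of what the zero-fill step might look like. It collapses a list of failing sector numbers into contiguous runs and builds (but does not run) one dd command per run. The device name /dev/sdX is a placeholder, the helper names are made up for illustration, and it assumes 512-byte sectors counted from the start of the device being written. Running such a command destroys whatever data was in those sectors.

```python
def contiguous_ranges(sectors):
    """Collapse sector numbers into (first_sector, count) runs."""
    runs = []
    for s in sorted(set(sectors)):
        if runs and s == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([s, 1])       # starts a new run
    return [(first, count) for first, count in runs]


def dd_zero_command(device, first, count, sector_size=512):
    """Build a dd command line that overwrites `count` sectors with zeros.

    Uses only the standard dd operands if=, of=, bs=, seek= and count=.
    The command is returned as a string, never executed here.
    """
    return (f"dd if=/dev/zero of={device} bs={sector_size} "
            f"seek={first} count={count}")


# The failing range reported in the post: sectors 261200 through 261343.
bad_sectors = range(261200, 261344)
runs = contiguous_ranges(bad_sectors)

for first, count in runs:
    # /dev/sdX is a placeholder, not the device from the original thread.
    print(dd_zero_command("/dev/sdX", first, count))
```

Printing the command instead of executing it leaves room to double-check the device and offsets before anything is overwritten.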
On Friday 09 November 2007 16:09, Randall R Schulz wrote:
Hi,
This is a follow-up to the thread "Hard Disk Failing."
To recap, SMART reported drive errors of the "...XYZ..." variety ...
Sorry. I meant to look up the specific error attribute names and fill them in. They are:

- Current_Pending_Sector
- Offline_Uncorrectable

Randall Schulz
* Randall R Schulz <rschulz@sonic.net> [11-09-07 19:12]:
invocation of "dd" (the stock dd, not dd_rescue) to copy zero bytes (supplied by /dev/zero, of course) over the failing sectors.
Voila! After this, the bad sectors could be read without eliciting any error indication at all, requiring no retries nor producing any kernel messages.
The moral: Don't give up easily if you have a young, expensive drive that starts to give you SMART errors!
An interesting aside: The actual capacity of this drive appears to be nearly 7 GB (out of just under 140 GB) _larger_ than specified.
only possible problem, part of those *may* have been deliberately scrambled/disabled by the mfgr because they did not meet some standard and were expected to fail untimely (too soooon).

-- Patrick Shanahan Plainfield, Indiana, USA HOG # US1244711 http://wahoo.no-ip.org Photo Album: http://wahoo.no-ip.org/gallery2 Registered Linux User #207535 @ http://counter.li.org
On Friday 09 November 2007 16:17, Patrick Shanahan wrote:
* Randall R Schulz <rschulz@sonic.net> [11-09-07 19:12]:
...
An interesting aside: The actual capacity of this drive appears to be nearly 7 GB (out of just under 140 GB) _larger_ than specified.
only possible problem, part of those *may* have been deliberately scrambled/disabled by the mfgr because they did not meet some standard and were expected to fail untimely (too soooon).
If so, would they not already have been mapped out of the normal LBA addressing scheme? What it might mean is that they are the unused portion of the reserve capacity designed into the device: spare sectors that the factory's burn-in, testing and validation process could have used (but in this case did not) to replace erroneous sectors.
-- Patrick Shanahan
Randall Schulz
On Friday 09 November 2007 16:09, Randall R Schulz wrote:
...
An interesting aside: The actual capacity of this drive appears to be nearly 7 GB (out of just under 140 GB) _larger_ than specified.
Here's another interesting observation: These seven percent are very close to the ratio of

(1024 * 1024 * 1024) / (1000 * 1000 * 1000)

which is about 1.074. In other words, the drive's nominal 150 "gigabyte" capacity is actually 150 gibibytes.

RRS
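For anyone who wants to check the arithmetic, here is a small sketch of the GiB-versus-GB ratio mentioned above. It only shows the unit conversion; which of the two units a given tool or spec sheet used for this particular drive is not established in the thread, and the 150 figure is simply the nominal capacity quoted in the message.

```python
GIB = 1024 ** 3   # bytes in one gibibyte (binary prefix)
GB = 1000 ** 3    # bytes in one gigabyte (decimal prefix)

# Ratio between the two units, the "seven percent" from the message.
ratio = GIB / GB
print(f"GiB / GB = {ratio:.9f}")

# The nominal figure of 150, read first as decimal GB and then as GiB.
nominal = 150
as_decimal_bytes = nominal * GB
as_binary_bytes = nominal * GIB

print(f"150 GB  = {as_decimal_bytes} bytes = {as_decimal_bytes / GIB:.2f} GiB")
print(f"150 GiB = {as_binary_bytes} bytes = {as_binary_bytes / GB:.2f} GB")
```

Note that the same number of bytes reads as a noticeably smaller figure in GiB than in GB, which is one common reason two tools disagree about a drive's size.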
On Friday 09 November 2007 16:09, Randall R Schulz wrote:
Hi,
This is a follow-up to the thread "Hard Disk Failing."
...
A final note:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4810         -
# 2  Extended offline    Completed: read failure       90%      4805         261202
# 3  Extended offline    Completed: read failure       90%      4766         261202
# 4  Extended offline    Completed: read failure       90%      4762         261202

The most recent test (# 1) shows that the errors previously reported have now been corrected! Woo-Hoo!

Randall Schulz
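The self-test log quoted above can also be read by a script. This is a rough sketch, not a general smartctl parser: it assumes each entry starts with "#" and ends with the three columns Remaining, LifeTime(hours) and LBA_of_first_error, as in the log shown. The function name and dictionary keys are invented for illustration.

```python
SELFTEST_LOG = """\
# 1  Extended offline    Completed without error       00%      4810         -
# 2  Extended offline    Completed: read failure       90%      4805         261202
# 3  Extended offline    Completed: read failure       90%      4766         261202
# 4  Extended offline    Completed: read failure       90%      4762         261202
"""


def parse_selftest_line(line):
    """Split one log entry, taking the last three columns from the right."""
    head, remaining, lifetime, lba = line.rsplit(None, 3)
    return {
        "num": int(head.lstrip("#").split()[0]),
        "read_failure": "read failure" in head,
        "remaining_pct": int(remaining.rstrip("%")),
        "lifetime_hours": int(lifetime),
        # "-" in the last column means no failing LBA was recorded.
        "first_error_lba": None if lba == "-" else int(lba),
    }


entries = [parse_selftest_line(line)
           for line in SELFTEST_LOG.splitlines()
           if line.startswith("#")]

latest = entries[0]   # smartctl lists the most recent test first
failing_lbas = {e["first_error_lba"] for e in entries if e["read_failure"]}

print("latest test had a read failure:", latest["read_failure"])
print("LBAs reported by earlier failed tests:", sorted(failing_lbas))
```

The failing LBA reported by the three earlier tests lies inside the 261200 through 261343 range that was rewritten.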
Randall R Schulz wrote:
The moral: Don't give up easily if you have a young, expensive drive that starts to give you SMART errors!
Very interesting, thanks.

I also have (on a Windows machine) a drive that the SMART monitor has flagged as "near to fail" for 3 years now, and it works without problems.

jdd
-- http://www.dodin.net http://www.ladepeche.fr/article/2007/10/27/127022-Claire-Dodin-une-Toulousai...
Randall R Schulz wrote:
Hi,
This is a follow-up to the thread "Hard Disk Failing."
...
Randall Schulz
This and the other posts are very useful ... something for the archive, thx

--
I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone.
Bjarne Stroustrup
The Friday 2007-11-09 at 16:09 -0800, Randall R Schulz wrote:
Once I had the list of (supposedly) bad blocks, I simply used an invocation of "dd" (the stock dd, not dd_rescue) to copy zero bytes (supplied by /dev/zero, of course) over the failing sectors.
Voila! After this, the bad sectors could be read without eliciting any error indication at all, requiring no retries nor producing any kernel messages.
I believe you may have simply triggered remapping of those bad sectors. You can discover if that's so because in smartctl output one of the lines counts them.
The moral: Don't give up easily if you have a young, expensive drive that starts to give you SMART errors!
Obviously :-) A certain percentage of bad sectors is to be expected.

-- Cheers, Carlos E. R.
On Saturday 10 November 2007 06:04, Carlos E. R. wrote:
The Friday 2007-11-09 at 16:09 -0800, Randall R Schulz wrote:
Once I had the list of (supposedly) bad blocks, I simply used an invocation of "dd" (the stock dd, not dd_rescue) to copy zero bytes (supplied by /dev/zero, of course) over the failing sectors.
Voila! After this, the bad sectors could be read without eliciting any error indication at all, requiring no retries nor producing any kernel messages.
I believe you may have simply triggered remapping of those bad sectors. You can discover if that's so because in smartctl output one of the lines counts them.
Is that the "Reallocated_Event_Count"?

196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

If so, and if I'm reading that correctly, then these sectors were not remapped.
... Cheers, Carlos E. R.
Randall Schulz
On Saturday 10 November 2007 06:44, Randall R Schulz wrote:
On Saturday 10 November 2007 06:04, Carlos E. R. wrote:
The Friday 2007-11-09 at 16:09 -0800, Randall R Schulz wrote:
Once I had the list of (supposedly) bad blocks, I simply used an invocation of "dd" (the stock dd, not dd_rescue) to copy zero bytes (supplied by /dev/zero, of course) over the failing sectors.
Voila! After this, the bad sectors could be read without eliciting any error indication at all, requiring no retries nor producing any kernel messages.
I believe you may have simply triggered remapping of those bad sectors. You can discover if that's so because in smartctl output one of the lines counts them.
Is that the "Reallocated_Event_Count"?
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
Or "Reallocated_Sector_Ct"?

5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

RRS
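The two attribute lines quoted in this exchange can be checked with a few lines of script. This sketch assumes the usual smartctl attribute-table column order (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value) and assumes that the raw value of these two attributes is a plain count, which is the common convention but ultimately vendor-defined. The field names and the helper are made up for illustration.

```python
QUOTED_ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
"""

FIELDS = ("id", "name", "flag", "value", "worst", "thresh",
          "type", "updated", "when_failed", "raw")


def parse_attribute(line):
    """Split one attribute row into named fields; raw stays a string."""
    parts = line.strip().split(None, len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected attribute line: {line!r}")
    rec = dict(zip(FIELDS, parts))
    for key in ("id", "value", "worst", "thresh"):
        rec[key] = int(rec[key])
    return rec


attrs = {rec["name"]: rec
         for rec in map(parse_attribute, QUOTED_ATTRIBUTES.splitlines())}

# Raw values can carry extra text for some attributes, so take the
# first token only (assumed to be the count for these two).
counts = {name: int(attrs[name]["raw"].split()[0])
          for name in ("Reallocated_Sector_Ct", "Reallocated_Event_Count")}

remapped = any(counts.values())
print(counts)
print("sectors appear to have been remapped:", remapped)
```

With both raw values at 0, neither counter suggests that the rewritten sectors were remapped, which matches the reading given above.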
participants (5)
- Carlos E. R.
- G T Smith
- jdd
- Patrick Shanahan
- Randall R Schulz