On Fri, Nov 02, Carlos E. R. wrote:
On 02/11/2018 15.27, Michael Fischer wrote:
Ah, yes. Better run the test on all disks.
The SSD produced much less output from `smartctl -a`, but also nothing that suggested errors (good, as that is /).

I've got 2 external (USB-attached) drives which are my backups. smartctl needs `-d sat` to produce output from one of them (happy) and `-d scsi` for the other, which insisted that:

SMART support is: Available - device has SMART capability.
SMART support is: Disabled

I did `$ sudo smartctl -d scsi -s on /dev/sdb`, but to no effect in the output of `$ sudo smartctl -i -d scsi /dev/sdb`. Go figure.

AFAIK, both those external disks are fine, but I'm running badblocks on them now for "grins".
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
Ok, but watch this parameter.
7 Seek_Error_Rate 0x000f 077 060 045 Pre-fail Always - 57728258
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7752
Not an old disk.
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 16
Ah. Yes, this is important.
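As a rough illustration of the check being done by eye here, the sketch below scans attribute rows in the `smartctl -a` layout and reports the sector counters whose raw value is non-zero. The rows are the ones quoted in this thread; the set of watched IDs (5, 197, 198) and the function name are assumptions of this sketch, not anything smartctl itself defines.

```python
# Attribute rows copied from the smartctl output quoted in this thread.
ROWS = """\
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 16
"""

# Attributes whose raw value counts sectors. Watching exactly these three
# IDs is an assumption made for this sketch.
SECTOR_COUNTERS = {5, 197, 198}


def nonzero_sector_counters(text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # A full row has 10 columns: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in SECTOR_COUNTERS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


print(nonzero_sector_counters(ROWS))
```

With the values above, attribute 5 is still zero, while 197 and 198 both report 16 sectors.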
Error 1 occurred at disk power-on lifetime: 7752 hours (323 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.
This section exceeds my skills, sorry. These are errors internal to the disk firmware. And the error is very recent: it occurred at 7752 hours, which is the disk's current power-on count.
Had a couple of "push button" forced restarts, and one complete power outage recently.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Interrupted (host reset)     60%          7752        -
# 2  Short offline       Completed without error      00%          7752        -
# 3  Extended offline    Completed: read failure      50%          7136        1001593016
# 4  Extended offline    Completed: read failure      50%          6296        1001593016
# 5  Extended offline    Completed: read failure      50%          5624        1001593016
# 6  Extended offline    Completed: read failure      50%          4784        1001593016
# 7  Extended offline    Completed: read failure      50%          4112        1001593016
# 8  Extended offline    Completed: read failure      50%          3440        1001593016
# 9  Extended offline    Completed: read failure      50%          2600        1001593016
#10  Extended offline    Completed: read failure      50%          1929        1001593016
#11  Extended offline    Completed: read failure      50%          1257        1021004240
#12  Extended offline    Completed: read failure      50%           585        1001593008
#13  Short offline       Completed without error      00%             0        -
Well, you have to do the long test to be sure. Note that you can run the test while you use the computer: it may just become sluggish or unresponsive. Do not power it off if that happens. Of course, the test will take longer if the computer is busy.
Parameter 197.
[snip]
Besides this, notice that several "extended offline" tests stopped with a read failure, nearly all of them at the same LBA (1001593016). I would rewrite that LBA.
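To make the pattern in the self-test log explicit, here is a small sketch that counts how often each LBA shows up as the first error. The log rows are the ones quoted in this thread; the regular expression and the function name are assumptions of this sketch and may need adjusting for other smartctl versions.

```python
import re
from collections import Counter

# Self-test log rows copied from the smartctl output quoted in this thread.
LOG = """\
# 1 Short captive Interrupted (host reset) 60% 7752 -
# 2 Short offline Completed without error 00% 7752 -
# 3 Extended offline Completed: read failure 50% 7136 1001593016
# 4 Extended offline Completed: read failure 50% 6296 1001593016
# 5 Extended offline Completed: read failure 50% 5624 1001593016
# 6 Extended offline Completed: read failure 50% 4784 1001593016
# 7 Extended offline Completed: read failure 50% 4112 1001593016
# 8 Extended offline Completed: read failure 50% 3440 1001593016
# 9 Extended offline Completed: read failure 50% 2600 1001593016
#10 Extended offline Completed: read failure 50% 1929 1001593016
#11 Extended offline Completed: read failure 50% 1257 1021004240
#12 Extended offline Completed: read failure 50% 585 1001593008
#13 Short offline Completed without error 00% 0 -
"""

# Columns: test number, description plus status, remaining %, lifetime hours,
# and the LBA of the first error ("-" when there was none).
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def failing_lbas(text):
    """Count how many logged tests stopped at each LBA."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(5).isdigit():
            counts[int(match.group(5))] += 1
    return counts


for lba, hits in failing_lbas(LOG).most_common():
    print(lba, hits)
```

In this log, one LBA accounts for eight of the ten failed tests, and the other two LBAs appear once each.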
You could try to find out which file that LBA belongs to, recover the file if possible or replace it with a backup copy, and then write to that LBA. Not trivial. The write operation should trigger the remap.
Google-fu failing me as to how to go from LBA -> fs file(s). Suggestions?
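One commonly described approach for ext2/3/4 (the filesystem in use is not stated in this thread, so that is an assumption) starts by converting the disk LBA into a filesystem block number inside the partition. The sketch below shows only that arithmetic. The partition start sector (2048) and the 4096-byte block size are hypothetical values; the real ones would come from something like `fdisk -l` and `tune2fs -l`. A 512-byte logical sector size is also assumed.

```python
def lba_to_fs_block(lba, part_start, block_size=4096, sector_size=512):
    """Convert a whole-disk LBA to a block number inside one partition.

    lba         -- sector number reported by smartctl (whole-disk addressing)
    part_start  -- first sector of the partition (hypothetical value below)
    block_size  -- filesystem block size in bytes (assumed 4096)
    sector_size -- logical sector size in bytes (assumed 512)
    """
    if lba < part_start:
        raise ValueError("LBA lies before the start of this partition")
    return (lba - part_start) * sector_size // block_size


# The two neighbouring LBAs from the self-test log, with a hypothetical
# partition start at sector 2048.
for lba in (1001593008, 1001593016):
    print(lba, "->", lba_to_fs_block(lba, part_start=2048))
```

On ext2/3/4, `debugfs` can then map the resulting block to an inode with `icheck` and the inode to a path with `ncheck`. Note that with 4096-byte blocks the two LBAs above, 8 sectors apart, land in adjacent filesystem blocks.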
Then run the long test again to see if it stops at another LBA, and repeat until none appears.
You can also run "badblocks" on that disk. This test takes many hours (even days) and has to be done while the disk is unmounted, thus from rescue media. Sometimes this is enough to clear those bad sectors; sometimes they appear again days later. If the command produces a list of bad sectors, then write to them to force a remap.
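One detail to watch when comparing the two tools: badblocks reports block numbers in units of its own block size (1024 bytes by default, changeable with `-b`), while smartctl reports LBAs in sectors. A minimal sketch of the conversion, assuming 512-byte sectors and badblocks run against the whole disk; the block number used here is a hypothetical example, not output from this thread.

```python
def badblock_to_sectors(block, bb_block_size=1024, sector_size=512):
    """Return the range of sector numbers covered by one badblocks block.

    block         -- block number as printed by badblocks
    bb_block_size -- block size badblocks was run with (default 1024 bytes)
    sector_size   -- logical sector size in bytes (assumed 512)
    """
    if bb_block_size % sector_size != 0:
        raise ValueError("block size must be a multiple of the sector size")
    per_block = bb_block_size // sector_size
    first = block * per_block
    return range(first, first + per_block)


# Hypothetical badblocks block number, chosen so that it covers the LBA
# seen in the self-test log.
print(list(badblock_to_sectors(500796508)))
```

If badblocks is run against a partition rather than the whole disk, the partition's start sector would have to be added to the result.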
One method is to overwrite the entire partition with zeros or whatever, then restore the data from backup.
Thanks much Carlos for the detailed response. Much appreciated. Will try the --test=long tonight and report back.

Michael

--
Michael Fischer
michael@visv.net