On 02/11/2018 18.57, Michael Fischer wrote:
On Fri, Nov 02, Carlos E. R. wrote:
On 02/11/2018 15.27, Michael Fischer wrote:
Ah, yes. Better run the test on all disks.
The SSD produced much less output from `smartctl -a`, but also nothing that suggested errors (good, as that is /).
I've got 2 external (usb-attached) drives which are my backups.
smartctl needs a `-d sat` to produce output from one of them (happy) and `-d scsi` for the other, which insisted that

SMART support is: Available - device has SMART capability.
SMART support is: Disabled
I did `$ sudo smartctl -d scsi -s on /dev/sdb` but to no effect in the output of `$ sudo smartctl -i -d scsi /dev/sdb`
Go figure. AFAIK, both those external disks are fine, but running badblocks on them now for "grins".
USB disks are problematic with SMART: the enclosure firmware interferes. If they are recent, the program doesn't always know how to access them. I use "-d sat,12" on mine.
Concurrent to this, notice that there are several "extended offline" tests that did not complete, all at the same LBA. I would rewrite that LBA.
You could try to find out which file that LBA belongs to, recover the file if possible or replace it with a backup copy, and then write to that LBA. Not trivial. The write operation should trigger the remap.
Google-fu failing me as to how to go from LBA -> fs file(s). Suggestions?
"Not trivial" was an understatement on my part :-(

It is filesystem dependent; I don't have a rule of thumb that always works. From the LBA and the partition table you can find out which partition is involved. The next step is to work out the sector inside that partition, doing some math, and then find the file. That usually means going through the entire list of files, looking up the location of each one, and comparing it with the target sector. Hopefully there is a tool, specific to the filesystem, that does it.

Yes, there are articles on it that I found via Google at some point; I should have taken notes. Hum... where...

Sometimes I'm fortunate. I have a note I wrote describing the procedure, but in that case the LBA was in the swap partition, so I simply overwrote the whole partition and was done.
<3.2> 2016-09-19 13:16:21 Telcontar smartd 1161 - - Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - - Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - - Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - - Device: /dev/sda [SAT], previous self-test completed with error (read test element)
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - - Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2116         47894552
# 2  Short offline       Completed without error       00%      2115         -
# 3  Short offline       Completed without error       00%      2108         -
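To check whether several failed runs really stop at the same LBA, the self-test log can be read mechanically. This is only a minimal sketch: it assumes the row layout shown in the log above (the column spacing in the sample is a reconstruction), and `failed_lbas` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Sample text reproducing the self-test log quoted above.
SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2116         47894552
# 2  Short offline       Completed without error       00%      2115         -
# 3  Short offline       Completed without error       00%      2108         -
"""

# A row starts with "#" and ends with: remaining%, lifetime hours, LBA or "-".
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\d+|-)\s*$")


def failed_lbas(log_text):
    """Return the LBA_of_first_error of every row that reports one."""
    lbas = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match and match.group(5) != "-":
            lbas.append(int(match.group(5)))
    return lbas


lbas = failed_lbas(SAMPLE_LOG)
print(lbas, "same LBA every time" if len(set(lbas)) == 1 else "different LBAs")
```

If every failed run reports the same number, that single LBA is the one to rewrite first.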
Telcontar:/etc # fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt

#         Start          End    Size  Type            Name
 1         2048        16383      7M  BIOS boot parti primary
 2        16384     41961471     20G  Microsoft basic primary
 3     41961472     73416703     15G  Microsoft basic primary  <====
 4     73416704     75522047      1G  Microsoft basic primary
 5     75522048     77625343      1G  Microsoft basic primary

Telcontar:/etc # lsblk --output NAME,KNAME,RA,RM,RO,SIZE,TYPE,FSTYPE,LABEL,PARTLABEL,MOUNTPOINT,UUID,PARTUUID,WWN,MODEL,ALIGNMENT /dev/sda | grep sda3
├─sda3 sda3 512 0 0 15G part swap Swap_0 primary [SWAP] 1cb5f0b4-d92a-4248-926c-0828c1f7eb48 d67674b0-b4d1-4adf-8b3e-e7cdb00703cf 0
Telcontar:/etc #
So Swap_0, sda3.

Here is an article for reiserfs, taken from another of my notes: http://smartmontools.sourceforge.net/badblockhowto.html#reiserfs_ex
There must be more info in that howto; have a look at it.
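The "some math" step from my note can be sketched like this. The partition boundaries and the LBA are the ones from the fdisk listing and the self-test log; the function names and the 4096-byte filesystem block size are assumptions for illustration (swap has no filesystem blocks, and a real filesystem may use a different block size).

```python
SECTOR_BYTES = 512  # logical sector size reported by fdisk

# (partition number, start sector, end sector), copied from the fdisk listing.
PARTITIONS = [
    (1, 2048, 16383),
    (2, 16384, 41961471),
    (3, 41961472, 73416703),
    (4, 73416704, 75522047),
    (5, 75522048, 77625343),
]


def locate(lba, partitions=PARTITIONS):
    """Return (partition number, sector offset inside it), or None if unallocated."""
    for number, start, end in partitions:
        if start <= lba <= end:
            return number, lba - start
    return None


def fs_block(sector_offset, block_bytes=4096):
    """Convert a sector offset inside a partition to a filesystem block number.

    block_bytes is an assumed value; check the real one for your filesystem.
    """
    return sector_offset * SECTOR_BYTES // block_bytes


part, offset = locate(47894552)
print(part, offset, fs_block(offset))  # prints: 3 5933080 741635
```

The block number is the figure a filesystem-specific tool would typically want when asked which file owns a block; which tool that is depends on the filesystem.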
Then run the long test again to see whether it stops at another LBA, and repeat until none appears.
You can also run "badblocks" on that disk. This test takes many hours (even days) and has to be done while the disk is unmounted, thus from rescue media. Sometimes this is enough to clear those bad sectors; sometimes they appear again days later. If the command produces a list of bad sectors, then write to them to force a remap.
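Turning a badblocks list into write commands takes a unit conversion, because badblocks reports block numbers in its own block size (1024 bytes unless changed with -b), not in 512-byte sectors. A minimal sketch, assuming the list was produced with the default block size; `dd_commands` is a hypothetical helper and /dev/sdX is a placeholder. It only builds the command strings and runs nothing.

```python
def dd_commands(bad_blocks, device, block_bytes=1024, sector_bytes=512):
    """Build one dd command per bad block, addressed in 512-byte sectors.

    block_bytes must match the block size badblocks was run with
    (assumed here to be the 1024-byte default).
    """
    sectors_per_block = block_bytes // sector_bytes
    commands = []
    for block in sorted(set(bad_blocks)):
        first_sector = block * sectors_per_block
        commands.append(
            f"dd if=/dev/zero of={device} bs={sector_bytes} "
            f"seek={first_sector} count={sectors_per_block} oflag=direct"
        )
    return commands


# Example with made-up block numbers; review every line before running any of it.
for command in dd_commands([100, 101], "/dev/sdX"):
    print(command)
```

Each command zeroes the sectors it names, so whatever file lived there has to come back from backup afterwards.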
One method is to overwrite the entire partition with zeros or whatever, then restore the data from backup.
Thanks much Carlos for the detailed response. Much appreciated.
Will try the --test=long tonight and report back.
Welcome :-)

-- 
Cheers / Saludos,

		Carlos E. R.
		(from 42.3 x86_64 "Malachite" at Telcontar)