Re: [opensuse] Login weirdness

3 Nov 2018

      On 02/11/2018 18.57, Michael Fischer wrote:
...
On Fri, Nov 02, Carlos E. R. wrote:
...
On 02/11/2018 15.27, Michael Fischer wrote:
...
Ah, yes. Better run the test on all disks.
The ssd produced much less output from `smartctl -a` but also
nothing which suggested errors (good, as that is /)
I've got 2 external (usb-attached) drives which are my backups.
smartctl needs `-d sat` to produce output from one of them (happy)
and `-d scsi` for the other, which insisted that
SMART support is:     Available - device has SMART capability.
SMART support is:     Disabled
I did `$ sudo smartctl -d scsi -s on /dev/sdb` but it had no effect on the
output of `$ sudo smartctl -i -d scsi /dev/sdb`
Go figure. AFAIK, both those external disks are fine, but running badblocks
on them now for "grins".
USB disks are problematic with SMART; the enclosure firmware interferes. If
they are recent, smartctl doesn't always know how to access them.

I use "-d sat,12" on mine.
...
...
Separately, notice that there are several "extended offline" tests that
did not complete, all at the same LBA. I would rewrite that LBA.
You could try to find out which file that LBA belongs to, recover the
file if possible or replace it with a backup copy, and then write to that
LBA. Not trivial. The write operation should trigger the remap.
Google-fu failing me as to how to go from LBA -> fs file(s). Suggestions?
Not trivial was an understatement on my part :-(

It is filesystem dependent. I don't have a rule of thumb that always works.

From the LBA and the partition table you can find out the partition
involved. The next step is to work out the sector inside that partition,
which takes some math, and then to find the file. That usually requires
going through the entire list of files, getting the location of each
one, and comparing it with the target sector. Hopefully there is a tool,
specific to the filesystem, that does it.
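
As a rough illustration of that last step, here is a minimal Python sketch
that scans a list of per-file extents for the one covering a target sector.
The file names and extent numbers are made up for the example. On a real
system they would have to come from a filesystem-specific tool, and the
sector is assumed to be counted from the start of the partition.

```python
# Hypothetical extent table: file -> list of (first_sector, sector_count).
# Sectors are counted from the start of the partition. All values are
# made up for illustration; a real table comes from a filesystem tool.
extents = {
    "/home/user/a.txt": [(1000, 16)],
    "/home/user/b.iso": [(2048, 4096), (9000, 512)],
}

def find_file(extents, sector):
    """Return the file whose extents cover `sector`, or None if unused."""
    for name, runs in extents.items():
        for first, count in runs:
            if first <= sector < first + count:
                return name
    return None

print(find_file(extents, 9100))
```

If nothing matches, the sector may be free space or filesystem metadata,
which this sketch does not distinguish.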

Yes, I found articles about it via Google at some point; I should
have taken notes. Hum... where...

Sometimes I'm fortunate. I have a note I wrote describing the procedure,
but the LBA was in the swap partition, so I overwrote it entirely and was done.
...
<3.2> 2016-09-19 13:16:21 Telcontar smartd 1161 - -  Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - -  Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - -  Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - -  Device: /dev/sda [SAT], previous self-test completed with error (read test element)
<3.2> 2016-09-19 13:46:21 Telcontar smartd 1161 - -  Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2116         47894552
# 2  Short offline       Completed without error       00%      2115         -
# 3  Short offline       Completed without error       00%      2108         -
...
Telcontar:/etc # fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes, 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk label type: gpt
#         Start          End    Size  Type            Name
 1         2048        16383      7M  BIOS boot parti primary
 2        16384     41961471     20G  Microsoft basic primary
 3     41961472     73416703     15G  Microsoft basic primary  <====
 4     73416704     75522047      1G  Microsoft basic primary
 5     75522048     77625343      1G  Microsoft basic primary
...
Telcontar:/etc # lsblk --output NAME,KNAME,RA,RM,RO,SIZE,TYPE,FSTYPE,LABEL,PARTLABEL,MOUNTPOINT,UUID,PARTUUID,WWN,MODEL,ALIGNMENT /dev/sda | grep sda3
├─sda3  sda3  512  0  0   15G part  swap              Swap_0      primary   [SWAP]              1cb5f0b4-d92a-4248-926c-0828c1f7eb48 d67674b0-b4d1-4adf-8b3e-e7cdb00703cf                              0
Telcontar:/etc #
So swap_0, sda3.
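
The lookup in that note can be sketched in a few lines of Python. The
partition boundaries are copied from the fdisk listing above (only the
partitions shown there), and the LBA comes from the self-test log. The
4096-byte block size in the last line is an assumption for illustration;
the listing does not state one.

```python
# (number, start, end) in 512-byte sectors, copied from the fdisk listing.
partitions = [
    (1, 2048, 16383),
    (2, 16384, 41961471),
    (3, 41961472, 73416703),
    (4, 73416704, 75522047),
    (5, 75522048, 77625343),
]

def locate(lba, partitions):
    """Return (partition number, sector offset inside it), or None."""
    for number, start, end in partitions:
        if start <= lba <= end:
            return number, lba - start
    return None

number, offset = locate(47894552, partitions)
print(number, offset)        # 3 5933080

# Assumed 4096-byte blocks: offset in blocks from the partition start.
print(offset * 512 // 4096)  # 741635
```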

Here is an article for reiserfs, taken from another of my notes:

http://smartmontools.sourceforge.net/badblockhowto.html#reiserfs_ex

There must be more info in that howto, have a look at it.
...
...
Then run the long test again to see if it stops at another LBA, and
repeat until none appears.
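
To keep track of the LBAs across repeated runs, something like the
following Python sketch could pull them out of the self-test log. It
assumes the column layout of the log excerpt quoted earlier in this thread,
with the LBA as the last field; other smartctl versions or drives may
format the table differently.

```python
# Sample lines copied from the self-test log quoted earlier in the thread.
log = """\
# 1  Extended offline    Completed: read failure       90%      2116         47894552
# 2  Short offline       Completed without error       00%      2115         -
# 3  Short offline       Completed without error       00%      2108         -
"""

def failed_lbas(text):
    """Collect LBA_of_first_error from entries that report a read failure."""
    lbas = []
    for line in text.splitlines():
        if not line.startswith("#") or "read failure" not in line:
            continue
        last = line.split()[-1]  # assumed: LBA is the last column
        if last.isdigit():
            lbas.append(int(last))
    return lbas

print(failed_lbas(log))  # [47894552]
```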
You can also run "badblocks" on that disk. This test takes many hours
(even days) and has to be done while the disk is unmounted, thus from
rescue media. Sometimes this is enough to clear those bad sectors,
sometimes they appear again days later. If the command produces a list of
bad sectors, then write to them to force a remap.
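
One detail to watch: badblocks reports block numbers, not 512-byte sectors.
A minimal Python sketch of the conversion follows, assuming badblocks was
run on the whole disk with its default 1024-byte block size (with a
different `-b` value, or when run on a partition, the numbers change). The
block numbers used here are hypothetical.

```python
def block_to_sectors(block, block_size=1024, sector_size=512):
    """Return (first, last) sector covered by one badblocks block."""
    per_block = block_size // sector_size
    first = block * per_block
    return first, first + per_block - 1

# Hypothetical block numbers, for illustration only.
print(block_to_sectors(23947276))                  # (47894552, 47894553)
print(block_to_sectors(5986819, block_size=4096))  # (47894552, 47894559)
```

On a disk with 512-byte logical and 4096-byte physical sectors, it is
probably safer to treat the whole 8-sector physical block as the unit to
rewrite.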
One method is to overwrite the entire partition with zeros or whatever,
then restore the data from backup.
Thanks much Carlos for the detailed response. Much appreciated.
Will try the --test=long tonight and report back.
Welcome :-)

-- 
Cheers / Saludos,

		Carlos E. R.
		(from 42.3 x86_64 "Malachite" at Telcontar)