Re: [opensuse] Re: Hard disk crash

31 Jul 2020

      Hello,

On Thu, 30 Jul 2020, Carlos E. R. wrote:
...
On 30/07/2020 21.06, David Haller wrote:
...
On Thu, 30 Jul 2020, Yamaban wrote:
[..]
...
first:
look at the links in this dir: /dev/disk/by-path
esp the lines that contain ata3:
:> ls -l /dev/disk/by-path |grep ata3
that gives you an idea which drive causes that problem.
for more info what drive that is, use the "sd?" drive-id and
look at the content of "/dev/disk/by-id" with that id e.g. "sdd"
:> ls -l /dev/disk/by-id |grep sdd
should give something similar to "ata-Seagate..........." with the
product-name ST4000DM004-2CV1 in it, to exactly identify which drive.
It's easier to use my script ataid_to_drive.sh[1] to translate the
'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not
yet found any other tool that does that (not even 'lsblk -O -a' ;)
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
Telcontar:~ #
It has no instructions, so I don't know what to feed it :-?
You did it correctly; it should have shown something like this:

$ ataid_to_drive.sh ata3.00
ata3.00 is:
[    1.717050] sd 2:0:0:0: [sdc] Attached SCSI disk
[    3.301179] sd 2:0:0:0: Attached scsi generic sg2 type 0

Hm. Seems like the kernel changed the sysfs contents (and thus also
udev's by-path links? I only have e.g. 
by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)...

Ah, yes. This should work:

====
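#!/bin/bash
# Map 'ataX[.YY]' ids (as seen in kernel messages) to the attached sd*/sr* devices.
oIFS="$IFS"
IFS=$'\n'
PATH="/sbin:/usr/sbin:$PATH"
# one array element per ATA/IDE controller line from lspci
CTRLS=( $( lspci | grep 'ATA\|IDE') )
IFS="$oIFS"
for arg; do
    subid=""    # reset, so a ".YY" from a previous argument does not leak
    # strip a leading "ata"
    if test -z "${arg/ata*}"; then
        arg="${arg/ata}"
    fi
    # split "X.YY" into port X and sub-id YY
    if test -z "${arg/*.*}"; then
        ata="${arg%.*}"
        subid="$(printf "%i" "${arg##*.}")";
    else
        ata="$arg"
    fi
    echo "ata${ata}${subid/*/.$(printf "%02i" $subid)} is:"
    # the PCI address is the first word of each lspci line
    for ctrl in ${CTRLS[@]%% *}; do
        CPATH=$(echo "/sys/bus/pci/devices/"*"${ctrl}/")
        # sysfs layout with an ataX/hostN directory below the controller
        if test -d "${CPATH}/ata${ata}/"; then
            host=$(echo "${CPATH}/ata${ata}/"host*)
        fi
        if test -d "${host}"; then
            host="${host##*\/}"
            host="${host/host}"
        else
            # fallback: find the host whose unique_id file matches the port
            idpath="${CPATH}/*/*/*/unique_id"
            host=$(grep "^${ata}$" $idpath 2>/dev/null | \
                sed 's@.*/host\([0-9A-Fa-f]\+\)/.*@\1@')
        fi
        if test -n "$host"; then
            dmesg | grep "\] s[dr] $host:0:$subid.*Attached"
        fi
    done
done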
====

Anyway, this "manual" mode might work:

$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed 's@.*/host@@')
$ dmesg | grep "\] s[dr] ${HOSTID}:0:0.*Attached"

If you have stuff with multiple LUNs (i.e. ata3.1 etc.), you need to
adjust to ${HOSTID}:0:${SUBID} or something...
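
For illustration, the manual steps could be wrapped into two small
helpers. This is only a sketch: the names host_from_path and
attach_pattern are made up here (they are not part of
ataid_to_drive.sh), and the sysfs path in the comment is just an
example of the layout assumed above.

```sh
# Hypothetical helpers, not part of ataid_to_drive.sh.

# Strip everything up to and including the last "/host" from a sysfs
# path, e.g. /sys/bus/pci/devices/0000:00:11.0/ata3/host2 -> 2
host_from_path() {
    printf '%s\n' "$1" | sed 's@.*/host@@'
}

# Build the dmesg grep pattern for HOST and an optional SUBID
# (SUBID defaults to 0, i.e. the ataX.00 case).
attach_pattern() {
    printf '\\] s[dr] %s:0:%s.*Attached\n' "$1" "${2:-0}"
}
```

Usage would then be something like:

    $ p=$(ls -d /sys/bus/pci/devices/*/ata3/host*)
    $ dmesg | grep "$(attach_pattern "$(host_from_path "$p")" 0)"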
...
...
...
If not stop snapper (-timer via systemd), and do so NOW.
Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT'
(that's the easiest to grep for in the kernel-sources) and the drive
barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not
looked at the specs or differences...
Done (with 'mc'):
[..]
Anything interesting there?
That's what I looked at (a bit), but I can't say anything about the
difference between regular and _EXT flush. I mentioned that just as a
good starting point for anyone interested in digging deeper ;)
And yeah, mc is wonderful for digs in the kernel- (and other) sources :))
...
...
...
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug,
or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
Sure. I intend to trigger a long test before going to sleep tonight,
though. Or perhaps, download the seagate test tool and boot it,
running the test from CPU instead of disk firmware because I think it
tests the cable.
Telcontar:~ # smartctl -A /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
That's a bit dusty, not that it matters that much as long as you
update your drivedb.h (use the included 'update-smart-drivedb' ;)
Current seems to be 'smartmontools release 7.1 dated 2019-12-30'...
But I digress once again.

Let me reorder output to better fit commenting ;)
...
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       908
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       10395 (128 204 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       906
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   061   040    Old_age   Always       -       33 (Min/Max 25/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1138
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10360h+29m+01.373s
This all looks very good, you keep it running rather long on average
for a non 24/7 system ;) But who am I to say anything...

$ calc 'printf("Carlos: %.1fh\nDavid:  %.1fh\n", 10395/906, 44876/2714)'
Carlos: ~11.5h
David:  ~16.5h

(that's my SSD with the OSes ;)
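
If 'calc' is not installed, the same division (Power_On_Hours by
Power_Cycle_Count) can be done with awk. A minimal sketch; the
function name hours_per_cycle is made up for this example:

```sh
# Average hours per power cycle: POWER_ON_HOURS / POWER_CYCLE_COUNT,
# printed with one decimal place. LC_ALL=C keeps the decimal point a dot.
hours_per_cycle() {
    LC_ALL=C awk -v h="$1" -v c="$2" 'BEGIN { printf "%.1f\n", h / c }'
}
```

E.g. 'hours_per_cycle 10395 906' for the numbers above.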
...
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       16641668645
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       29138379095
Hm. Is that in 512B or 4K LBAs? Either way, that 4T disk does not get
much use ;) My 128G SSD has ~6.2 TBW, and I tried to spare it for a
long time[1].
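
To put numbers on the 512B vs. 4K question, here is a small sketch
that converts an LBA count to TiB (2^40 bytes) for a given sector
size. The function name lbas_to_tib is made up, and which LBA size
the drive actually counts in remains an assumption you pass in:

```sh
# Convert a Total_LBAs_* raw value to TiB.
#   $1 = LBA count, $2 = assumed bytes per LBA (512 or 4096)
lbas_to_tib() {
    LC_ALL=C awk -v n="$1" -v s="$2" \
        'BEGIN { printf "%.1f\n", n * s / (1024 ^ 4) }'
}
```

With 512-byte LBAs, 'lbas_to_tib 13298779544 512' lands at the ~6.2
quoted for the SSD, and the 16641668645 LBAs written on the 4T disk
come out at well under 10 TiB; with 4096-byte LBAs the latter would
be eight times as much.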
...
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       314
That's an indication of unclean power losses... Nothing to worry about,
but you might want to stop the cause.
...
  1 Raw_Read_Error_Rate     0x000f   081   062   006    Pre-fail  Always       -       139054323
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       701010635
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       139054323
Now, these would be quite suspicious. Were this not a Seagate drive :)
ISTR they come out of the box something like this...
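
One way to look past the huge raw numbers is to compare the
normalized VALUE against THRESH, since an attribute only counts as
failed once VALUE drops to THRESH or below. A rough sketch that reads
'smartctl -A' style rows on stdin and prints the remaining margin for
the Pre-fail attributes; the function name prefail_margin is made up:

```sh
# Print "NAME MARGIN" for every Pre-fail attribute, where
# MARGIN = VALUE - THRESH (columns 4 and 6 of the attribute table).
# Rows that do not start with a numeric ID are skipped.
prefail_margin() {
    awk '$1 ~ /^[0-9]+$/ && $7 == "Pre-fail" { print $2, $4 - $6 }'
}
```

E.g. 'smartctl -A /dev/sdc | prefail_margin'; a margin at or below
zero would be the thing to worry about.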

Anyway, I concur that it's likely a flaky cable. But keep that "cache
flush" issue in mind. You might try to trigger it (while tailing the
messages log): unmount all partitions on that disk and then, after
reading up on it all in 'man hdparm', run:

# sync
# hdparm -f /dev/sdc
# hdparm -F /dev/sdc
# hdparm -y /dev/sdc
maybe:
# hdparm -Y /dev/sdc
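
Before running those, it may be worth double-checking that nothing on
the disk is still mounted. A minimal sketch, assuming the partitions
are mounted directly as /dev/sdcN (it will not see LVM or other
device-mapper layers on top); the function name mounted_parts is made
up:

```sh
# List mounted partitions of DISK (e.g. sdc) as "device mountpoint".
#   $1 = disk name without /dev/, $2 = mounts table (default /proc/mounts)
# Prints nothing when no partition of the disk is mounted.
mounted_parts() {
    awk -v dev="/dev/$1" '
        $1 == dev ||
        (index($1, dev) == 1 && substr($1, length(dev) + 1) ~ /^[0-9]+$/) {
            print $1, $2
        }' "${2:-/proc/mounts}"
}
```

'mounted_parts sdc' should print nothing before you go on.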

BTW: you might want to configure smartd to not log all temp-changes.

Something like:

DEFAULT  -I 190 -I 194 -I 231 -W 1,40,45

or some such, see 'man 5 smartd.conf'... Not sure if you have to
disable the "DEVICESCAN" feature for that.

HTH,
-dnh

[1] (trimmed)
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       44876
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       2714
177 Wear_Leveling_Count     0x0013   092   092   000    Pre-fail  Always       -       256
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       13298779544

-- 
I suppose you drink decaffeinated coffee as well, you pervert.  -- Lionel
