Hello, On Thu, 30 Jul 2020, Carlos E. R. wrote:
On 30/07/2020 21.06, David Haller wrote:
On Thu, 30 Jul 2020, Yamaban wrote: [..]
First: look at the links in /dev/disk/by-path, especially the lines that contain ata3:

:> ls -l /dev/disk/by-path | grep ata3

That gives you an idea which drive causes the problem. For more info on what drive that is, take the "sd?" drive-id and look at the content of /dev/disk/by-id for that id, e.g. "sdd":

:> ls -l /dev/disk/by-id | grep sdd

That should give something similar to "ata-Seagate..........." with the product name ST4000DM004-2CV1 in it, to identify exactly which drive it is.
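The by-id lookup can also be done by resolving the symlinks instead of grepping the listing. A minimal sketch, assuming a `readlink` that supports `-f` (GNU coreutils does); the function name `links_for_dev` is made up for illustration and is not part of any existing tool:

```shell
#!/bin/bash
# links_for_dev DIR DEV: print the names of all symlinks in DIR that
# resolve to the same node as DEV (e.g. DIR=/dev/disk/by-id, DEV=/dev/sdd).
# Hypothetical helper, only a sketch.
links_for_dev() {
    local dir="$1" target link
    target="$(readlink -f -- "$2")" || return 1
    for link in "$dir"/*; do
        # skip anything that is not a symlink
        [ -L "$link" ] || continue
        if [ "$(readlink -f -- "$link")" = "$target" ]; then
            printf '%s\n' "${link##*/}"
        fi
    done
}
```

Something like `links_for_dev /dev/disk/by-id /dev/sdd` should then list only the names for the whole disk. Unlike the grep, it does not match the partition links (sdd1 etc.), because those resolve to different nodes.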
It's easier to use my script ataid_to_drive.sh[1] to translate the 'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not yet found any other tool that does that (not even 'lsblk -O -a' ;)
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
Telcontar:~ #
It has no instructions, so I don't know what to feed it :-?
You did it correctly, it should have shown something like this:

$ ataid_to_drive.sh ata3.00
ata3.00 is:
[ 1.717050] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 3.301179] sd 2:0:0:0: Attached scsi generic sg2 type 0

Hm. Seems like the kernel changed the sysfs contents (and thus also udev's by-path links? I only have e.g. by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)... Ah, yes. This should work:

====
#!/bin/bash
oIFS="$IFS"
IFS=$'\n'
PATH="/sbin:/usr/sbin:$PATH"
# one array element per ATA/IDE controller line from lspci
CTRLS=( $( lspci | grep 'ATA\|IDE') )
IFS="$oIFS"
for arg; do
    # strip a leading "ata", then split "N.YY" into port and sub-id
    if test -z "${arg/ata*}"; then
        arg="${arg/ata}"
    fi
    if test -z "${arg/*.*}"; then
        ata="${arg%.*}"
        subid="$(printf "%i" "${arg##*.}")";
    else
        ata="$arg"
    fi
    echo "ata${ata}${subid/*/.$(printf "%02i" $subid)} is:"
    # ${CTRLS[@]%% *} is the PCI address (first field) of each controller
    for ctrl in ${CTRLS[@]%% *}; do
        CPATH=$(echo "/sys/bus/pci/devices/"*"${ctrl}/")
        if test -d "${CPATH}/ata${ata}/"; then
            host=$(echo "${CPATH}/ata${ata}/"host*)
        fi
        if test -d "${host}"; then
            host="${host##*\/}"
            host="${host/host}"
        else
            # fallback: find the host number via its unique_id file
            idpath="${CPATH}/*/*/*/unique_id"
            host=$(grep "^${ata}$" $idpath 2>/dev/null | \
                sed 's@.*/host\([0-9A-Fa-f]\+\)/.*@\1@')
        fi
        if test -n "$host"; then
            dmesg | grep "\] s[dr] $host:0:$subid.*Attached"
        fi
    done
done
====

Anyway, this "manual" mode might work:

$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed 's@.*/host@@')
$ dmesg | grep "\] s[dr] ${HOSTID}:0:0.*Attached"

If you have stuff with multiple LUNs (i.e. ata3.1 etc.), you need to adjust to ${HOSTID}:0:${SUBID} or something...
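For anyone adapting this: splitting the 'ataX[.YY]' argument into port and sub-id can be isolated and tested without touching sysfs. A minimal sketch, not a replacement for the script; `parse_ataid` is a hypothetical name. Defaulting the sub-id to 0 and forcing base 10 with `10#` (so that a sub-id like `08` is not rejected as invalid octal) are my assumptions, not something the script does:

```shell
#!/bin/bash
# parse_ataid ID: split 'ata3.00', '3.00', 'ata3' or '3' into port number
# and sub-id; prints "PORT SUBID" (sub-id defaults to 0 if not given).
# Hypothetical helper, only a sketch.
parse_ataid() {
    local id="${1#ata}" port sub
    port="${id%%.*}"
    if [[ "$id" == *.* ]]; then
        sub="${id#*.}"
    else
        sub=0
    fi
    # reject anything that is not purely numeric
    [[ "$port" =~ ^[0-9]+$ && "$sub" =~ ^[0-9]+$ ]] || return 1
    # 10# forces decimal, so "08" does not fail as an invalid octal number
    printf '%d %d\n' "$((10#$port))" "$((10#$sub))"
}
```

With that, `read -r PORT SUBID < <(parse_ataid ata3.00)` would give the `${SUBID}` to plug into the dmesg grep.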
If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT' (that's the easiest to grep for in the kernel-sources) and the drive barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not looked at the specs or differences...
Done (with 'mc'): [..] Anything interesting there?
That's what I looked at (and a bit), but I can't say anything about the difference between regular and _EXT flush. I mentioned that just as a good starting point for anyone interested to go digging deeper ;) And yeah, mc is wonderful for digs in the kernel- (and other) sources :))
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
Sure. I intend to trigger a long test before going to sleep tonight, though. Or perhaps, download the seagate test tool and boot it, running the test from CPU instead of disk firmware because I think it tests the cable.
Telcontar:~ # smartctl -A /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
That's a bit dusty, not that it matters that much as long as you update your drivedb.h (use the included 'update-smart-drivedb' ;) Current seems to be 'smartmontools release 7.1 dated 2019-12-30'... But I digress once again. Let me reorder output to better fit commenting ;)
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0003   097   096   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           908
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always   -           10395 (128 204 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           906
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always   -           0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   067   061   040    Old_age   Always   -           33 (Min/Max 25/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           1138
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always   -           33 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           10360h+29m+01.373s
This all looks very good, you keep it running rather long on average for a non 24/7 system ;) But who am I to say anything...

$ calc 'printf("Carlos: %.1fh\nDavid: %.1fh\n", 10395/906, 44876/2714)'
Carlos: ~11.5h
David: ~16.5h

(that's my SSD with the OSes ;)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           16641668645
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           29138379095
Hm. Is that in 512B or 4K LBAs? either way, that 4T disk does not get much use ;) My 128G SSD has ~6.2 TBW, and I tried to spare it for a long time[1].
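To put numbers on that, here is a quick sketch converting the raw values under either assumption. Which unit the drive actually counts in is not something I know; the helper name `lba_to_tb` is made up for illustration, and "TB" here means decimal terabytes (10^12 bytes):

```shell
#!/bin/bash
# lba_to_tb COUNT SECTOR_BYTES: print COUNT sectors as decimal terabytes
# (1 TB = 10^12 bytes), using awk for the floating-point division.
# Hypothetical helper, only a sketch.
lba_to_tb() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", n * s / 1e12 }'
}

lba_to_tb 16641668645 512     # 4T disk, Total_LBAs_Written, 512-byte units
lba_to_tb 16641668645 4096    # same raw value, 4096-byte units
lba_to_tb 13298779544 512     # 128G SSD, Total_LBAs_Written, 512-byte units
```

So the 4T disk has written roughly 8.5 TB if the unit is 512 bytes, or about 68 TB if it is 4K. The SSD comes out at about 6.8 TB in 512-byte units, which is roughly 6.2 TiB and so consistent with the ~6.2 figure above.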
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 314
That's an indication of unclean power losses... Nothing to worry about, but you might want to stop the cause.
  1 Raw_Read_Error_Rate     0x000f   081   062   006    Pre-fail  Always   -           139054323
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always   -           701010635
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always   -           139054323
Now, these would be quite suspicious. Were this not a Seagate drive :) ISTR they come out of the box something like this...

Anyway, I concur that it's likely a flaky cable. But keep that "cache flush" issue in mind. You might try to trigger it (while tailing messages) by unmounting all partitions on that disk and then, after reading it all up in 'man hdparm':

# sync
# hdparm -f /dev/sdc
# hdparm -F /dev/sdc
# hdparm -y /dev/sdc

maybe:

# hdparm -Y /dev/sdc

BTW: you might want to configure smartd to not log all temp-changes. Something like:

DEFAULT -I 190 -I 194 -I 231 -W 1,40,45

or some such, see 'man 5 smartd.conf'... Not sure if you have to disable the "DEVICESCAN" feature for that.

HTH,
-dnh

[1] (trimmed)
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always   -           44876
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always   -           2714
177 Wear_Leveling_Count     0x0013   092   092   000    Pre-fail  Always   -           256
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always   -           0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always   -           0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always   -           13298779544

-- 
I suppose you drink decaffeinated coffee as well, you pervert.
                                                        -- Lionel