Hi
I had a hard disk crash last month and another one yesterday. I had to hit the hardware reset button to "recover".
First occurrence:
<3.6> 2020-06-22T12:40:25.270365+02:00 Telcontar smartd 1470 - - Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 63
<3.6> 2020-06-22T12:40:25.270606+02:00 Telcontar smartd 1470 - - Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 37
<3.6> 2020-06-22T12:40:25.270710+02:00 Telcontar smartd 1470 - - Device: /dev/sda [SAT], previous self-test completed without error
<3.6> 2020-06-22T12:40:25.642623+02:00 Telcontar smartd 1470 - - Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 64
<3.6> 2020-06-22T12:40:25.642829+02:00 Telcontar smartd 1470 - - Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 36
<3.6> 2020-06-22T12:40:25.642949+02:00 Telcontar smartd 1470 - - Device: /dev/sdb [SAT], previous self-test completed without error
<3.6> 2020-06-22T12:40:25.835124+02:00 Telcontar smartd 1470 - - Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 65
<3.6> 2020-06-22T12:40:25.835308+02:00 Telcontar smartd 1470 - - Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 35
<3.6> 2020-06-22T12:40:25.835408+02:00 Telcontar smartd 1470 - - Device: /dev/sdc [SAT], previous self-test completed without error
<3.6> 2020-06-22T12:40:26.184405+02:00 Telcontar smartd 1470 - - Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 65 to 63
<3.6> 2020-06-22T12:40:26.184628+02:00 Telcontar smartd 1470 - - Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 35 to 37
<3.6> 2020-06-22T12:40:26.184745+02:00 Telcontar smartd 1470 - - Device: /dev/sdd [SAT], previous self-test completed without error
<3.6> 2020-06-22T12:47:17.731257+02:00 Telcontar dbus-daemon 9743 - - [session uid=1000 pid=9743] Activating service name='org.gnome.keyring.SystemPrompter' requested by ':1.428' (uid=1000 pid=27919 comm="/usr/bin/pinentry-gnome3 ")
<3.6> 2020-06-22T12:47:17.773078+02:00 Telcontar dbus-daemon 9743 - - [session uid=1000 pid=9743] Successfully activated service 'org.gnome.keyring.SystemPrompter'
<9.6> 2020-06-22T12:50:01.855266+02:00 Telcontar CRON 28033 - - (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 1 1)
<3.6> 2020-06-22T13:00:00.267363+02:00 Telcontar systemd 1 - - Started Timeline of Snapper Snapshots.
<3.6> 2020-06-22T13:00:00.302345+02:00 Telcontar dbus-daemon 1350 - - [system] Activating service name='org.opensuse.Snapper' requested by ':1.3164' (uid=0 pid=28336 comm="/usr/lib/snapper/systemd-helper --timeline ") (using servicehelper)
<3.6> 2020-06-22T13:00:00.369997+02:00 Telcontar dbus-daemon 1350 - - [system] Successfully activated service 'org.opensuse.Snapper'
<9.6> 2020-06-22T13:00:01.902111+02:00 Telcontar CRON 28346 - - (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 1 1)
<10.6> 2020-06-22T13:00:01.973157+02:00 Telcontar cron 28342 - - pam_unix(crond:session): session opened for user cer by (uid=0)
<10.6> 2020-06-22T13:00:01.974030+02:00 Telcontar cron 28343 - - pam_unix(crond:session): session opened for user cer by (uid=0)
<9.6> 2020-06-22T13:00:01.974400+02:00 Telcontar CRON 28394 - - (cer) CMD (/home/cer/bin/dar_la_hora_en_cron hora)
<10.6> 2020-06-22T13:00:01.992853+02:00 Telcontar CRON 28343 - - pam_unix(crond:session): session closed for user cer
<10.6> 2020-06-22T13:00:02.209451+02:00 Telcontar CRON 28342 - - pam_unix(crond:session): session closed for user cer
<0.3> 2020-06-22T13:01:38.170168+02:00 Telcontar kernel - - - [250558.404347] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-06-22T13:01:38.170183+02:00 Telcontar kernel - - - [250558.404350] ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
<0.3> 2020-06-22T13:01:38.170185+02:00 Telcontar kernel - - - [250558.404354] ata3.00: status: { DRDY }
<0.6> 2020-06-22T13:01:38.170186+02:00 Telcontar kernel - - - [250558.404357] ata3: hard resetting link
<0.3> 2020-06-22T13:01:48.167284+02:00 Telcontar kernel - - - [250568.404421] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:01:48.167294+02:00 Telcontar kernel - - - [250568.404424] ata3: hard resetting link
<0.3> 2020-06-22T13:01:58.169610+02:00 Telcontar kernel - - - [250578.404658] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:01:58.169622+02:00 Telcontar kernel - - - [250578.404661] ata3: hard resetting link
<0.3> 2020-06-22T13:02:33.170216+02:00 Telcontar kernel - - - [250613.405141] ata3: softreset failed (1st FIS failed)
<0.4> 2020-06-22T13:02:33.170228+02:00 Telcontar kernel - - - [250613.405144] ata3: limiting SATA link speed to 3.0 Gbps
<0.6> 2020-06-22T13:02:33.170229+02:00 Telcontar kernel - - - [250613.405146] ata3: hard resetting link
<0.6> 2020-06-22T13:02:38.382166+02:00 Telcontar kernel - - - [250618.616572] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
<0.7> 2020-06-22T13:02:38.382179+02:00 Telcontar kernel - - - [250618.616576] ata3.00: link online but device misclassified
<0.4> 2020-06-22T13:02:43.422167+02:00 Telcontar kernel - - - [250623.656580] ata3.00: qc timeout (cmd 0xec)
<0.4> 2020-06-22T13:02:43.422179+02:00 Telcontar kernel - - - [250623.656587] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
<0.3> 2020-06-22T13:02:43.422180+02:00 Telcontar kernel - - - [250623.656589] ata3.00: revalidation failed (errno=-5)
<0.6> 2020-06-22T13:02:43.422181+02:00 Telcontar kernel - - - [250623.656592] ata3: hard resetting link
<0.3> 2020-06-22T13:02:53.422148+02:00 Telcontar kernel - - - [250633.657229] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:02:53.422161+02:00 Telcontar kernel - - - [250633.657232] ata3: hard resetting link
<0.3> 2020-06-22T13:03:03.422144+02:00 Telcontar kernel - - - [250643.657623] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:03:03.422154+02:00 Telcontar kernel - - - [250643.657625] ata3: hard resetting link
<0.3> 2020-06-22T13:03:38.422151+02:00 Telcontar kernel - - - [250678.657249] ata3: softreset failed (1st FIS failed)
<0.4> 2020-06-22T13:03:38.422163+02:00 Telcontar kernel - - - [250678.657253] ata3: limiting SATA link speed to 1.5 Gbps
<0.6> 2020-06-22T13:03:38.422165+02:00 Telcontar kernel - - - [250678.657254] ata3: hard resetting link
<0.6> 2020-06-22T13:03:43.642165+02:00 Telcontar kernel - - - [250683.876788] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
<0.7> 2020-06-22T13:03:43.642180+02:00 Telcontar kernel - - - [250683.876792] ata3.00: link online but device misclassified
<0.4> 2020-06-22T13:03:53.822166+02:00 Telcontar kernel - - - [250694.056819] ata3.00: qc timeout (cmd 0xec)
<0.4> 2020-06-22T13:03:53.822180+02:00 Telcontar kernel - - - [250694.056827] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
<0.3> 2020-06-22T13:03:53.822181+02:00 Telcontar kernel - - - [250694.056828] ata3.00: revalidation failed (errno=-5)
<0.6> 2020-06-22T13:03:53.822183+02:00 Telcontar kernel - - - [250694.056832] ata3: hard resetting link
<0.3> 2020-06-22T13:04:03.822166+02:00 Telcontar kernel - - - [250704.057810] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:04:03.822179+02:00 Telcontar kernel - - - [250704.057812] ata3: hard resetting link
<0.3> 2020-06-22T13:04:13.822163+02:00 Telcontar kernel - - - [250714.057147] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:04:13.822175+02:00 Telcontar kernel - - - [250714.057150] ata3: hard resetting link
<0.3> 2020-06-22T13:04:48.822154+02:00 Telcontar kernel - - - [250749.057907] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:04:48.822167+02:00 Telcontar kernel - - - [250749.057911] ata3: hard resetting link
<0.6> 2020-06-22T13:04:54.038172+02:00 Telcontar kernel - - - [250754.273047] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
<0.7> 2020-06-22T13:04:54.038185+02:00 Telcontar kernel - - - [250754.273051] ata3.00: link online but device misclassified
<3.6> 2020-06-22T13:05:21.032478+02:00 Telcontar systemd 1 - - Stopping Dovecot IMAP/POP3 email server...
<0.4> 2020-06-22T13:05:25.470153+02:00 Telcontar kernel - - - [250785.705150] ata3.00: qc timeout (cmd 0xec)
<0.4> 2020-06-22T13:05:25.470167+02:00 Telcontar kernel - - - [250785.705158] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
<0.3> 2020-06-22T13:05:25.470168+02:00 Telcontar kernel - - - [250785.705159] ata3.00: revalidation failed (errno=-5)
<0.4> 2020-06-22T13:05:25.470169+02:00 Telcontar kernel - - - [250785.705161] ata3.00: disabled
<0.6> 2020-06-22T13:05:25.470170+02:00 Telcontar kernel - - - [250785.705172] ata3: hard resetting link
<0.3> 2020-06-22T13:05:35.470168+02:00 Telcontar kernel - - - [250795.705351] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:05:35.470182+02:00 Telcontar kernel - - - [250795.705353] ata3: hard resetting link
<0.3> 2020-06-22T13:05:35.470168+02:00 Telcontar kernel - - - [250795.705351] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:05:35.470182+02:00 Telcontar kernel - - - [250795.705353] ata3: hard resetting link
<0.3> 2020-06-22T13:05:45.470146+02:00 Telcontar kernel - - - [250805.705811] ata3: softreset failed (1st FIS failed)
<0.6> 2020-06-22T13:05:45.470161+02:00 Telcontar kernel - - - [250805.705814] ata3: hard resetting link
<3.6> 2020-06-22T13:06:03.103885+02:00 Telcontar pulseaudio 9815 - - ICE default IO error handler doing an exit(), pid = 9815, errno = 11
<3.6> 2020-06-22T13:06:03.119549+02:00 Telcontar at-spi-bus-launcher 9862 - - XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
<3.6> 2020-06-22T13:06:03.119724+02:00 Telcontar at-spi-bus-launcher 9862 - - after 115009 requests (115009 known processed) with 0 events remaining.
<1.7> 2020-06-22T13:06:03.121490+02:00 Telcontar sddm-helper 9726 - - [PAM] Closing session
and reboot.
The disk or the interface can be identified on next boot:
<0.6> 2020-06-22T13:08:00.540722+02:00 Telcontar kernel - - - [ 2.512438] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
<0.6> 2020-06-22T13:08:00.540723+02:00 Telcontar kernel - - - [ 2.515338] ata3.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
<0.6> 2020-06-22T13:08:00.540725+02:00 Telcontar kernel - - - [ 2.515343] ata3.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
<0.6> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.517902] ata3.00: configured for UDMA/133
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518066] scsi 2:0:0:0: Direct-Access ATA ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518210] sd 2:0:0:0: Attached scsi generic sg2 type 0
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518262] sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518267] sd 2:0:0:0: [sdc] 4096-byte physical blocks
<0.6> 2020-06-22T13:08:00.540764+02:00 Telcontar kernel - - - [ 2.578542] sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12 sdc13 sdc14
<0.5> 2020-06-22T13:08:00.540765+02:00 Telcontar kernel - - - [ 2.578979] sd 2:0:0:0: [sdc] Attached SCSI disk
Yesterday event:
<1.6> 2020-07-29T19:53:38.848734+02:00 Telcontar org.freedesktop.thumbnails.Cache1 4259 - - Registered thumbailer gsf-office-thumbnailer -i %i -o %o -s %s
<1.6> 2020-07-29T19:53:38.849607+02:00 Telcontar org.freedesktop.thumbnails.Cache1 4259 - - Registered thumbailer /usr/bin/totem-video-thumbnailer -s %s %u %o
<1.6> 2020-07-29T19:53:38.849730+02:00 Telcontar org.freedesktop.thumbnails.Cache1 4259 - - Registered thumbailer /usr/bin/gdk-pixbuf-thumbnailer -s %s %u %o
<1.6> 2020-07-29T19:53:38.849990+02:00 Telcontar org.freedesktop.thumbnails.Cache1 4259 - - Registered thumbailer xreader-thumbnailer -s %s %u %o
<3.6> 2020-07-29T19:53:38.867963+02:00 Telcontar dbus-daemon 4259 - - [session uid=1000 pid=4259] Successfully activated service 'org.freedesktop.thumbnails.Cache1'
<9.6> 2020-07-29T20:00:01.248507+02:00 Telcontar CRON 32222 - - (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 1 1)
<3.6> 2020-07-29T20:00:01.254182+02:00 Telcontar systemd 1 - - Started Timeline of Snapper Snapshots.
<10.6> 2020-07-29T20:00:01.285419+02:00 Telcontar cron 32218 - - pam_unix(crond:session): session opened for user cer by (uid=0)
<3.6> 2020-07-29T20:00:01.302968+02:00 Telcontar dbus-daemon 1488 - - [system] Activating service name='org.opensuse.Snapper' requested by ':1.4309' (uid=0 pid=32230 comm="/usr/lib/snapper/systemd-helper --timeline ") (using servicehelper)
<10.6> 2020-07-29T20:00:01.334902+02:00 Telcontar cron 32219 - - pam_unix(crond:session): session opened for user cer by (uid=0)
<9.6> 2020-07-29T20:00:01.335326+02:00 Telcontar CRON 32275 - - (cer) CMD (/home/cer/bin/dar_la_hora_en_cron hora)
<10.6> 2020-07-29T20:00:01.349451+02:00 Telcontar CRON 32219 - - pam_unix(crond:session): session closed for user cer
<3.6> 2020-07-29T20:00:01.369929+02:00 Telcontar dbus-daemon 1488 - - [system] Successfully activated service 'org.opensuse.Snapper'
<10.6> 2020-07-29T20:00:01.555092+02:00 Telcontar CRON 32218 - - pam_unix(crond:session): session closed for user cer
<0.3> 2020-07-29T20:01:21.690682+02:00 Telcontar kernel - - - [320053.217028] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-07-29T20:01:21.690694+02:00 Telcontar kernel - - - [320053.217030] ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-07-29T20:01:21.690695+02:00 Telcontar kernel - - - [320053.217034] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6
<0.3> 2020-07-29T20:01:21.690696+02:00 Telcontar kernel - - - [320053.217034] res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
<0.3> 2020-07-29T20:01:21.690697+02:00 Telcontar kernel - - - [320053.217036] ata3.00: status: { DRDY }
<0.6> 2020-07-29T20:01:21.690698+02:00 Telcontar kernel - - - [320053.217039] ata3: hard resetting link
<0.3> 2020-07-29T20:01:31.690677+02:00 Telcontar kernel - - - [320063.217607] ata3: softreset failed (1st FIS failed)
<0.6> 2020-07-29T20:01:31.690690+02:00 Telcontar kernel - - - [320063.217609] ata3: hard resetting link
<0.3> 2020-07-29T20:01:41.690680+02:00 Telcontar kernel - - - [320073.217267] ata3: softreset failed (1st FIS failed)
<0.6> 2020-07-29T20:01:41.690692+02:00 Telcontar kernel - - - [320073.217270] ata3: hard resetting link
<0.3> 2020-07-29T20:02:16.688064+02:00 Telcontar kernel - - - [320108.218075] ata3: softreset failed (1st FIS failed)
<0.4> 2020-07-29T20:02:16.688075+02:00 Telcontar kernel - - - [320108.218078] ata3: limiting SATA link speed to 3.0 Gbps
<0.6> 2020-07-29T20:02:16.688076+02:00 Telcontar kernel - - - [320108.218080] ata3: hard resetting link
<0.6> 2020-07-29T20:02:21.853511+02:00 Telcontar kernel - - - [320113.381256] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
<0.7> 2020-07-29T20:02:21.853523+02:00 Telcontar kernel - - - [320113.381260] ata3.00: link online but device misclassified
<0.4> 2020-07-29T20:02:26.930684+02:00 Telcontar kernel - - - [320118.457272] ata3.00: qc timeout (cmd 0xec)
<0.4> 2020-07-29T20:02:26.930699+02:00 Telcontar kernel - - - [320118.457280] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
<0.3> 2020-07-29T20:02:26.930701+02:00 Telcontar kernel - - - [320118.457282] ata3.00: revalidation failed (errno=-5)
<0.6> 2020-07-29T20:02:26.930708+02:00 Telcontar kernel - - - [320118.457285] ata3: hard resetting link
<0.3> 2020-07-29T20:02:36.930677+02:00 Telcontar kernel - - - [320128.457434] ata3: softreset failed (1st FIS failed)
<0.6> 2020-07-29T20:02:36.930692+02:00 Telcontar kernel - - - [320128.457437] ata3: hard resetting link
<0.3> 2020-07-29T20:02:46.930682+02:00 Telcontar kernel - - - [320138.457591] ata3: softreset failed (1st FIS failed)
<0.6> 2020-07-29T20:02:46.930695+02:00 Telcontar kernel - - - [320138.457594] ata3: hard resetting link
<0.3> 2020-07-29T20:03:21.930679+02:00 Telcontar kernel - - - [320173.458400] ata3: softreset failed (1st FIS failed)
<0.4> 2020-07-29T20:03:21.930693+02:00 Telcontar kernel - - - [320173.458404] ata3: limiting SATA link speed to 1.5 Gbps
<0.6> 2020-07-29T20:03:21.930695+02:00 Telcontar kernel - - - [320173.458406] ata3: hard resetting link
<0.6> 2020-07-29T20:03:27.146680+02:00 Telcontar kernel - - - [320178.673492] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
<0.7> 2020-07-29T20:03:27.146694+02:00 Telcontar kernel - - - [320178.673496] ata3.00: link online but device misclassified
<0.4> 2020-07-29T20:03:37.330677+02:00 Telcontar kernel - - - [320188.857523] ata3.00: qc timeout (cmd 0xec)
<0.4> 2020-07-29T20:03:37.330690+02:00 Telcontar kernel - - - [320188.857532] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
<0.3> 2020-07-29T20:03:37.330691+02:00 Telcontar kernel - - - [320188.857533] ata3.00: revalidation failed (errno=-5)
<0.6> 2020-07-29T20:03:37.330692+02:00 Telcontar kernel - - - [320188.857538] ata3: hard resetting link
<0.3> 2020-07-29T20:03:47.328555+02:00 Telcontar kernel - - - [320198.858357] ata3: softreset failed (1st FIS failed)
<0.6> 2020-07-29T20:03:47.328569+02:00 Telcontar kernel - - - [320198.858359] ata3: hard resetting link
2020-07-29 20:05:16+02:00 - Booting the system now
================================================================================
Linux Telcontar 4.12.14-lp151.28.52-default #1 SMP Wed Jun 10 15:32:08 UTC 2020 (464fb5f) x86_64 x86_64 x86_64 GNU/Linux
<3.6> 2020-07-29T20:05:17.812270+02:00 Telcontar systemd 1 - - systemd 234 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid)
I did not bother to attempt a software halt this time; it is useless. It is the same disk. After the previous occurrence I ran a long SMART test on all the disks and found nothing.
What should I look at?
The machine is currently running Leap 15.1, with an AMD Ryzen 5 processor, one NVMe M.2 "disk" (new), and four mechanical hard disks (old) migrated this year from the previous Intel Core 2 machine, where I never had any issue like this. The four disks are the same brand and family, but of different ages.
If there is some missing info, just ask.
Both times, it happened one second after this line:
<3.6> 2020-07-29T20:00:01.254182+02:00 Telcontar systemd 1 - - Started Timeline of Snapper Snapshots.
What should I look at?
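A quick way to check that correlation is to look at the snapper timeline timer itself; only a sketch, assuming the stock openSUSE unit name snapper-timeline.timer and a persistent journal:

systemctl list-timers 'snapper*'           # last/next activation of the snapper timers
systemctl cat snapper-timeline.timer       # its schedule (normally OnCalendar=hourly)
journalctl -k --since "2020-07-29 20:00" --until "2020-07-29 20:10"   # kernel messages around the trigger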
--
Cheers
Carlos E. R. (from 15.1 x86_64 at Telcontar)
On Thu, 30 Jul 2020 15:30, Carlos E. R. <robin.listas@...> wrote:
Hi
I had a hard disk crash last month and another one yesterday. I had to hit the hardware reset button to "recover".
First occurrence:
<snip>
<0.3> 2020-06-22T13:01:38.170168+02:00 Telcontar kernel - - - [250558.404347] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen <0.3> 2020-06-22T13:01:38.170183+02:00 Telcontar kernel - - - [250558.404350] ata3.00: failed command: FLUSH CACHE EXT <0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22 <0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) <0.3> 2020-06-22T13:01:38.170185+02:00 Telcontar kernel - - - [250558.404354] ata3.00: status: { DRDY } <0.6> 2020-06-22T13:01:38.170186+02:00 Telcontar kernel - - - [250558.404357] ata3: hard resetting link
<snip> Failed drive is connected via ata3 <snip>
The disk or the interface can be identified on next boot:
<0.6> 2020-06-22T13:08:00.540722+02:00 Telcontar kernel - - - [ 2.512438] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) <0.6> 2020-06-22T13:08:00.540723+02:00 Telcontar kernel - - - [ 2.515338] ata3.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133 <0.6> 2020-06-22T13:08:00.540725+02:00 Telcontar kernel - - - [ 2.515343] ata3.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA <0.6> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.517902] ata3.00: configured for UDMA/133
<snip>
the next lines follow directly, but the identifier is different:
  before: ST4000DM004-2CV104
  after:  ST4000DM004-2CV1
<snip>
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518066] scsi 2:0:0:0: Direct-Access ATA ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5 <0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518210] sd 2:0:0:0: Attached scsi generic sg2 type 0 <0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518262] sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB) <0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518267] sd 2:0:0:0: [sdc] 4096-byte physical blocks
<0.6> 2020-06-22T13:08:00.540764+02:00 Telcontar kernel - - - [ 2.578542] sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12 sdc13 sdc14 <0.5> 2020-06-22T13:08:00.540765+02:00 Telcontar kernel - - - [ 2.578979] sd 2:0:0:0: [sdc] Attached SCSI disk
<snip>
<0.3> 2020-07-29T20:01:21.690682+02:00 Telcontar kernel - - - [320053.217028] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen <0.3> 2020-07-29T20:01:21.690694+02:00 Telcontar kernel - - - [320053.217030] ata3.00: failed command: FLUSH CACHE EXT <0.3> 2020-07-29T20:01:21.690695+02:00 Telcontar kernel - - - [320053.217034] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6 <0.3> 2020-07-29T20:01:21.690696+02:00 Telcontar kernel - - - [320053.217034] res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) <0.3> 2020-07-29T20:01:21.690697+02:00 Telcontar kernel - - - [320053.217036] ata3.00: status: { DRDY }
First, look at the links in this directory, /dev/disk/by-path, especially the lines that contain ata3:

:> ls -l /dev/disk/by-path | grep ata3

That gives you an idea which drive causes the problem. For more info on which drive that is, take the "sd?" drive id and look at the content of /dev/disk/by-id for that id, e.g. "sdd":

:> ls -l /dev/disk/by-id | grep sdd

That should give something similar to "ata-Seagate..........." with the product name ST4000DM004-2CV1 in it, to identify exactly which drive it is.
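A compact variant of the same lookup, as a sketch; the by-path link can be spelled ...-ata3, ...-ata-3 or ...-ata-3.0 depending on the udev version, so the glob may need adjusting:

dev=$(readlink -f /dev/disk/by-path/*-ata-3)    # resolves to the kernel name, e.g. /dev/sdc
ls -l /dev/disk/by-id | grep "${dev##*/}$"      # shows the matching ata-<model>_<serial> link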
Such a Seagate 4 TB drive is around 100 € at the moment; do yourself a favour and replace it.
But at the bare minimum, disconnect the drive from your computer ASAP. Is your backup up to date? If not, stop snapper (the timer, via systemd) and make one NOW. Power down. Disconnect the drive, and maybe also remove it at the same time.
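If the timeline snapshots turn out to be the trigger, stopping them is just a matter of the snapper timers; a sketch, using the unit names as shipped by openSUSE:

systemctl stop snapper-timeline.timer snapper-cleanup.timer
systemctl disable snapper-timeline.timer     # only if it should stay off across reboots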
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hardware failure that the S.M.A.R.T. system cannot identify.
- Yamaban
On 30/07/2020 19.20, Yamaban wrote:
On Thu, 30 Jul 2020 15:30, Carlos E. R. <robin.listas@...> wrote:
Hi
I had a hard disk crash last month and another one yesterday. I had to hit the hardware reset button to "recover".
First occurrence:
<snip> > <0.3> 2020-06-22T13:01:38.170168+02:00 Telcontar kernel - - - > [250558.404347] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action > 0x6 frozen > <0.3> 2020-06-22T13:01:38.170183+02:00 Telcontar kernel - - - > [250558.404350] ata3.00: failed command: FLUSH CACHE EXT > <0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - > [250558.404353] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22 > <0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - > [250558.404353] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 > (timeout) > <0.3> 2020-06-22T13:01:38.170185+02:00 Telcontar kernel - - - > [250558.404354] ata3.00: status: { DRDY } > <0.6> 2020-06-22T13:01:38.170186+02:00 Telcontar kernel - - - > [250558.404357] ata3: hard resetting link <snip> Failed drive is connected via ata3 <snip> > The disk or the interface can be identified on next boot: > > <0.6> 2020-06-22T13:08:00.540722+02:00 Telcontar kernel - - - [ > 2.512438] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) > <0.6> 2020-06-22T13:08:00.540723+02:00 Telcontar kernel - - - [ > 2.515338] ata3.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133 > <0.6> 2020-06-22T13:08:00.540725+02:00 Telcontar kernel - - - [ > 2.515343] ata3.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth > 31/32), AA > <0.6> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ > 2.517902] ata3.00: configured for UDMA/133 <snip> the next lines follow directly. but the identifier is different: before: ST4000DM004-2CV104 after: ST4000DM004-2CV1 <snip> > <0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ > 2.518066] scsi 2:0:0:0: Direct-Access ATA ST4000DM004-2CV1 > 0001 PQ: 0 ANSI: 5 > <0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ > 2.518210] sd 2:0:0:0: Attached scsi generic sg2 type 0 > <0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ > 2.518262] sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 > TB/3.64 TiB) > <0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ > 2.518267] sd 2:0:0:0: [sdc] 4096-byte physical blocks > > <0.6> 2020-06-22T13:08:00.540764+02:00 Telcontar kernel - - - [ > 2.578542] sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 > sdc11 sdc12 sdc13 sdc14 > <0.5> 2020-06-22T13:08:00.540765+02:00 Telcontar kernel - - - [ > 2.578979] sd 2:0:0:0: [sdc] Attached SCSI disk <snip> > <0.3> 2020-07-29T20:01:21.690682+02:00 Telcontar kernel - - - > [320053.217028] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action > 0x6 frozen > <0.3> 2020-07-29T20:01:21.690694+02:00 Telcontar kernel - - - > [320053.217030] ata3.00: failed command: FLUSH CACHE EXT > <0.3> 2020-07-29T20:01:21.690695+02:00 Telcontar kernel - - - > [320053.217034] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6 > <0.3> 2020-07-29T20:01:21.690696+02:00 Telcontar kernel - - - > [320053.217034] res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 > (timeout) > <0.3> 2020-07-29T20:01:21.690697+02:00 Telcontar kernel - - - > [320053.217036] ata3.00: status: { DRDY }
first: look at the links in this dir: /dev/disk/by-path esp the lines that contain ata3: :> ls -l /dev/disk/by-path |grep ata3
Nothing.
cer@Telcontar:~> ls -l /dev/disk/by-path | grep ata3
cer@Telcontar:~>
It has to be ata-3
cer@Telcontar:~> ls -l /dev/disk/by-path | grep ata-3
lrwxrwxrwx 1 root root 9 Jul 29 20:04 pci-0000:03:00.1-ata-3 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3-part10 -> ../../sdc10
lrwxrwxrwx 1 root root 11 Jul 29 20:05 pci-0000:03:00.1-ata-3-part11 -> ../../sdc11
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3-part12 -> ../../sdc12
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3-part13 -> ../../sdc13
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3-part14 -> ../../sdc14
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part3 -> ../../sdc3
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part4 -> ../../sdc4
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part5 -> ../../sdc5
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part6 -> ../../sdc6
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part7 -> ../../sdc7
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3-part8 -> ../../sdc8
lrwxrwxrwx 1 root root 10 Jul 29 20:05 pci-0000:03:00.1-ata-3-part9 -> ../../sdc9
lrwxrwxrwx 1 root root 9 Jul 29 20:04 pci-0000:03:00.1-ata-3.0 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part10 -> ../../sdc10
lrwxrwxrwx 1 root root 11 Jul 29 20:05 pci-0000:03:00.1-ata-3.0-part11 -> ../../sdc11
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part12 -> ../../sdc12
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part13 -> ../../sdc13
lrwxrwxrwx 1 root root 11 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part14 -> ../../sdc14
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part3 -> ../../sdc3
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part4 -> ../../sdc4
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part5 -> ../../sdc5
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part6 -> ../../sdc6
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part7 -> ../../sdc7
lrwxrwxrwx 1 root root 10 Jul 29 20:04 pci-0000:03:00.1-ata-3.0-part8 -> ../../sdc8
lrwxrwxrwx 1 root root 10 Jul 29 20:05 pci-0000:03:00.1-ata-3.0-part9 -> ../../sdc9
cer@Telcontar:~>
that gives you an idea which drive causes that problem.
The one I thought, sdc.
for more info what drive that is, use the "sd?" drive-id and look at the content of "/dev/disk/by-id" with that id e.g. "sdd" :> ls -l /dev/disk/by-id |grep sdd should give something similar to "ata-Seagate..........." with the product-name ST4000DM004-2CV1 in it, to exactly identify whitch drive.
Such a Seagate 4TB drive is atm around 100€, do yourself a favour and replace it.
SMART testing comes out with shining colours.
But at bare minium diconnect the drive for your computer ASAP. Backup up-to-date? If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
There is no btrfs partition on that disk, so snapper should have nothing to do with it.
I rather suspect some incompatibility between this new motherboard and this disk, which triggers once a month, or a bad cable.
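One data point for the cable theory: SMART attribute 199 (UDMA_CRC_Error_Count) normally increments on SATA link CRC errors, which are most often caused by a bad cable or connector. It is easy to check; a zero (as in the smartctl output further down) argues against the cable, though it does not completely rule it out:

smartctl -A /dev/sdc | grep -i udma_crc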
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
- Yamaban
I'll prepare a backup.
Hello,
On Thu, 30 Jul 2020, Yamaban wrote:
On Thu, 30 Jul 2020 15:30, Carlos E. R. <robin.listas@...> wrote:
[..]
<snip> the next lines follow directly. but the identifier is different: before: ST4000DM004-2CV104 after: ST4000DM004-2CV1
The first is via scsi, the second via ata, compare:
# lsscsi -g
[9:0:0:0] disk ATA WDC WD40EFRX-68N 82.0 /dev/sdh /dev/sg7
# hdparm -i /dev/sdh | grep Model
Model=WDC WD40EFRX-68N32N0 [..]
# sginfo -i /dev/sg7 |grep Prod
Product: WDC WD40EFRX-68N
As you see, sginfo (the SCSI INQUIRY data) truncates the model to 16 chars, while ATA allows at least 22 chars (that's the longest I have).
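For completeness, the same difference can be seen without lsscsi/sginfo; a sketch using the example device name from above:

cat /sys/block/sdh/device/model           # the SCSI-level name, truncated to 16 characters
smartctl -i /dev/sdh | grep -i model      # the full ATA model string (and model family)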
[..]
first: look at the links in this dir: /dev/disk/by-path esp the lines that contain ata3: :> ls -l /dev/disk/by-path |grep ata3
that gives you an idea which drive causes that problem. for more info what drive that is, use the "sd?" drive-id and look at the content of "/dev/disk/by-id" with that id e.g. "sdd" :> ls -l /dev/disk/by-id |grep sdd should give something similar to "ata-Seagate..........." with the product-name ST4000DM004-2CV1 in it, to exactly identify whitch drive.
It's easier to use my script ataid_to_drive.sh[1] to translate the 'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not yet found any other tool that does that (not even 'lsblk -O -a' ;)
If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT' (that's the easiest to grep for in the kernel-sources) and the drive barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not looked at the specs or differences...
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
HTH, -dnh
==== ataid_to_drive.sh ====
#!/bin/bash
oIFS="$IFS"
IFS=$'\n'
PATH="/sbin:/usr/sbin:$PATH"
# collect the PCI IDs of all ATA/IDE controllers (-E: extended regexp for the alternation)
CTRLS=( $( lspci | grep -E 'ATA|IDE') )
IFS="$oIFS"
for arg; do
    # accept "ata3", "3", "ata3.00", "3.00", ...
    if test -z "${arg/ata*}"; then
        arg="${arg/ata}"
    fi
    if test -z "${arg/*.*}"; then
        ata="${arg%.*}"
        subid="$(printf "%i" "${arg##*.}")"
    else
        ata="$arg"
    fi
    echo "ata${ata}${subid/*/.$(printf "%02i" $subid)} is:"
    for ctrl in ${CTRLS[@]%% *}; do
        # find the SCSI host behind this controller whose unique_id matches the ata port number
        idpath="/sys/bus/pci/devices/*${ctrl}/*/*/*/unique_id"
        grep "^${ata}$" $idpath 2>/dev/null
        host=$(grep "^${ata}$" $idpath 2>/dev/null | \
            sed -E 's@.*/host([0-9A-Fa-f]+)/.*@\1@')
        if test -n "$host"; then
            dmesg | grep "] s[dr] $host:0:$subid.*Attached"
        fi
    done
done
====
On 30/07/2020 21.06, David Haller wrote:
Hello,
On Thu, 30 Jul 2020, Yamaban wrote:
On Thu, 30 Jul 2020 15:30, Carlos E. R. <robin.listas@...> wrote:
[..]
<snip> the next lines follow directly. but the identifier is different: before: ST4000DM004-2CV104 after: ST4000DM004-2CV1
The first is via scsi, the second via ata, compare:
# lsscsi -g [9:0:0:0] disk ATA WDC WD40EFRX-68N 82.0 /dev/sdh /dev/sg7 # hdparm -i /dev/sdh | grep Model Model=WDC WD40EFRX-68N32N0 [..] # sginfo -i /dev/sg7 |grep Prod Product: WDC WD40EFRX-68N
As you see, sginfo (the SCSI IDENTIFY) truncates the model to 16 chars, ata allows at least 22 chars (that's the longest I have).
[..]
first: look at the links in this dir: /dev/disk/by-path esp the lines that contain ata3: :> ls -l /dev/disk/by-path |grep ata3
that gives you an idea which drive causes that problem. for more info what drive that is, use the "sd?" drive-id and look at the content of "/dev/disk/by-id" with that id e.g. "sdd" :> ls -l /dev/disk/by-id |grep sdd should give something similar to "ata-Seagate..........." with the product-name ST4000DM004-2CV1 in it, to exactly identify whitch drive.
It's easier to use my script ataid_to_drive.sh[1] to translate the 'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not yet found any other tool that does that (not even 'lsblk -O -a' ;)
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
Telcontar:~ #
It has no instructions, so I don't know what to feed it :-?
If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT' (that's the easiest to grep for in the kernel-sources) and the drive barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not looked at the specs or differences...
Done (with 'mc'):
/usr/src/linux/include/trace/events/libata.h 2813/11773 23% ata_opcode_name(ATA_CMD_FLUSH_EXT), \
/usr/src/linux/include/linux/ata.h 8458/33975 24% ATA_CMD_FLUSH_EXT = 0xEA,
/usr/src/linux/drivers/scsi/hisi_sas/hisi_sas_main.c 3469/86785 3% case ATA_CMD_FLUSH_EXT:
/usr/src/linux/drivers/ide/ide-pm.c 5098/7455 68% cmd.tf.command = ATA_CMD_FLUSH_EXT;
/usr/src/linux/drivers/ide/ide-disk.c 12562/19789 63% cmd->tf.command = ATA_CMD_FLUSH_EXT;
/usr/src/linux/drivers/ide/ide-disk.c 15685/19789 79% cmd.tf.command = ATA_CMD_FLUSH_EXT;
/usr/src/linux/drivers/ata/libata-scsi.c 42816/126K 33% tf->command = ATA_CMD_FLUSH_EXT;
/usr/src/linux/drivers/ata/libata-eh.c 6561/109K 5% { .commands = CMDS(ATA_CMD_FLUSH, ATA_CMD_FLUSH_EXT), .timeouts = ata_eh_flush_timeouts },
/usr/src/linux/drivers/ata/libata-eh.c 62785/109K 56% { ATA_CMD_FLUSH_EXT, "FLUSH CACHE EXT" },
/usr/src/linux/drivers/ata/libata-eh.c 92983/109K 83% if (qc->dev != dev || (qc->tf.command != ATA_CMD_FLUSH_EXT && qc->tf.command != ATA_CMD_FLUSH)) return 0;
Anything interesting there?
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
Sure. I intend to trigger a long test before going to sleep tonight, though. Or perhaps download the Seagate test tool and boot it, running the test from the CPU instead of from the disk firmware, because I think that also exercises the cable.
Telcontar:~ # smartctl -A /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 062 006 Pre-fail Always - 139054323
3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 908
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 088 060 045 Pre-fail Always - 701010635
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 10395 (128 204 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 906
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 061 040 Old_age Always - 33 (Min/Max 25/34)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 314
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1138
194 Temperature_Celsius 0x0022 033 040 000 Old_age Always - 33 (0 16 0 0 0)
195 Hardware_ECC_Recovered 0x001a 081 064 000 Old_age Always - 139054323
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 10360h+29m+01.373s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 16641668645
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 29138379095
Telcontar:~ #
Hello,
On Thu, 30 Jul 2020, Carlos E. R. wrote:
On 30/07/2020 21.06, David Haller wrote:
On Thu, 30 Jul 2020, Yamaban wrote:
[..]
first: look at the links in this dir: /dev/disk/by-path esp the lines that contain ata3: :> ls -l /dev/disk/by-path |grep ata3
that gives you an idea which drive causes that problem. for more info what drive that is, use the "sd?" drive-id and look at the content of "/dev/disk/by-id" with that id e.g. "sdd" :> ls -l /dev/disk/by-id |grep sdd should give something similar to "ata-Seagate..........." with the product-name ST4000DM004-2CV1 in it, to exactly identify whitch drive.
It's easier to use my script ataid_to_drive.sh[1] to translate the 'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not yet found any other tool that does that (not even 'lsblk -O -a' ;)
Telcontar:~ # ataid_to_drive.sh ata3.00 ata3.00 is: Telcontar:~ #
It has no instructions, so I don't know what to feed it :-?
You did it correctly; it should have shown something like this:
$ ataid_to_drive.sh ata3.00
ata3.00 is:
[ 1.717050] sd 2:0:0:0: [sdc] Attached SCSI disk
[ 3.301179] sd 2:0:0:0: Attached scsi generic sg2 type 0
Hm. Seems like the kernel changed the sysfs contents (and thus also udev's by-path links? I only have e.g. by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)...
Ah, yes. This should work:
====
#!/bin/bash
oIFS="$IFS"
IFS=$'\n'
PATH="/sbin:/usr/sbin:$PATH"
CTRLS=( $( lspci | grep -E 'ATA|IDE') )
IFS="$oIFS"
for arg; do
    if test -z "${arg/ata*}"; then
        arg="${arg/ata}"
    fi
    if test -z "${arg/*.*}"; then
        ata="${arg%.*}"
        subid="$(printf "%i" "${arg##*.}")"
    else
        ata="$arg"
    fi
    echo "ata${ata}${subid/*/.$(printf "%02i" $subid)} is:"
    for ctrl in ${CTRLS[@]%% *}; do
        CPATH=$(echo "/sys/bus/pci/devices/"*"${ctrl}/")
        # newer sysfs layout: .../<pci-id>/ataN/hostM sits directly below the controller
        if test -d "${CPATH}/ata${ata}/"; then
            host=$(echo "${CPATH}/ata${ata}/"host*)
        fi
        if test -d "${host}"; then
            host="${host##*/}"
            host="${host/host}"
        else
            # older layout: find the SCSI host whose unique_id matches the port number
            idpath="${CPATH}/*/*/*/unique_id"
            host=$(grep "^${ata}$" $idpath 2>/dev/null | \
                sed -E 's@.*/host([0-9A-Fa-f]+)/.*@\1@')
        fi
        if test -n "$host"; then
            dmesg | grep "] s[dr] $host:0:$subid.*Attached"
        fi
    done
done
====
Anyway, this "manual" mode might work:
$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
$ dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"
If you have stuff with multiple LUNs (i.e. ata3.1 etc.), you need to adjust to ${HOSTID}:0:${SUBID} or something...
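If the boot messages have already scrolled out of the dmesg ring buffer, a similar mapping can be read straight out of sysfs; only a sketch, since the exact nesting varies between kernel versions (which is precisely what broke the first version of the script):

ls /sys/class/ata_port/ata3/device/host*/target*/*/block/    # should print the sd name, e.g. sdc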
If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT' (that's the easiest to grep for in the kernel-sources) and the drive barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not looked at the specs or differences...
Done (with 'mc'):
[..]
Anything interesting there?
That's what I looked at (and a bit more), but I can't say anything about the difference between a regular and an _EXT flush. I mentioned it just as a good starting point for anyone interested in digging deeper ;) And yeah, mc is wonderful for digs in the kernel (and other) sources :))
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
Sure. I intend to trigger a long test before going to sleep tonight, though. Or perhaps, download the seagate test tool and boot it, running the test from CPU instead of disk firmware because I think it tests the cable.
Telcontar:~ # smartctl -A /dev/sdc smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
That's a bit dusty, not that it matters that much as long as you update your drivedb.h (use the included 'update-smart-drivedb' ;) Current seems to be 'smartmontools release 7.1 dated 2019-12-30'... But I digress once again.
Let me reorder output to better fit commenting ;)
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 908
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 10395 (128 204 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 906
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 061 040 Old_age Always - 33 (Min/Max 25/34)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1138
194 Temperature_Celsius 0x0022 033 040 000 Old_age Always - 33 (0 16 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 10360h+29m+01.373s
This all looks very good; you keep it running rather long on average for a non-24/7 system ;) But who am I to say anything...
$ calc 'printf("Carlos: %.1fh\nDavid: %.1fh\n", 10395/906, 44876/2714)'
Carlos: ~11.5h
David: ~16.5h
(that's my SSD with the OSes ;)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 16641668645
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 29138379095
Hm. Is that in 512 B or 4K LBAs? Either way, that 4T disk does not get much use ;) My 128G SSD has ~6.2 TBW, and I tried to spare it for a long time[1].
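If the raw counters are in units of the 512-byte logical sector (which matches the fdisk output further down in the thread), that works out to roughly 8.5 TB written and 15 TB read; in calc notation:

calc 'printf("written: %.2f TB, read: %.2f TB\n", 16641668645*512/10^12, 29138379095*512/10^12)'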
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 314
That's an indication of unclean power losses... Nothing to worry about, but you might want to stop whatever causes it.
1 Raw_Read_Error_Rate 0x000f 081 062 006 Pre-fail Always - 139054323
7 Seek_Error_Rate 0x000f 088 060 045 Pre-fail Always - 701010635
195 Hardware_ECC_Recovered 0x001a 081 064 000 Old_age Always - 139054323
Now, these would be quite suspicious. Were this not a Seagate drive :) ISTR they come out of the box something like this...
Anyway, I concur that it's likely a flaky cable. But keep that "cache flush" issue in mind. You might try to trigger it with (while tailing messages) unmounting all partitions on that disk and then, after reading it all up in 'man hdparm':
# sync
# hdparm -f /dev/sdc
# hdparm -F /dev/sdc
# hdparm -y /dev/sdc
maybe:
# hdparm -Y /dev/sdc
BTW: you might want to configure smartd to not log all temp-changes.
Something like:
DEFAULT -I 190 -I 194 -I 231 -W 1,40,45
or some such, see 'man 5 smartd.conf'... Not sure if you have to disable the "DEVICESCAN" feature for that.
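For what it's worth, directives can also be appended to the DEVICESCAN line itself, so it would not have to be disabled; a sketch, to be checked against 'man 5 smartd.conf':

DEVICESCAN -a -I 190 -I 194 -W 2,40,45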
HTH, -dnh
[1] (trimmed)
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 44876
12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 2714
177 Wear_Leveling_Count 0x0013 092 092 000 Pre-fail Always - 256
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 13298779544
On Friday, 2020-07-31 at 05:26 +0200, David Haller wrote:
Hello,
On Thu, 30 Jul 2020, Carlos E. R. wrote:
On 30/07/2020 21.06, David Haller wrote:
On Thu, 30 Jul 2020, Yamaban wrote:
[..]
first: look at the links in this dir: /dev/disk/by-path esp the lines that contain ata3: :> ls -l /dev/disk/by-path |grep ata3
that gives you an idea which drive causes that problem. for more info what drive that is, use the "sd?" drive-id and look at the content of "/dev/disk/by-id" with that id e.g. "sdd" :> ls -l /dev/disk/by-id |grep sdd should give something similar to "ata-Seagate..........." with the product-name ST4000DM004-2CV1 in it, to exactly identify whitch drive.
It's easier to use my script ataid_to_drive.sh[1] to translate the 'ataX[.YY]' ids to /dev/sd* or SCSI HOST:CTRL:ID:LUN ids, and I've not yet found any other tool that does that (not even 'lsblk -O -a' ;)
Telcontar:~ # ataid_to_drive.sh ata3.00 ata3.00 is: Telcontar:~ #
It has no instructions, so I don't know what to feed it :-?
You did correct, it should have shown something like this:
$ ataid_to_drive.sh ata3.00 ata3.00 is: [ 1.717050] sd 2:0:0:0: [sdc] Attached SCSI disk [ 3.301179] sd 2:0:0:0: Attached scsi generic sg2 type 0
Hm. Seems like the kernel changed the sysfs contents (and thus also udev's by-path links? I only have e.g. by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)...
Ah, yes. This should work:
Let's see:
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
[ 2.499081] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2.545493] sd 2:0:0:0: [sdc] Attached SCSI disk
Telcontar:~ #
Yes, sdc :-)
But it is SATA, not SCSI
Anyway, this "manual" mode might work:
$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@') $ dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"
Let's see.
Hum.
Telcontar:~ # HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
sed: -e expression #1, char 1: unknown command: `@'
Telcontar:~ #
I'm not fluent with sed, so I can't see what the error is.
Telcontar:~ # ls -d /sys/bus/pci/devices/*/ata3/host*
/sys/bus/pci/devices/0000:03:00.1/ata3/host2
Telcontar:~ #
If you have stuff with multiple LUNs (i.e. ata3.1 etc.), you need to adjust to ${HOSTID}:0:${SUBID} or something...
If not stop snapper (-timer via systemd), and do so NOW. Power down. Disconnect Drive, maybe also remove it at the same time
I wonder if snapper somehow does an explicit 'ATA_CMD_FLUSH_EXT' (that's the easiest to grep for in the kernel-sources) and the drive barfs on that command and only understands 'ATA_CMD_FLUSH'... I've not looked at the specs or differences...
Done (with 'mc'):
[..]
Anything interesting there?
That's what I looked at (and a bit), but can't say anything about the difference between regular and _EXT flush. I mentioned that just for anyone interested do go digging deeper as a good starting point ;) And yeah, mc is wonderful for digs in the kernel- (and other) sources :))
Ah, so perhaps dig into the sources of snapper. Mmm.
Replace the drive with a new one.
Sorry, no better answer. Either the drive has a firmware bug, or a real hw failure the s.m.a.r.t system can not identify.
Apropos: how about the 'smartctl -A' output?
Sure. I intend to trigger a long test before going to sleep tonight, though. Or perhaps, download the seagate test tool and boot it, running the test from CPU instead of disk firmware because I think it tests the cable.
Telcontar:~ # smartctl -A /dev/sdc smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
That's a bit dusty, not that it matters that much as long as you update your drivedb.h (use the included 'update-smart-drivedb' ;) Current seems to be 'smartmontools release 7.1 dated 2019-12-30'... But I digress once again.
Well, I update it when some new disk is not handled, otherwise, no :-)
Telcontar:~ # l /etc/smart_drivedb.h
-rw-r--r-- 1 root root 91 Mar 17 2019 /etc/smart_drivedb.h
I did it recently on another machine that has problems:
Isengard:~ # l /etc/smart_drivedb.h
-rw-r--r-- 1 root root 91 May 26 09:00 /etc/smart_drivedb.h
Let me reorder output to better fit commenting ;)
=== START OF READ SMART DATA SECTION === SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 908 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 10395 (128 204 0) 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 906 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 067 061 040 Old_age Always - 33 (Min/Max 25/34) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1138 194 Temperature_Celsius 0x0022 033 040 000 Old_age Always - 33 (0 16 0 0 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 10360h+29m+01.373s
This all looks very good, you keep it running rather long on average for a non 24/7 system ;) But who's I to say anything...
Yes, it is often on for the day. But where do you see that?
$ calc 'printf("Carlos: %.1fh\nDavid: %.1fh\n", 10395/906, 44876/2714)' Carlos: ~11.5h David: ~16.5h
Ah, total hours / power cycles.
Yes, on the previous machine hibernation could crash, so I did not hibernate at midday. On this machine there is no such problem. So, reading from the new /dev/nvme0 disk:
Power Cycles: 242
Power On Hours: 188
0.776 hours per cycle? Can't be.
Let's see another rotating disk I replaced rather recently:
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5468 (68 100 0)
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 565
9.67 hours per cycle. Yes, it is getting shorter.
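The same ratio can be pulled straight from smartctl; a sketch, assuming the usual ATA attribute layout where the raw value is the 10th column:

smartctl -A /dev/sdc | awk '$1==9 {h=$10} $1==12 {c=$10} END {if (c) printf "%.1f hours per power cycle\n", h/c}'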
(that's my SSD with the OSes ;)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 16641668645 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 29138379095
Hm. Is that in 512B or 4K LBAs? either way, that 4T disk does not get much use ;) My 128G SSD has ~6.2 TBW, and I tried to spare it for a long time[1].
Telcontar:~ # fdisk -l /dev/sdc
Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000DM004-2CV1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Basically, it has the home partitions and some data storage.
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 314
That's an indication of unclean powerlosses... Nothing to worry, but you might want to stop the cause.
1 Raw_Read_Error_Rate 0x000f 081 062 006 Pre-fail Always - 139054323 7 Seek_Error_Rate 0x000f 088 060 045 Pre-fail Always - 701010635 195 Hardware_ECC_Recovered 0x001a 081 064 000 Old_age Always - 139054323
Now, these would be quite suspicious. Were this not a Seagate drive :) ISTR they come out of the box something like this...
Yes, they have large error rates compared to other brands.
I have a new disk of that size and brand on another computer (home server):
Same sector size:
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
Model Family: Seagate Barracuda 3.5
Device Model: ST4000DM004-2CV104
Firmware Version: 0001
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 398 (229 88 0)
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 1
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 10543574266
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1416201
1 Raw_Read_Error_Rate 0x000f 078 070 006 Pre-fail Always - 56542912
7 Seek_Error_Rate 0x000f 068 060 045 Pre-fail Always - 5720154
195 Hardware_ECC_Recovered 0x001a 078 070 000 Old_age Always - 56542912
The "problem" disk is:
Model Family: Seagate BarraCuda 3.5
Device Model: ST4000DM004-2CV104
Firmware Version: 0001
What, same model after some years?
The thing is, this model is listed as being the new SMR type. But it isn't:
Isengard:~ # hdparm -I /dev/sde | grep -i TRIM
Isengard:~ #
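Another check that is just as inconclusive for drive-managed SMR: the kernel only reports a disk as zoned when it is host-aware or host-managed, so a DM-SMR drive still shows "none" here (the attribute exists since roughly kernel 4.10):

cat /sys/block/sde/queue/zoned     # "none" covers both conventional and drive-managed SMR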
Anyway, I concur that it's likely a flaky cable. But keep that "cache flush" issue in mind. You might try to trigger it with
How? Ah, below.
(while tailing messages) unmounting all partitions on that disk and then, after reading it all up in 'man hdparm':
# sync # hdparm -f /dev/sdc # hdparm -F /dev/sdc # hdparm -y /dev/sdc maybe: # hdparm -Y /dev/sdc
Ah, this would trigger it?
BTW: you might want to configure smartd to not log all temp-changes.
Well, it is noise mostly, I agree, but I did not bother.
Something like:
DEFAULT -I 190 -I 194 -I 231 -W 1,40,45
or some such, see 'man 5 smartd.conf'... Not sure if you have to disable the "DEVICESCAN" feature for that.
When I remember, I do disable it, yes. I don't always remember.
HTH, -dnh
[1] (trimmed) 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 44876 12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 2714 177 Wear_Leveling_Count 0x0013 092 092 000 Pre-fail Always - 256 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0 187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 13298779544
I must remember to shop for cables today. I looked yesterday and the shop had only one type.
--
Cheers,
Carlos E. R. (from openSUSE 15.1 x86_64 at Telcontar)
On Friday, 2020-07-31 at 12:48 +0200, Carlos E. R. wrote:
...
Telcontar:~ # smartctl -A /dev/sdc smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.12.14-lp151.28.52-default] (SUSE RPM)
That's a bit dusty, not that it matters that much as long as you update your drivedb.h (use the included 'update-smart-drivedb' ;) Current seems to be 'smartmontools release 7.1 dated 2019-12-30'... But I digress once again.
Well, I update it when some new disk is not handled, otherwise, no :-)
Telcontar:~ # l /etc/smart_drivedb.h -rw-r--r-- 1 root root 91 Mar 17 2019 /etc/smart_drivedb.h
I did it recently on another machine that has problems:
Isengard:~ # l /etc/smart_drivedb.h -rw-r--r-- 1 root root 91 May 26 09:00 /etc/smart_drivedb.h
Huh, wrong file.
The long test finished just now (no errors), and I tried to update:
Telcontar:~ # update-smart-drivedb
/usr/share/smartmontools/drivedb.h is already up to date
Telcontar:~ #
So it was already current.
Telcontar:~ # l /usr/share/smartmontools/drivedb.h
-rw-r--r-- 1 root root 218421 Jun 29 12:33 /usr/share/smartmontools/drivedb.h
Telcontar:~ #
--
Cheers,
Carlos E. R. (from openSUSE 15.1 x86_64 at Telcontar)
Hello,
On Fri, 31 Jul 2020, Carlos E. R. wrote:
On Friday, 2020-07-31 at 05:26 +0200, David Haller wrote:
[..]
Hm. Seems like the kernel changed the sysfs contents (and thus also udev's by-path links? I only have e.g. by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)...
Ah, yes. This should work:
Let's see:
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
[ 2.499081] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2.545493] sd 2:0:0:0: [sdc] Attached SCSI disk
Telcontar:~ #
Yes, sdc :-)
But it is SATA, not SCSI
But the SCSI subsystem ;)
Anyway, this "manual" mode might work:
$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
$ dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"
Let's see.
Hum.
Telcontar:~ # HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
sed: -e expression #1, char 1: unknown command: `@'
Telcontar:~ #
I'm not fluent with sed, so I can't see what the error is.
Telcontar:~ # ls -d /sys/bus/pci/devices/*/ata3/host*
/sys/bus/pci/devices/0000:03:00.1/ata3/host2
Telcontar:~ #
*grr* There's a 's' gone missing in copy & pasting...
HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed 's@.*/host@@')
                                                        ^ that bugger
Basically it gets the number(s) at the end of the 'host', e.g. a '2' if you have e.g. /sys/bus/pci/devices/0000:03:00.1/ata3/host2/. And then you grep dmesg for e.g. '] s[dr] 2:0:0.*Attached' which catches the above 2 lines from the script.
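For reference, a minimal sketch of what such an ataid_to_drive.sh helper could look like, built only from the ls/sed/grep pieces shown above (the real script may well differ):

#!/bin/sh
# ataid_to_drive.sh -- sketch reconstructed from the commands above
# usage: ataid_to_drive.sh ata3.00
ID="${1%%.*}"     # "ata3.00" -> "ata3"
HOSTID=$(ls -d /sys/bus/pci/devices/*/"${ID}"/host* | sed 's@.*/host@@')
echo "$1 is:"
dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"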
That's what I looked at (and a bit more), but I can't say anything about the difference between the regular and the _EXT flush. I mentioned it just as a good starting point for anyone interested in digging deeper ;) And yeah, mc is wonderful for digs in the kernel (and other) sources :))
Ah, so perhaps dig into the sources of snapper. Mmm.
That would be a possibility ;)
[..]
  1 Raw_Read_Error_Rate     0x000f   081   062   006    Pre-fail  Always       -       139054323
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       701010635
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       139054323
Now, these would be quite suspicious. Were this not a Seagate drive :) ISTR they come out of the box something like this...
Yes, they have large error rates compared to other brands.
Or rather, the numbers are somewhat random / bogus, but way too high anyway on Seagates.
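For what it's worth, the interpretation usually offered for these Seagate attributes (a common assumption, not something from vendor documentation) is that the 48-bit raw value packs an error count in the upper 16 bits and an operation count in the lower 32 bits, so a huge-looking number can still mean zero actual errors:

$ v=139054323; echo "errors: $((v >> 32))  operations: $((v & 0xffffffff))"
errors: 0  operations: 139054323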
I have a new disk of that size and brand on another computer (home server):
Same sector size:
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
[..]
  1 Raw_Read_Error_Rate     0x000f   078   070   006    Pre-fail  Always       -       56542912
  7 Seek_Error_Rate         0x000f   068   060   045    Pre-fail  Always       -       5720154
195 Hardware_ECC_Recovered  0x001a   078   070   000    Old_age   Always       -       56542912
same spiel ;)
The "problem" disk is:
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
What, same model after some years?
The thing is, this model is listed as being the new SMR type. But it isn't:
Isengard:~ # hdparm -I /dev/sde | grep -i TRIM
Isengard:~ #
Didn't I read something of "drive-managed SMR"?
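For what it's worth, on reasonably recent kernels there is one more thing to peek at, though it only exposes host-aware/host-managed SMR; a drive-managed SMR disk is transparent to the host and still reports "none", so a negative result here proves little:

# possible values: none, host-aware, host-managed
cat /sys/block/sde/queue/zoned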
Anyway, I concur that it's likely a flaky cable. But keep that "cache flush" issue in mind. You might try to trigger it with
How? Ah, below.
(while tailing messages) unmounting all partitions on that disk and then, after reading it all up in 'man hdparm':
# sync
# hdparm -f /dev/sdc
# hdparm -F /dev/sdc
# hdparm -y /dev/sdc
maybe:
# hdparm -Y /dev/sdc
Ah, this would trigger it?
I think -f/-F send a flush(_ext?), and -y/-Y tells it to shut off also with a flush(_ext?) I guess...
I routinely use 'hdparm -y' to shut down external disks before switching off the power of the case. Might be worth a try to do that while watching dmesg/messages. Unless you cut power, you can wake the disk up again, e.g. by using 'mount', hdparm -I or whatever. ;)
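Put together, the suggested experiment might look something like this sketch (as root, with /dev/sdc standing in for the suspect disk and all of its partitions unmounted first):

# in one terminal, watch the kernel log:
journalctl -kf            # or: tail -f /var/log/messages

# in another terminal:
sync
hdparm -f /dev/sdc        # flush the kernel's buffer cache for the device
hdparm -F /dev/sdc        # ask the drive to flush its on-board write cache
hdparm -y /dev/sdc        # put the drive into standby (spin down)
# to wake it up again later: mount a partition, or run e.g. hdparm -I /dev/sdc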
BTW: you might want to configure smartd to not log all temp-changes.
Well, it is noise mostly, I agree, but I did not bother.
Thing is for relevant stuff not to get overlooked in all the noise. Improving the SNR in logfiles is good ;)
Something like:
DEFAULT -I 190 -I 194 -I 231 -W 1,40,45
or some such, see 'man 5 smartd.conf'... Not sure if you have to disable the "DEVICESCAN" feature for that.
When I remember, I do disable it, yes. I don't always remember.
Do you add disks so often? I have some "placeholders" for external disks:
/dev/sdh [regular internal disk]
/dev/sdi -d sat -n standby,q -a -I 194 -I 190 -I 231 -I 9 \
         -T permissive -d removable
/dev/sdj [..same...]
/dev/sdk [..same...]
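For the record, a minimal /etc/smartd.conf along those lines might look like the sketch below (device names are placeholders, and whether DEVICESCAN has to go is exactly the open question above, so it is simply commented out here):

# /etc/smartd.conf (sketch)
DEFAULT -a -I 190 -I 194 -I 231 -W 1,40,45
/dev/sda -d sat
/dev/sdb -d sat
# placeholder for an occasionally-attached external disk:
/dev/sdi -d sat -n standby,q -a -I 194 -I 190 -I 231 -I 9 -T permissive -d removable
# DEVICESCAN             # see 'man 5 smartd.conf' before re-enabling this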
HTH, -dnh
On 01/08/2020 07.23, David Haller wrote:
Hello,
On Fri, 31 Jul 2020, Carlos E. R. wrote:
On Friday, 2020-07-31 at 05:26 +0200, David Haller wrote:
[..]
Hm. Seems like the kernel changed the sysfs contents (and thus also udev's by-path links? I only have e.g. by-path/pci-0000:00:11.0-scsi-0:0:0:0-part1)...
Ah, yes. This should work:
Let's see:
Telcontar:~ # ataid_to_drive.sh ata3.00
ata3.00 is:
[ 2.499081] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2.545493] sd 2:0:0:0: [sdc] Attached SCSI disk
Telcontar:~ #
Yes, sdc :-)
But it is SATA, not SCSI
But the SCSI subsystem ;)
Anyway, this "manual" mode might work:
$ HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
$ dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"
Let's see.
Hum.
Telcontar:~ # HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed '@.*/host@@')
sed: -e expression #1, char 1: unknown command: `@'
Telcontar:~ #
I'm not fluent with sed, so I can't see what the error is.
Telcontar:~ # ls -d /sys/bus/pci/devices/*/ata3/host*
/sys/bus/pci/devices/0000:03:00.1/ata3/host2
Telcontar:~ #
*grr* There's a 's' gone missing in copy & pasting...
HOSTID=$(ls -d /sys/bus/pci/devices/*/ata3/host* | sed 's@.*/host@@')
Then:
Telcontar:~ # dmesg | grep "] s[dr] ${HOSTID}:0:0.*Attached"
[ 2.499081] sd 2:0:0:0: Attached scsi generic sg2 type 0
[ 2.545493] sd 2:0:0:0: [sdc] Attached SCSI disk
Telcontar:~ #
^ that bugger
Basically it gets the number(s) at the end of the 'host', e.g. a '2' if you have e.g. /sys/bus/pci/devices/0000:03:00.1/ata3/host2/. And then you grep dmesg for e.g. '] s[dr] 2:0:0.*Attached' which catches the above 2 lines from the script.
Ok!
That's what I looked at (and a bit more), but I can't say anything about the difference between the regular and the _EXT flush. I mentioned it just as a good starting point for anyone interested in digging deeper ;) And yeah, mc is wonderful for digs in the kernel (and other) sources :))
Ah, so perhaps dig into the sources of snapper. Mmm.
That would be a possibility ;)
[..]
  1 Raw_Read_Error_Rate     0x000f   081   062   006    Pre-fail  Always       -       139054323
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       701010635
195 Hardware_ECC_Recovered  0x001a   081   064   000    Old_age   Always       -       139054323
Now, these would be quite suspicious. Were this not a Seagate drive :) ISTR they come out of the box something like this...
Yes, they have large error rates compared to other brands.
Or rather, the numbers are somewhat random / bogus, but way too high anyway on Seagates.
I have a new disk of that size and brand on another computer (home server):
Same sector size:
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
[..]
  1 Raw_Read_Error_Rate     0x000f   078   070   006    Pre-fail  Always       -       56542912
  7 Seek_Error_Rate         0x000f   068   060   045    Pre-fail  Always       -       5720154
195 Hardware_ECC_Recovered  0x001a   078   070   000    Old_age   Always       -       56542912
same spiel ;)
The "problem" disk is:
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
What, same model after some years?
The thing is, this model is listed as being the new SMR type. But it isn't:
Isengard:~ # hdparm -I /dev/sde | grep -i TRIM
Isengard:~ #
Didn't I read something of "drive-managed SMR"?
Dunno.
Google:
«Drive-managed SMR, where the drive manages all write commands from the host, allows a plug-and-play implementation, compatible with any hardware and software. However, the background 'housekeeping' tasks that the drive must perform result in highly unpredictable performance, unfit for enterprise workloads.»
Making Host Managed SMR Work for You – Dropbox's ...
https://blog.westerndigital.com/host-managed-smr-dropbox/
And that's the other thing: the speed is consistent and "fast". On the server, I overwrote it completely and it did about 130 MB/s; for several hours it went at 170 or so. People say that SMR drops to 40 MB/s on large writes, and mine was a hell of a large write.
The "server" has a low power cpu (Intel(R) Pentium(R) CPU N3710 @ 1.60GHz), and the hard disks are external via usb3. And the writing was filling the entire disk with almost random data with a concoction of mine that does it fast - all of that might account for lowering the speed when I was not looking.
Anyway, I concur that it's likely a flaky cable. But keep that "cache flush" issue in mind. You might try to trigger it with
How? Ah, below.
(while tailing messages) unmounting all partitions on that disk and then, after reading it all up in 'man hdparm':
# sync
# hdparm -f /dev/sdc
# hdparm -F /dev/sdc
# hdparm -y /dev/sdc
maybe:
# hdparm -Y /dev/sdc
Ah, this would trigger it?
I think -f/-F send a flush(_ext?), and -y/-Y tells it to shut off also with a flush(_ext?) I guess...
I routinely use 'hdparm -y' to shut down external disks before switching off the power of the case. Might be worth a try to do that while watching dmesg/messages. Unless you cut power, you can wake the disk up again, e.g. by using 'mount', hdparm -I or whatever. ;)
I'll put that in the to-do list :-)
Meanwhile, I replaced the 4 SATA cables. I think they were being forced a bit by the metal cover of the box.
BTW: you might want to configure smartd to not log all temp-changes.
Well, it is noise mostly, I agree, but I did not bother.
Thing is for relevant stuff not to get overlooked in all the noise. Improving the SNR in logfiles is good ;)
That's true.
Something like:
DEFAULT -I 190 -I 194 -I 231 -W 1,40,45
or some such, see 'man 5 smartd.conf'... Not sure if you have to disable the "DEVICESCAN" feature for that.
When I remember, I do disable it, yes. I don't always remember.
Do you add disks so often? I have some "placeholders" for external disks:
This machine doesn't have external SATA connectors. My previous machine did. :-(
So my temporary external disks are tested manually.
/dev/sdh [regular internal disk]
/dev/sdi -d sat -n standby,q -a -I 194 -I 190 -I 231 -I 9 \
         -T permissive -d removable
/dev/sdj [..same...]
/dev/sdk [..same...]
It is simply that I have so many things in my head that I forget to do things.
On 2020-07-30 15:30, Carlos E. R. wrote:
I did a long SMART test the previous time on all disks, found nothing.
What should I look at?
cabling?
Have a nice day, Berny
On 30/07/2020 20.08, Bernhard Voelker wrote:
On 2020-07-30 15:30, Carlos E. R. wrote:
I did a long SMART test the previous time on all disks, found nothing.
What should I look at?
cabling?
That's my suspicion, yes. I intend to buy a cable or two and replace it.
On 7/30/20 8:30 AM, Carlos E. R. wrote:
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Boot Error: Emask 0X0 SAct 0X0 SErr 0X0 action 0X6 frozen https://unix.stackexchange.com/questions/466376/boot-error-emask-0x0-sact-0x...
On 30/07/2020 23.56, David C. Rankin wrote:
On 7/30/20 8:30 AM, Carlos E. R. wrote:
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Boot Error: Emask 0X0 SAct 0X0 SErr 0X0 action 0X6 frozen https://unix.stackexchange.com/questions/466376/boot-error-emask-0x0-sact-0x...
It says:
«Your SATA controller gives error messages back for the read and write commands.
The error is timeout, which says that the controller can't communicate with your hard disks, more exactly it doesn't get answer from them.
The most likely cause of the problem is contact problem with your SATA cables. Check with other hard disk, other cable, plug the cable into different slots and so on.»
Ok... that matches.
On 31/07/2020 00.25, Carlos E.R. wrote:
On 30/07/2020 23.56, David C. Rankin wrote:
On 7/30/20 8:30 AM, Carlos E. R. wrote:
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Boot Error: Emask 0X0 SAct 0X0 SErr 0X0 action 0X6 frozen https://unix.stackexchange.com/questions/466376/boot-error-emask-0x0-sact-0x...
It says:
«Your SATA controller gives error messages back for the read and write commands.
The error is timeout, which says that the controller can't communicate with your hard disks, more exactly it doesn't get answer from them.
The most likely cause of the problem is contact problem with your SATA cables. Check with other hard disk, other cable, plug the cable into different slots and so on.»
Ok... that matches.
I just replaced the 4 SATA cables with angled versions. The straight connectors were bumping hard into the metal back cover of the box. Perhaps that was it.
I remembered that I have two identical disks:
Telcontar:~ # fdisk -l /dev/sdc
Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000DM004-2CV1
Units: sectors of 1 * 512 = 512 bytes
...
Telcontar:~ # fdisk -l /dev/sdd
Disk /dev/sdd: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000DM004-2CV1
Units: sectors of 1 * 512 = 512 bytes
....
Only sdc has the problem, so it cannot be firmware or software. At least, it shouldn't be.