On Thu, 30 Jul 2020 15:30, Carlos E. R.
Hi
I had a hard disk crash last month and another one yesterday. I had to hit the hardware reset button to "recover".
First occurrence:
<0.3> 2020-06-22T13:01:38.170168+02:00 Telcontar kernel - - - [250558.404347] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-06-22T13:01:38.170183+02:00 Telcontar kernel - - - [250558.404350] ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
<0.3> 2020-06-22T13:01:38.170185+02:00 Telcontar kernel - - - [250558.404354] ata3.00: status: { DRDY }
<0.6> 2020-06-22T13:01:38.170186+02:00 Telcontar kernel - - - [250558.404357] ata3: hard resetting link
<snip>

Failed drive is connected via ata3
<snip>

The disk or the interface can be identified on next boot:
<0.6> 2020-06-22T13:08:00.540722+02:00 Telcontar kernel - - - [ 2.512438] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
<0.6> 2020-06-22T13:08:00.540723+02:00 Telcontar kernel - - - [ 2.515338] ata3.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
<0.6> 2020-06-22T13:08:00.540725+02:00 Telcontar kernel - - - [ 2.515343] ata3.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
<0.6> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.517902] ata3.00: configured for UDMA/133
<snip>
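<snip>

The next lines follow directly, but the identifier is different:
before: ST4000DM004-2CV104
after:  ST4000DM004-2CV1
(The shorter form is most likely just the model string cut down to the 16-character SCSI model field, not a different device.)
<snip>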
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518066] scsi 2:0:0:0: Direct-Access ATA ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [ 2.518210] sd 2:0:0:0: Attached scsi generic sg2 type 0
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518262] sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [ 2.518267] sd 2:0:0:0: [sdc] 4096-byte physical blocks
<0.6> 2020-06-22T13:08:00.540764+02:00 Telcontar kernel - - - [ 2.578542] sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12 sdc13 sdc14
<0.5> 2020-06-22T13:08:00.540765+02:00 Telcontar kernel - - - [ 2.578979] sd 2:0:0:0: [sdc] Attached SCSI disk
<snip>
<0.3> 2020-07-29T20:01:21.690682+02:00 Telcontar kernel - - - [320053.217028] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-07-29T20:01:21.690694+02:00 Telcontar kernel - - - [320053.217030] ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-07-29T20:01:21.690695+02:00 Telcontar kernel - - - [320053.217034] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6
<0.3> 2020-07-29T20:01:21.690696+02:00 Telcontar kernel - - - [320053.217034] res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
<0.3> 2020-07-29T20:01:21.690697+02:00 Telcontar kernel - - - [320053.217036] ata3.00: status: { DRDY }
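
Both incidents show the same pattern: a FLUSH CACHE EXT that times out on ata3. To check whether there were further occurrences, something like the following could pull every "failed command" line out of a log. This is only a rough sketch: the function name ata_failures is made up for illustration, and it assumes log lines begin with an ISO timestamp, optionally preceded by a "<0.3>"-style priority tag as in the excerpts above.

```shell
#!/bin/sh
# ata_failures (hypothetical helper): read syslog lines on stdin and print
# one "<date> <ata port> <failed command>" line per libata command failure.
# Assumes each line starts with an ISO timestamp, optionally preceded by a
# priority tag such as "<0.3> ", as in the log excerpts quoted above.
ata_failures() {
    sed -n -E 's/^(<[^>]*> )?([0-9]{4}-[0-9]{2}-[0-9]{2})T[^ ]* .* (ata[0-9]+)\.[0-9]+: failed command: (.*)$/\2 \3 \4/p'
}
```

It would be fed the log on standard input, e.g. `ata_failures < /var/log/messages` (the log file path is an assumption; use whatever file your syslog writes kernel messages to). Only the "failed command" lines are reported; the cmd/res/status detail lines are skipped.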
First, look at the links in the directory /dev/disk/by-path, especially the lines that contain ata3:

:> ls -l /dev/disk/by-path | grep ata3

That gives you an idea of which drive causes the problem. For more information about which drive that is, take the "sd?" drive ID and look at the contents of /dev/disk/by-id for that ID, e.g. for "sdd":

:> ls -l /dev/disk/by-id | grep sdd

That should give something similar to "ata-Seagate..........." with the product name ST4000DM004-2CV1 in it, which identifies exactly which drive it is.

Such a Seagate 4 TB drive currently costs around 100€; do yourself a favour and replace it. At the bare minimum, disconnect the drive from your computer ASAP.

Is your backup up to date? If not, stop snapper (its timer, via systemd) and make one NOW.
Power down.
Disconnect the drive, maybe also remove it at the same time.
Replace the drive with a new one.

Sorry, no better answer. Either the drive has a firmware bug, or it has a real hardware failure that the S.M.A.R.T. system cannot identify.

- Yamaban
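
The lookup described above (ata port, then sd? name, then by-id name) can also be done through sysfs, where the resolved path of each /sys/block/sd? entry normally contains the ata port as a directory component, such as ".../ata3/host2/...". A minimal sketch follows. The two function names are invented for illustration, and the optional second argument exists only so the functions can be tried against a test directory; on a real system the defaults /sys/block and /dev/disk/by-id apply.

```shell
#!/bin/sh
# disks_on_ata_port PORT [SYS_BLOCK_DIR]   (hypothetical helper)
# Print the sd? names whose resolved sysfs path contains /PORT/, e.g. /ata3/.
disks_on_ata_port() {
    port=$1
    sys_block=${2:-/sys/block}
    for link in "$sys_block"/sd*; do
        [ -e "$link" ] || continue
        # Example resolved path: /sys/devices/pci.../ata3/host2/.../block/sdc
        case "$(readlink -f "$link")" in
            */"$port"/*) basename "$link" ;;
        esac
    done
}

# ids_for_disk DISK [BY_ID_DIR]   (hypothetical helper)
# Print the by-id names that point at exactly DISK (partitions are skipped).
ids_for_disk() {
    disk=$1
    by_id=${2:-/dev/disk/by-id}
    for link in "$by_id"/*; do
        [ -L "$link" ] || continue
        # by-id entries are relative symlinks such as ../../sdc
        if [ "$(basename "$(readlink "$link")")" = "$disk" ]; then
            basename "$link"
        fi
    done
}
```

Running `disks_on_ata_port ata3` should name the affected disk, and `ids_for_disk` with that name should list its ata-... and wwn-... identifiers. Both functions only read symlinks, so nothing is written to the failing drive.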