[opensuse] Re: Hard disk crash

30 Jul 2020

      On Thu, 30 Jul 2020 15:30, Carlos E. R.  wrote:
...
Hi
I had a hard disk crash last month and another one yesterday.  I had to hit
the hardware reset button to "recover".
First occurrence:
...
<0.3> 2020-06-22T13:01:38.170168+02:00 Telcontar kernel - - - [250558.404347] 
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-06-22T13:01:38.170183+02:00 Telcontar kernel - - - [250558.404350] 
ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] 
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
<0.3> 2020-06-22T13:01:38.170184+02:00 Telcontar kernel - - - [250558.404353] 
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
<0.3> 2020-06-22T13:01:38.170185+02:00 Telcontar kernel - - - [250558.404354] 
ata3.00: status: { DRDY }
<0.6> 2020-06-22T13:01:38.170186+02:00 Telcontar kernel - - - [250558.404357] 
ata3: hard resetting link
<snip>
Failed drive is connected via ata3
<snip>
The disk or the interface can be identified on next boot:
<0.6> 2020-06-22T13:08:00.540722+02:00 Telcontar kernel - - - [    2.512438] 
ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
<0.6> 2020-06-22T13:08:00.540723+02:00 Telcontar kernel - - - [    2.515338] 
ata3.00: ATA-10: ST4000DM004-2CV104, 0001, max UDMA/133
<0.6> 2020-06-22T13:08:00.540725+02:00 Telcontar kernel - - - [    2.515343] 
ata3.00: 7814037168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
<0.6> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [    2.517902] 
ata3.00: configured for UDMA/133
<snip>
<snip>
The next lines follow directly, but the identifier is different
(the SCSI layer only has room for 16 characters of the model name):
before:  ST4000DM004-2CV104  after:  ST4000DM004-2CV1
<snip>
...
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [    2.518066] 
scsi 2:0:0:0: Direct-Access     ATA      ST4000DM004-2CV1 0001 PQ: 0 ANSI: 5
<0.5> 2020-06-22T13:08:00.540726+02:00 Telcontar kernel - - - [    2.518210] 
sd 2:0:0:0: Attached scsi generic sg2 type 0
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [    2.518262] 
sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
<0.5> 2020-06-22T13:08:00.540727+02:00 Telcontar kernel - - - [    2.518267] 
sd 2:0:0:0: [sdc] 4096-byte physical blocks
<0.6> 2020-06-22T13:08:00.540764+02:00 Telcontar kernel - - - [    2.578542] 
sdc: sdc1 sdc2 sdc3 sdc4 sdc5 sdc6 sdc7 sdc8 sdc9 sdc10 sdc11 sdc12 sdc13 
sdc14
<0.5> 2020-06-22T13:08:00.540765+02:00 Telcontar kernel - - - [    2.578979] 
sd 2:0:0:0: [sdc] Attached SCSI disk
<snip>
<0.3> 2020-07-29T20:01:21.690682+02:00 Telcontar kernel - - - [320053.217028] 
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
<0.3> 2020-07-29T20:01:21.690694+02:00 Telcontar kernel - - - [320053.217030] 
ata3.00: failed command: FLUSH CACHE EXT
<0.3> 2020-07-29T20:01:21.690695+02:00 Telcontar kernel - - - [320053.217034] 
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6
<0.3> 2020-07-29T20:01:21.690696+02:00 Telcontar kernel - - - [320053.217034] 
res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
<0.3> 2020-07-29T20:01:21.690697+02:00 Telcontar kernel - - - [320053.217036] 
ata3.00: status: { DRDY }
First:
look at the links in this directory: /dev/disk/by-path
especially the lines that contain ata3 (the links may spell it "ata-3",
so match both forms):
  :> ls -l /dev/disk/by-path | grep -E 'ata-?3'

That gives you an idea which drive is causing the problem.
For more information about that drive, take its "sd?" device name and
look it up in "/dev/disk/by-id", e.g. for "sdd":
  :> ls -l /dev/disk/by-id | grep sdd
This should give something similar to "ata-Seagate..........." with the
product name ST4000DM004-2CV1 in it, which identifies exactly which drive it is.
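
If the by-path names are hard to read, the same lookup can be sketched
through sysfs instead. This is only a rough sketch: it assumes that
/sys/block/sdX resolves to a path that passes through an "ataN" directory
(which is what libata systems normally show), and the function names and
the optional directory arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch only: find_disk_on_port and list_ids_for_disk are hypothetical helper
# names. The optional second argument lets you point them at a scratch
# directory instead of the live /sys and /dev trees.

# Print the sdX name whose resolved sysfs path passes through the given
# ATA port directory (assumed layout: .../ata3/host2/.../block/sdc).
find_disk_on_port() {
    port=$1
    sysblock=${2:-/sys/block}
    for entry in "$sysblock"/sd*; do
        [ -e "$entry" ] || continue
        case $(readlink -f "$entry") in
            */"$port"/*) basename "$entry" ;;
        esac
    done
}

# Print every by-id link name that points at the given whole disk
# (partition links such as sdc1 are not matched).
list_ids_for_disk() {
    disk=$1
    byid=${2:-/dev/disk/by-id}
    for link in "$byid"/*; do
        [ -L "$link" ] || continue
        if [ "$(basename "$(readlink "$link")")" = "$disk" ]; then
            basename "$link"
        fi
    done
}
```

With the functions loaded, something like
  :> list_ids_for_disk "$(find_disk_on_port ata3)"
should print the by-id names of whatever disk sits on ata3. In your log
that should be sdc, since the kernel reports "sd 2:0:0:0: [sdc]" right
after ata3.00 is configured.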

Such a Seagate 4 TB drive currently costs around 100€; do yourself a favour
and replace it.

But at the bare minimum, disconnect the drive from your computer ASAP.
Is your backup up to date?
If not, stop snapper (its timers, via systemd) and bring it up to date NOW.
Then power down and disconnect the drive, maybe also remove it at the same time.

Replace the drive with a new one.

Sorry, no better answer. Either the drive has a firmware bug,
or a real hardware failure that the S.M.A.R.T. system cannot identify.

  - Yamaban