Re: [opensuse-kernel] hard disk dying or kernel bug?

7 Oct 2014

      On 10/07/2014 08:34 AM, Ludwig Nussel wrote:
...
Hi,
Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have
the following messages in dmesg:
[20727.025399] sas: Enter sas_scsi_recover_host busy: 6 failed: 6
[20727.025407] sas: trying to find task 0xffff8800375776c0
[20727.025410] sas: sas_scsi_find_task: aborting task 0xffff8800375776c0
[20727.025418] isci 0000:05:00.0: isci_task_abort_task: dev =           (null) (STP/SATA <NULL>), task = ffff8800375776c0, old_request ==           (null)
[20727.025421] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff8800375776c0
[20727.025425] isci 0000:05:00.0: isci_task_abort_task: Done; dev =           (null), task = ffff8800375776c0 , old_request ==           (null)
[20727.025428] sas: sas_scsi_find_task: task 0xffff8800375776c0 is done
[20727.025430] sas: sas_eh_handle_sas_errors: task 0xffff8800375776c0 is done
[20727.025433] sas: trying to find task 0xffff880037577440
[20727.025435] sas: sas_scsi_find_task: aborting task 0xffff880037577440
[20727.025439] isci 0000:05:00.0: isci_task_abort_task: dev =           (null) (STP/SATA <NULL>), task = ffff880037577440, old_request ==           (null)
[20727.025442] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff880037577440
[20727.025446] isci 0000:05:00.0: isci_task_abort_task: Done; dev =           (null), task = ffff880037577440 , old_request ==           (null)
[20727.025448] sas: sas_scsi_find_task: task 0xffff880037577440 is done
[20727.025450] sas: sas_eh_handle_sas_errors: task 0xffff880037577440 is done
[20727.025452] sas: trying to find task 0xffff880037577940
[20727.025454] sas: sas_scsi_find_task: aborting task 0xffff880037577940
...
[20727.025528] sas: ata7: end_device-6:0: cmd error handler
[20727.025602] sas: ata7: end_device-6:0: dev error handler
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }
That's an NCQ failure, most likely a TLER (time-limited error
recovery) issue, i.e. the device encountered an error which its
internal error recovery couldn't fix up.
And yes, the ATA stack doesn't handle that one well ...
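
As a side note on reading the exception line: SAct is the SActive
bitmask, with one bit per outstanding NCQ tag. A minimal Python sketch
for decoding it follows; the function name decode_sact is hypothetical
and only serves this illustration.

```python
from typing import List


def decode_sact(sact: int) -> List[int]:
    """Return the NCQ tag numbers whose bits are set in an SActive mask."""
    # NCQ allows at most 32 outstanding commands, one bit per tag.
    return [tag for tag in range(32) if sact & (1 << tag)]


# SAct value taken from the "exception Emask 0x0 SAct 0x7e0" line.
tags = decode_sact(0x7E0)
print(tags)  # [5, 6, 7, 8, 9, 10]
```

Six outstanding tags is consistent with "busy: 6 failed: 6" from
sas_scsi_recover_host, and the failed command shown carries tag 5.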
...
...
[20727.025688] ata7.00: status: { DRDY }
[20727.025694] ata7: hard resetting link
[20727.219155] ata7.00: configured for UDMA/133
[20727.219164] ata7.00: device reported invalid CHS sector 0
[20727.219167] ata7.00: device reported invalid CHS sector 0
[20727.219170] ata7.00: device reported invalid CHS sector 0
[20727.219173] ata7.00: device reported invalid CHS sector 0
[20727.219176] ata7.00: device reported invalid CHS sector 0
[20727.219178] ata7.00: device reported invalid CHS sector 0
[20727.219212] ata7: EH complete
[20727.219262] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[40650.589614] perf interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Is the disk dying (smartctl output attached) or is it a kernel bug?
It's on its way out:

  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       46434016

A high raw read error rate _is_ worrying.

  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       17265656910

And a high seek error rate even more so.
Get a new disk.
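
If you want to pull these columns out of `smartctl -A` output in a
script, a rough Python sketch is below. It assumes the usual column
order (ID, name, flag, value, worst, thresh, type, updated,
when_failed, raw); the names SmartAttr, parse_attr and margin are made
up for this example. It only extracts the numbers and reports how far
the normalized value sits above the threshold; how to read the raw
column is vendor-specific, so the script makes no health verdict of
its own.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # vendor failure threshold for the normalized value
    raw: int     # vendor-specific raw counter


def parse_attr(line: str) -> SmartAttr:
    """Parse one attribute row, assuming the ten-column layout above."""
    f = line.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


def margin(attr: SmartAttr) -> int:
    """Distance of the normalized value above its threshold."""
    return attr.value - attr.thresh


rows = [
    "1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016",
    "7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910",
]
for row in rows:
    a = parse_attr(row)
    print(a.name, "value", a.value, "worst", a.worst, "thresh", a.thresh,
          "margin", margin(a), "raw", a.raw)
```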

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)