[opensuse-kernel] hard disk dying or kernel bug?
Hi,

Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have the following messages in dmesg:

[20727.025399] sas: Enter sas_scsi_recover_host busy: 6 failed: 6
[20727.025407] sas: trying to find task 0xffff8800375776c0
[20727.025410] sas: sas_scsi_find_task: aborting task 0xffff8800375776c0
[20727.025418] isci 0000:05:00.0: isci_task_abort_task: dev = (null) (STP/SATA <NULL>), task = ffff8800375776c0, old_request == (null)
[20727.025421] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff8800375776c0
[20727.025425] isci 0000:05:00.0: isci_task_abort_task: Done; dev = (null), task = ffff8800375776c0 , old_request == (null)
[20727.025428] sas: sas_scsi_find_task: task 0xffff8800375776c0 is done
[20727.025430] sas: sas_eh_handle_sas_errors: task 0xffff8800375776c0 is done
[20727.025433] sas: trying to find task 0xffff880037577440
[20727.025435] sas: sas_scsi_find_task: aborting task 0xffff880037577440
[20727.025439] isci 0000:05:00.0: isci_task_abort_task: dev = (null) (STP/SATA <NULL>), task = ffff880037577440, old_request == (null)
[20727.025442] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff880037577440
[20727.025446] isci 0000:05:00.0: isci_task_abort_task: Done; dev = (null), task = ffff880037577440 , old_request == (null)
[20727.025448] sas: sas_scsi_find_task: task 0xffff880037577440 is done
[20727.025450] sas: sas_eh_handle_sas_errors: task 0xffff880037577440 is done
[20727.025452] sas: trying to find task 0xffff880037577940
[20727.025454] sas: sas_scsi_find_task: aborting task 0xffff880037577940
...
[20727.025528] sas: ata7: end_device-6:0: cmd error handler
[20727.025602] sas: ata7: end_device-6:0: dev error handler
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }
...
[20727.025688] ata7.00: status: { DRDY }
[20727.025694] ata7: hard resetting link
[20727.219155] ata7.00: configured for UDMA/133
[20727.219164] ata7.00: device reported invalid CHS sector 0
[20727.219167] ata7.00: device reported invalid CHS sector 0
[20727.219170] ata7.00: device reported invalid CHS sector 0
[20727.219173] ata7.00: device reported invalid CHS sector 0
[20727.219176] ata7.00: device reported invalid CHS sector 0
[20727.219178] ata7.00: device reported invalid CHS sector 0
[20727.219212] ata7: EH complete
[20727.219262] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[40650.589614] perf interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

Is the disk dying (smartctl output attached) or is it a kernel bug?

cu
Ludwig

--
 (o_   Ludwig Nussel
 //\
 V_/_  http://www.suse.de/
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
On 10/07/2014 08:34 AM, Ludwig Nussel wrote:
[...]
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }

That's an NCQ failure, most likely a TLER issue (time-limited error recovery), i.e. the device encountered an error which its internal error recovery couldn't fix up.
And yes, the ATA stack doesn't handle that one well ...

[...]
Is the disk dying (smartctl output attached) or is it a kernel bug?

It's on its way out:

1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016

A high raw read error rate _is_ worrying.

7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910

And a high seek error rate even more so. Get a new disk.

Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
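Side note on the NCQ failure discussed above: in libata error messages, SAct is the SActive register, a bitmask with one bit per outstanding NCQ tag. A minimal Python sketch (the helper name `active_ncq_tags` is made up for illustration) decodes the value 0x7e0 from the log:

```python
from __future__ import annotations


def active_ncq_tags(sact: int) -> list[int]:
    """Return the NCQ tag numbers whose bits are set in an SActive bitmask."""
    # NCQ allows up to 32 outstanding commands, one bit per tag.
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    tags = active_ncq_tags(0x7E0)  # value from the "SAct 0x7e0" log line
    print(tags)       # tags that were in flight when the port froze
    print(len(tags))  # number of outstanding commands
```

For 0x7e0 this gives tags 5 through 10, i.e. six commands in flight, which lines up with "busy: 6 failed: 6" in the log and with the first failed command being reported as tag 5.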
Hannes Reinecke schrieb:
On 10/07/2014 08:34 AM, Ludwig Nussel wrote:
[...] Is the disk dying (smartctl output attached) or is it a kernel bug?
It's on its way out:
1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016
A high raw read error rate _is_ worrying.
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910
And a high seek error rate even more so. Get a new disk.
Will do. Thanks! :-)

cu
Ludwig
On Tuesday 07 October 2014 08.40:38 Hannes Reinecke wrote:
[...]
Ludwig,

I don't know if your current disk can be saved, but if you have others, there's a Seagate firmware update (the ISO is PXE- and USB-bootable too).

Device Model: ST2000DM001-1CH164
Serial Number: Z1F3H9EJ
LU WWN Device Id: 5 000c50 0643f2559
Firmware Version: CC27
User Capacity: 2'000'398'934'016 bytes [2.00 TB]

I've seen Barracuda-series drives die several times just because their firmware was not up to date :-)

--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
openSUSE Member & Board, fsfe fellowship
GPG KEY : D5C9B751C4653227
irc: tigerfoot
On 2014-10-07 08:40, Hannes Reinecke wrote:
On 10/07/2014 08:34 AM, Ludwig Nussel wrote:
It's on its way out:
1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016
A high raw read error rate _is_ worrying.
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910

Not necessarily, as it is a Seagate. Look at one of mine, still young (exact same model, newer firmware (CC27)):

Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Firmware Version: CC27
User Capacity: 2,000,398,934,016 bytes [2.00 TB]

1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 227236888
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 394
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2434
7 Seek_Error_Rate 0x000f 074 060 030 Pre-fail Always - 8646212908

Another disk of the same model family:

Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-1CH166
Firmware Version: CC27
User Capacity: 3,000,592,982,016 bytes [3.00 TB]

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 175510856
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 384
7 Seek_Error_Rate 0x000f 064 060 030 Pre-fail Always - 21489215161
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2236

An older disk:

Model Family: Seagate Barracuda 7200.12
Device Model: ST3500418AS
Firmware Version: CC37
User Capacity: 500,107,862,016 bytes [500 GB]

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 199173627
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 094 094 020 Old_age Always - 6294
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 030 Pre-fail Always - 103648914
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17190

Modern Seagates report absurdly high error-rate figures; that is normal for them. The disk may be bad, but not based on those numbers.

However, his disk has 15838 hours of use. The figures from smartctl are not conclusive, as no test has been run recently:

SMART Self-test log structure revision number 1
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        0                -

So what I would do is first run the short test, then the long one, then check the figures again.

Later, I would schedule the short test to run automatically and periodically via the smartd daemon, for all disks.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
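To make the point about the numbers concrete: in the `smartctl -A` attribute table the columns are ID, name, flag, normalized VALUE, WORST, THRESH, type, updated, WHEN_FAILED and raw value, and an attribute only counts as failing when its normalized value drops to the threshold or below. A small Python sketch (the class and function names are illustrative, not part of smartmontools) applied to the two rows from Ludwig's disk:

```python
from __future__ import annotations

from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # vendor failure threshold
    raw: int     # vendor-specific raw counter


def parse_attr(line: str) -> SmartAttr:
    """Parse one row of the `smartctl -A` attribute table."""
    f = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


def below_threshold(attr: SmartAttr) -> bool:
    """True if the normalized value has reached the failure threshold."""
    # A threshold of 0 means the attribute can never be flagged as failing.
    return attr.thresh > 0 and attr.value <= attr.thresh


ROWS = [
    "1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016",
    "7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910",
]

if __name__ == "__main__":
    for row in ROWS:
        a = parse_attr(row)
        print(a.name, a.value, a.worst, a.thresh, below_threshold(a))
```

Neither row is near its threshold (112 against 6, and 73 against 30), so by the drive's own criteria these two attributes are not failing. That says nothing about timeouts or cabling, which is why running the self-tests is still worthwhile.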
On October 7, 2014 2:40:38 AM EDT, Hannes Reinecke <hare@suse.de> wrote:
[...]
It's on its way out:
1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016
A high raw read error rate _is_ worrying.
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910
And a high seek error rate even more so. Get a new disk.
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.

What actual errors does the drive report (smartctl --log)?

If the drive was the source of the problem it will have logged it internally.

Greg
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On 2014-10-07 13:12, Greg Freemyer wrote:
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
Notice that the syntax is not that simple:

Telcontar:~ # smartctl --health /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Telcontar:~ # smartctl --log /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=======> INVALID ARGUMENT TO -l: /dev/sda
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory[,g|s], xerror[,N][,error], xselftest[,N][,selftest], background, sasphy[,reset], sataphy[,reset], scttemp[sts,hist], scttempint,N[,p], scterc[,N,M], devstat[,N], ssd, gplog,N[,RANGE], smartlog,N[,RANGE] <=======

Use smartctl -h to get a usage summary

Telcontar:~ #

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On Tue, Oct 7, 2014 at 9:44 AM, Carlos E. R. <carlos.e.r@opensuse.org> wrote:
On 2014-10-07 13:12, Greg Freemyer wrote:
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
Notice that the syntax is not that simple:
Here you go from one of my computers:
sudo /usr/sbin/smartctl --log=error /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged
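The error message quoted earlier in the thread lists the arguments that `-l`/`--log` accepts. As a sketch only, a small Python wrapper (the helper `smartctl_log_cmd` is hypothetical, and it covers just a subset of the simple argument forms, not the parameterised ones such as `xerror,N`) can build the command line and reject a missing or unknown log type before smartctl is invoked:

```python
from __future__ import annotations

# A subset of the log types from the "VALID ARGUMENTS ARE" list in the
# smartctl 6.2 output quoted in this thread (plain forms only).
SIMPLE_LOG_TYPES = {
    "error", "selftest", "selective", "directory", "xerror", "xselftest",
    "background", "sasphy", "sataphy", "ssd",
}


def smartctl_log_cmd(log_type: str, device: str) -> list[str]:
    """Build an argv list for `smartctl --log=TYPE DEVICE`."""
    if log_type not in SIMPLE_LOG_TYPES:
        raise ValueError(f"unsupported log type: {log_type!r}")
    return ["smartctl", f"--log={log_type}", device]


if __name__ == "__main__":
    # Only prints the commands; running them needs root and a real device.
    print(" ".join(smartctl_log_cmd("error", "/dev/sda")))
    print(" ".join(smartctl_log_cmd("selftest", "/dev/sda")))
```

Passing the device name where the log type belongs, as in the failed `smartctl --log /dev/sda` call, raises `ValueError` here instead of reaching smartctl.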
Greg Freemyer schrieb:
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.
After having backed up my data I ran the long self test. No errors logged. So I'll give the cable a try. Thanks for the hint!

Also, we don't have smartd enabled by default. I guess we should though.

cu
Ludwig
On October 8, 2014 4:23:02 AM EDT, Ludwig Nussel <ludwig.nussel@suse.de> wrote:
Greg Freemyer schrieb:
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.
After having backed up my data I ran the long self test. No errors logged. So I'll give the cable a try. Thanks for the hint! Also, we don't have smartd enabled by default. I guess we should though.
cu Ludwig
Your comment about smartd seems like a non-sequitur.

Smartd has the job of copying internal disk logs to syslog so admins will notice them. The internal disk log exists independent of what smartd is doing.

Greg
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On 2014-10-08 12:57, Greg Freemyer wrote:
Your comment about smartd seems like a non-sequitur.
Smartd has the job of copying internal disk logs to syslog so admins will notice them. The internal disk log exists independent of what smartd is doing.
It does more. smartd can also run short/long tests on the disks periodically, and send you emails with reports or problems. You can define your own script and send SMSs to your phone, for instance.

--
Cheers / Saludos,
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
On 08/10/14 19:23, Ludwig Nussel wrote:
Greg Freemyer schrieb:
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.
After having backed up my data I ran the long self test. No errors logged. So I'll give the cable a try. Thanks for the hint!
I meant to respond to this re the cable yesterday, but other things got in the way...

Have a look at the cable to see if it is a red-coloured one, in which case the probability of it having to be replaced is high.

At one point I was running another distro, and while on its mailing list I read a comment by a tech who was an "old hand" at the game. His comment was that cables with red plastic coatings were causing problems, but he couldn't figure out why until one of his own computers "went down" and he actually decided to "pull" one of these cables apart. What he found was that the copper wires inside the cable had "rotted" away, caused by some chemicals used in the production of the cables.

Not an urban myth, BTW. I can see such cables available at stores which sell components at the "lower end of the market" and always avoid them.

(This advice is along the lines of what occurred some years ago when cables became available which made it *so* much easier to connect HDDs to the m/board because the cables were longer. The only hassle was that these cables were too long and caused signal bounce, which corrupted data written to/read from the HDDs. The designed max length of cables was (?)18 inches but the new ones were longer, thus causing the data corruption.)

[pruned]

BC
--
Using openSUSE 13.2, KDE 4.14.1 & kernel 3.16.3-1 on a system with-
AMD FX 8-core 3.6/4.2GHz processor
16GB PC14900/1866MHz Quad Channel RAM
Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU
On 07/10/14 17:34, Ludwig Nussel wrote:
Hi,
Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have the following messages in dmesg:
[pruned]

See the thread "Interpretation please" in 'opensuse help', which I started on 1 Sept 2014.

If you are, as I was, concerned about the results produced by smartctl, then send them to Seagate and ask for their comment. The reply I got was that I had no problems, but that to confirm I could run SeaTools against the HDDs I have.

BTW, the responses I got by asking that question in HELP ranged from "The sky is falling!" to "Nothing to worry about - typical Seagate results".

BC
--
Using openSUSE 13.1, KDE 4.14.1 & kernel 3.16.3-1 on a system with-
AMD FX 8-core 3.6/4.2GHz processor
16GB PC14900/1866MHz Quad Channel RAM
Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU
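A note on why the Seagate raw values in this thread look so alarming. A widely circulated but unofficial reading (an assumption here, confirmed neither by Seagate nor by anyone in this thread) is that on these drives the 48-bit raw value of attributes 1 and 7 packs an error count into the upper 16 bits and an operation count into the lower 32 bits. Under that assumption, a Python sketch (the function name is illustrative) splits the values Ludwig posted:

```python
from __future__ import annotations


def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    ASSUMPTION: upper bits = error count, lower bits = operation count.
    This layout is community folklore, not vendor documentation.
    """
    return raw >> 32, raw & 0xFFFFFFFF


if __name__ == "__main__":
    for name, raw in [
        ("Raw_Read_Error_Rate", 46434016),
        ("Seek_Error_Rate", 17265656910),
    ]:
        errors, operations = split_seagate_raw(raw)
        print(name, errors, operations)
```

Read this way, the Seek_Error_Rate value would mean 4 errors in roughly 85.8 million seeks, and the Raw_Read_Error_Rate value would carry no errors at all. Treat it as an illustration of why the raw figures alone are inconclusive, not as a diagnosis.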
participants (6)
- Basil Chupin
- Bruno Friedmann
- Carlos E. R.
- Greg Freemyer
- Hannes Reinecke
- Ludwig Nussel