[opensuse] ticking hard disk problem
Hello:
I have openSUSE 11.2 with kernel 2.6.31.14-0.4-desktop. The system has two 160 GB Maxtor hard disks which are connected to a Silicon Image 3114 PCI SATA (soft) RAID controller. They are configured as RAID1 (mirror) devices, and dmraid is set up and works well except for the symptom below.
The problem is that occasionally one of the disks makes a ticking sound, and this sound becomes more frequent when disk activity increases (e.g. when copying from CD to disk).
In /var/log/messages there are several lines like these:
Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jan 14 23:43:17 linux kernel: [ 4468.814840] ata5: SError: { PHYRdyChg }
Jan 14 23:43:17 linux kernel: [ 4468.814858] ata5.00: cmd c8/00:08:67:05:f4/00:00:00:00:00/e0 tag 0 dma 4096 in
Jan 14 23:43:17 linux kernel: [ 4468.814860] res d0/d0:d0:d0:d0:d0/ff:ff:ff:ff:ff/c0 Emask 0x12 (ATA bus error)
Jan 14 23:43:17 linux kernel: [ 4468.814878] ata5.00: status: { Busy }
Jan 14 23:43:17 linux kernel: [ 4468.814887] ata5.00: error: { ICRC UNC IDNF }
Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link
These messages occur several times in the file. I guess they have something to do with the ticks. According to dmesg, ata5 is the Maxtor 6L160M0 disk.
dmesg|grep ata5
[ 1.925062] ata5: SATA max UDMA/100 mmio m1024@0xec007000 tf 0xec007080 irq 18
[ 2.230164] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2.234193] ata5.00: ATA-7: Maxtor 6L160M0, BANC1G10, max UDMA/133
[ 2.234199] ata5.00: 320173056 sectors, multi 0: LBA48 NCQ (not used)
[ 2.241186] ata5.00: configured for UDMA/100
I don't think that the hard disk is bad. I think this is a RAID hardware or software issue: either the SiI3114 PCI card or, even more likely, the SATA RAID driver. I googled and found several topics with similar error messages but could not find anything useful. How could I trace the origin of this symptom and check whether it is a hard disk hardware problem or a driver problem? And possibly fix it in the latter case?
Thanks, Istvan
-- 
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
On 1/14/2011 3:25 PM, Istvan Gabor wrote:
Hello:
I have openSUSE 11.2 with kernel 2.6.31.14-0.4-desktop. The system has two 160 GB Maxtor hard disks which are connected to a Silicon Image 3114 PCI SATA (soft) RAID controller. They are configured as RAID1 (mirror) devices, and dmraid is set up and works well except for the symptom below.
The problem is that occasionally one of the disks makes a ticking sound, and this sound becomes more frequent when disk activity increases (e.g. when copying from CD to disk).
In /var/log/messages there are several lines like these:
Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jan 14 23:43:17 linux kernel: [ 4468.814840] ata5: SError: { PHYRdyChg }
Jan 14 23:43:17 linux kernel: [ 4468.814858] ata5.00: cmd c8/00:08:67:05:f4/00:00:00:00:00/e0 tag 0 dma 4096 in
Jan 14 23:43:17 linux kernel: [ 4468.814860] res d0/d0:d0:d0:d0:d0/ff:ff:ff:ff:ff/c0 Emask 0x12 (ATA bus error)
Jan 14 23:43:17 linux kernel: [ 4468.814878] ata5.00: status: { Busy }
Jan 14 23:43:17 linux kernel: [ 4468.814887] ata5.00: error: { ICRC UNC IDNF }
Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link
These messages occur several times in the file. I guess they have something to do with the ticks.
I had one of these a year or two ago on software RAID. Not good. Make sure you have a hot spare in your RAID definition.
Check for a drive running hot, loose cables, etc. Are you running smartd? grep smartd /var/log/messages and look for anything other than temperature changes.
If you can determine which one is doing this, you might want to add a hot spare to your RAID and then use whatever RAID tools you have to fail the drive or remove it at a time of your choosing and let the system rebuild, rather than have this happen when you least expect it.
-- 
_____________________________________
---This space for rent---
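To work out which disk is affected, one option is to tally the libata exception and link-reset messages per port. The following is only a minimal Python sketch, assuming the kernel message format quoted in this thread; the function name tally_ata_events and the two sample lines are illustrative, not part of any existing tool.

```python
import re
from collections import Counter

# Patterns assume the libata message wording quoted in this thread.
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def tally_ata_events(lines):
    """Count exception and hard-reset messages per ATA port."""
    exceptions, resets = Counter(), Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            exceptions[match.group(1)] += 1
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return exceptions, resets


# Two lines copied from the log in the original post.
SAMPLE = [
    "Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception "
    "Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen",
    "Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link",
]

exceptions, resets = tally_ata_events(SAMPLE)
print(dict(exceptions), dict(resets))
```

To run it against a real log, pass an open file object (for example one opened on /var/log/messages) instead of SAMPLE. The port name can then be matched to a drive model with dmesg, as in the original post.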
On 1/14/2011 3:25 PM, Istvan Gabor wrote:
Hello:
I have openSUSE 11.2 with kernel 2.6.31.14-0.4-desktop. The system has two 160 GB Maxtor hard disks which are connected to a Silicon Image 3114 PCI SATA (soft) RAID controller. They are configured as RAID1 (mirror) devices, and dmraid is set up and works well except for the symptom below.
The problem is that occasionally one of the disks makes a ticking sound, and this sound becomes more frequent when disk activity increases (e.g. when copying from CD to disk).
In /var/log/messages there are several lines like these:
Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jan 14 23:43:17 linux kernel: [ 4468.814840] ata5: SError: { PHYRdyChg }
Jan 14 23:43:17 linux kernel: [ 4468.814858] ata5.00: cmd c8/00:08:67:05:f4/00:00:00:00:00/e0 tag 0 dma 4096 in
Jan 14 23:43:17 linux kernel: [ 4468.814860] res d0/d0:d0:d0:d0:d0/ff:ff:ff:ff:ff/c0 Emask 0x12 (ATA bus error)
Jan 14 23:43:17 linux kernel: [ 4468.814878] ata5.00: status: { Busy }
Jan 14 23:43:17 linux kernel: [ 4468.814887] ata5.00: error: { ICRC UNC IDNF }
Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link
These messages occur several times in the file. I guess they have something to do with the ticks.
On Saturday 15 January 2011 00:23:47 John Andersen wrote:
I had one of these a year or two ago on software RAID. Not good. Make sure you have a hot spare in your RAID definition.
Check for a drive running hot, loose cables, etc. Are you running smartd? grep smartd /var/log/messages and look for anything other than temperature changes.
If you can determine which one is doing this, you might want to add a hot spare to your RAID and then use whatever RAID tools you have to fail the drive or remove it at a time of your choosing and let the system rebuild, rather than have this happen when you least expect it.
That sounds very much like a dying drive to me. Back it up and can it, and if this is a SATA drive, make sure you replace the data cables with the ones that lock firmly into place. I have a 1 TB SATA drive here that is scrap because the data cable got skewed over and fried the interface. Samsung won't supply a new board, so I don't buy Samsung devices again. Simples.
Pete.
-- 
Powered by openSUSE 11.3 (x86_64) Kernel: 2.6.34.7-0.7-desktop
KDE Development Platform: 4.4.4 (KDE 4.4.4) "release 3"
08:15 up 2 days 14:22, 4 users, load average: 0.01, 0.02, 0.00
Peter Nikolic wrote:
On Saturday 15 January 2011 00:23:47 John Andersen wrote:
On 1/14/2011 3:25 PM, Istvan Gabor wrote:
I have openSUSE 11.2 with kernel 2.6.31.14-0.4-desktop. The system has two 160 GB Maxtor hard disks which are connected to a Silicon Image 3114 PCI SATA (soft) RAID controller. They are configured as RAID1 (mirror) devices, and dmraid is set up and works well except for the symptom below.
The problem is that occasionally one of the disks makes a ticking sound, and this sound becomes more frequent when disk activity increases (e.g. when copying from CD to disk).
Has the system ever worked perfectly or has it always shown these symptoms in this configuration?
In /var/log/messages there are several lines like these:
Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jan 14 23:43:17 linux kernel: [ 4468.814840] ata5: SError: { PHYRdyChg }
Jan 14 23:43:17 linux kernel: [ 4468.814858] ata5.00: cmd c8/00:08:67:05:f4/00:00:00:00:00/e0 tag 0 dma 4096 in
Jan 14 23:43:17 linux kernel: [ 4468.814860] res d0/d0:d0:d0:d0:d0/ff:ff:ff:ff:ff/c0 Emask 0x12 (ATA bus error)
Jan 14 23:43:17 linux kernel: [ 4468.814878] ata5.00: status: { Busy }
Jan 14 23:43:17 linux kernel: [ 4468.814887] ata5.00: error: { ICRC UNC IDNF }
Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link
There's an explanation of the error messages at https://ata.wiki.kernel.org/index.php/Libata_error_messages that might help.
But I have to say that I'm pretty much still as mystified as I was even after reading that page. Does anybody know of a better ( == more idiot proof) explanation?
These messages occur several times in the file. I guess they have something to do with the ticks.
Almost certainly.
I don't know whether it is a disk or a problem with the controller or your system setup (e.g. power supply). I have a system where the disks are shown as healthy by SMART and pass all manufacturers' tests but show similar symptoms when attached to that particular system.
What does SMART say about your disks?
Run smartctl -t long <device> and then, after it has finished, run smartctl -a <device> to see what the result was.
I had one of these a year or two ago on software raid. Not good. Make sure you have a hot spare in your raid definition.
That sounds very much like a dying drive to me back it up
These comments sound like good advice!
Cheers, Dave
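For readers who also find the libata error page hard to digest, the bit fields in the quoted kernel messages can be decoded mechanically. The following is a minimal Python sketch, assuming the bit positions of the SATA SError diagnostic field and the ATA error register; only a subset of bits is listed, and the names SERR_BITS, ERR_BITS and decode are made up for this example.

```python
# Subset of SError diagnostic bits, labelled the way libata prints them.
SERR_BITS = {
    16: "PHYRdyChg",  # PHY ready state changed (link dropped or came back)
    17: "PHYInt",     # PHY internal error
    18: "CommWake",   # COMWAKE signal detected
    19: "10B8B",      # 10b-to-8b decode error
    20: "Dispar",     # disparity error
    21: "BadCRC",     # CRC error on the link
    22: "Handshk",    # handshake error
    26: "DevExch",    # device exchanged
}

# ATA error register bits that appear in the "error: { ... }" line.
ERR_BITS = {
    7: "ICRC",  # interface CRC error
    6: "UNC",   # uncorrectable media error
    4: "IDNF",  # sector ID not found
    2: "ABRT",  # command aborted
}


def decode(value, names):
    """Return the names of all bits set in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


# Values taken from the log in this thread: SErr 0x10000 and the second
# byte of "res d0/d0:...", which is the error register.
print(decode(0x10000, SERR_BITS))
print(decode(0xD0, ERR_BITS))
```

This only names the bits that are set. It does not say whether the disk, the cable, the controller or the power supply caused them.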
On 17 January 2011, 11:16, Dave Howorth wrote:
Peter Nikolic wrote:
On Saturday 15 January 2011 00:23:47 John Andersen wrote:
On 1/14/2011 3:25 PM, Istvan Gabor wrote:
I have openSUSE 11.2 with kernel 2.6.31.14-0.4-desktop. The system has two 160 GB Maxtor hard disks which are connected to a Silicon Image 3114 PCI SATA (soft) RAID controller. They are configured as RAID1 (mirror) devices, and dmraid is set up and works well except for the symptom below.
The problem is that occasionally one of the disks makes a ticking sound, and this sound becomes more frequent when disk activity increases (e.g. when copying from CD to disk).
Has the system ever worked perfectly or has it always shown these symptoms in this configuration?
In /var/log/messages there are several lines like these:
Jan 14 23:43:17 linux kernel: [ 4468.814798] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Jan 14 23:43:17 linux kernel: [ 4468.814840] ata5: SError: { PHYRdyChg }
Jan 14 23:43:17 linux kernel: [ 4468.814858] ata5.00: cmd c8/00:08:67:05:f4/00:00:00:00:00/e0 tag 0 dma 4096 in
Jan 14 23:43:17 linux kernel: [ 4468.814860] res d0/d0:d0:d0:d0:d0/ff:ff:ff:ff:ff/c0 Emask 0x12 (ATA bus error)
Jan 14 23:43:17 linux kernel: [ 4468.814878] ata5.00: status: { Busy }
Jan 14 23:43:17 linux kernel: [ 4468.814887] ata5.00: error: { ICRC UNC IDNF }
Jan 14 23:43:17 linux kernel: [ 4468.814904] ata5: hard resetting link
There's an explanation of the error messages at https://ata.wiki.kernel.org/index.php/Libata_error_messages that might help.
But I have to say that I'm pretty much still as mystified as I was even after reading that page. Does anybody know of a better ( == more idiot proof) explanation?
These messages occur several times in the file. I guess they have something to do with the ticks.
Almost certainly.
I don't know whether it is a disk or a problem with the controller or your system setup (e.g. power supply). I have a system where the disks are shown as healthy by SMART and pass all manufacturers' tests but show similar symptoms when attached to that particular system.
What does SMART say about your disks?
Run smartctl -t long <device> and then, after it has finished, run smartctl -a <device> to see what the result was.
I had one of these a year or two ago on software raid. Not good. Make sure you have a hot spare in your raid definition.
That sounds very much like a dying drive to me back it up
These comments sound like good advice!
Cheers, Dave
First, thank you Dave, Pete and John for your help; second, I apologize for the late response.
In the meantime I removed the hard disk in question from the system and installed it in another computer (not as a RAID device, just as a normal SATA disk). In that other system the drive operates with no problem: there are no ticks, copying to the drive runs at up to 40-60 Mbit/sec without hanging, and there are no kernel error messages.
So it is/was either a controller problem or a driver problem, I guess.
Should I try another controller with a different chipset?
I will run the SMART test later; it's too late now.
Thanks again,
Istvan
On 01/20/2011 05:38 PM, Istvan Gabor wrote:
First, thank you Dave, Pete and John for your help; second, I apologize for the late response.
In the meantime I removed the hard disk in question from the system and installed it in another computer (not as a RAID device, just as a normal SATA disk). In that other system the drive operates with no problem: there are no ticks, copying to the drive runs at up to 40-60 Mbit/sec without hanging, and there are no kernel error messages.
So it is/was either a controller problem, or a driver problem I guess.
Should I try another controller with a different chipset?
I will run smart test later, it's too late now.
Thanks again,
Istvan
Istvan,
I have seen similar behavior with Seagate drives in dmraid arrays. I don't have an answer as to why this occurs, but I always felt that it had something to do with problems handling bad-block reallocation while the disks were in arrays. I know that is handled at the drive level and shouldn't matter, but.... I have split arrays into single disks when this occurred, fscked, run badblocks and rebuilt the arrays, and have had them run for another month or so before the same issues occurred. I still have the same drives running non-raided with no issues at all 2 years later.
Another issue that will cause dmraid to desync is using 'savedefault' in grub to provide failover to another kernel or OS in the event of a failed boot. If you have changed 'default #' to 'default saved' and then added 'savedefault' or 'savedefault #' to your boot entries in grub, get rid of them, go back to 'default #', and see if that helps.
dmraid has been solid for me for at least 8 years, but when this type of issue pops up, you realize how much voodoo it relies on, which makes tracking down the exact issue difficult. The dmraid mailing list is 'dm-devel@redhat.com'. The list is low-volume (1-2 posts per day), so it wouldn't hurt to subscribe and run the issue by the devs to see if they have a better answer.
-- 
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
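Checking a GRUB legacy configuration for the 'default saved' and 'savedefault' directives can be done by eye, or with a few lines of Python. This is a rough sketch only, assuming GRUB legacy menu.lst syntax; the sample configuration, the device path in it, and the function name find_savedefault are hypothetical.

```python
def find_savedefault(menu_lst_text):
    """Return (line number, text) for each 'default saved' or 'savedefault' line."""
    hits = []
    for number, raw in enumerate(menu_lst_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Treat "default=saved" the same as "default saved" (assumption).
        words = line.replace("=", " ", 1).split()
        if words[0] == "savedefault" or words[:2] == ["default", "saved"]:
            hits.append((number, line))
    return hits


# Hypothetical menu.lst content, for illustration only.
SAMPLE = """\
default saved
timeout 8

title openSUSE 11.2
    root (hd0,1)
    kernel /boot/vmlinuz root=/dev/mapper/example_part2
    initrd /boot/initrd
    savedefault
"""

for number, text in find_savedefault(SAMPLE):
    print(number, text)
```

To check a real system, read the contents of the GRUB legacy menu file into a string and pass it to the function. Any hits are the lines to revert to a fixed 'default #'.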
On Thu, Jan 20, 2011 at 7:10 PM, David C. Rankin
On 01/20/2011 05:38 PM, Istvan Gabor wrote:
First, thank you Dave, Pete and John for your help; second, I apologize for the late response.
In the meantime I removed the hard disk in question from the system and installed it in another computer (not as a RAID device, just as a normal SATA disk). In that other system the drive operates with no problem: there are no ticks, copying to the drive runs at up to 40-60 Mbit/sec without hanging, and there are no kernel error messages.
So it is/was either a controller problem, or a driver problem I guess.
Should I try another controller with a different chipset?
I will run smart test later, it's too late now.
Thanks again,
Istvan
Istvan,
I have seen similar behavior with Seagate drives in dmraid arrays. I don't have an answer as to why this occurs, but I always felt that it had something to do with problems handling bad-block reallocation while the disks were in arrays. I know that is handled at the drive level and shouldn't matter, but.... I have split arrays into single disks when this occurred, fscked, run badblocks and rebuilt the arrays, and have had them run for another month or so before the same issues occurred. I still have the same drives running non-raided with no issues at all 2 years later.
Another issue that will cause dmraid to desync is using 'savedefault' in grub to provide failover to another kernel or OS in the event of a failed boot. If you have changed 'default #' to 'default saved' and then added 'savedefault' or 'savedefault #' to your boot entries in grub, get rid of them, go back to 'default #', and see if that helps.
dmraid has been solid for me for at least 8 years, but when this type of issue pops up, you realize how much voodoo it relies on, which makes tracking down the exact issue difficult. The dmraid mailing list is 'dm-devel@redhat.com'. The list is low-volume (1-2 posts per day), so it wouldn't hurt to subscribe and run the issue by the devs to see if they have a better answer.
The Linux RAID list
participants (6)
- Dave Howorth
- David C. Rankin
- Greg Freemyer
- Istvan Gabor
- John Andersen
- Peter Nikolic