[opensuse] how to diagnose disk-related problem?
I have a problem on a new server that I'm bedding in. I'm trying to copy data onto it from an older machine and the new machine is going into a strange 'locked-in' state. I'm looking for suggestions on the best way to investigate the problem.
On an old server, there's some data that I want to copy to the new server. The way I'm doing that is to nfs export the data on the old server, nfs mount it on the new server and then use cp to copy it to its new home, something like this:
cp -uax /nfs/old-host/data /new-directory
I actually have two of these running, for different subsets of the data. BTW, I'm executing all these commands via ssh from my desktop machine.
That works fine for a couple of hours and then when I try to execute another command (ls) in another ssh shell it tells me that the new server has closed the connection. The two cp commands are still apparently running but when I try typing ^Z, those ssh sessions are terminated as well. If I go to the new server's screen, it is blank and I haven't found any key sequence that makes anything appear on it (specifically, CTRL-ALT-F1 doesn't, for example).
So it's beginning to sound like it's crashed, yes? But it hasn't. The disk activity lights for the drives where /new-directory lives are still flashing and running top on the old server shows me that a couple of nfsd processes are busy and that system is spending a lot of time accessing its disks (there's not much else running) so it seems like the cp processes are still running.
At this point I don't see any alternative to rebooting but I thought I'd ask first to see if anybody had any other ideas on ways to gather information?
These symptoms have occurred once before. When I rebooted that time, the system came up without one of the data disks - the system said the interface wasn't responding. But when I unplugged the disk and plugged it back in and rebooted, it came back. There was actually a lot else happened at that time, so I can't be sure it was as simple as it sounds. I'll see if the same happens this time. I wasn't able to spot anything useful in the logs and smart said the disk was OK, AFAICT.
The new data directory is in an ext4 filesystem in an LVM volume on an mdadm RAID10 all on openSUSE 11.2 so another possibility is some kind of problem in one of those systems. The drive is in a SATA hot swap cage connected to a port on the motherboard; it's a WD RE4 1.5 TB.
All ideas gratefully received!
Cheers, Dave
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
Dave Howorth wrote:
I have a problem on a new server that I'm bedding in. I'm trying to copy data onto it from an older machine and the new machine is going into a strange 'locked-in' state. I'm looking for suggestions on the best way to investigate the problem.
On an old server, there's some data that I want to copy to the new server. The way I'm doing that is to nfs export the data on the old server, nfs mount it on the new server and then use cp to copy it to its new home, something like this:
cp -uax /nfs/old-host/data /new-directory
Personally, I always use rsync for that sort of thing. Key feature being that it is restartable.
-- Per Jessen, Zürich (25.8°C)
Per Jessen wrote:
Dave Howorth wrote:
cp -uax /nfs/old-host/data /new-directory
Personally, I always use rsync for that sort of thing. Key feature being that it is restartable.
Indeed but unfortunately in this case it doesn't work. It runs out of memory on the old server before completing preparation of the file list. And I don't want to mess around with the config on the old [production] server - I want to replace it.
Cheers, Dave
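For anyone following along, the reason a `cp -u` copy can be restarted at all is that `-u` skips any file whose destination already exists and is not older than the source, so re-running the same command after a crash only transfers what is missing or stale. Below is a minimal Python sketch of that rule. It is an illustration only, not something anyone in this thread ran, and the function name `copy_newer` is invented for the example. It handles regular files only (no symlinks, devices or ownership) and walks the tree one directory at a time instead of building a complete file list first.

```python
import os
import shutil


def copy_newer(src_root, dst_root):
    """Copy regular files from src_root to dst_root, skipping up-to-date ones.

    Returns the number of files actually copied.
    """
    copied = 0
    # os.walk lists one directory at a time; only the current directory
    # and its ancestors are held in memory, not the whole tree.
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            # Same rule as cp -u: copy if missing or if the source is newer.
            if (not os.path.exists(dst)
                    or os.path.getmtime(src) > os.path.getmtime(dst)):
                shutil.copy2(src, dst)  # copy2 also preserves timestamps
                copied += 1
    return copied
```

Calling `copy_newer` a second time on the same pair of directories should copy nothing, which is the property that makes an interrupted bulk copy cheap to resume.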
On 07/22/2010 08:29 AM, Dave Howorth pecked at the keyboard and wrote:
Per Jessen wrote:
Dave Howorth wrote:
cp -uax /nfs/old-host/data /new-directory
Personally, I always use rsync for that sort of thing. Key feature being that it is restartable.
Indeed but unfortunately in this case it doesn't work. It runs out of memory on the old server before completing preparation of the file list. And I don't want to mess around with the config on the old [production] server - I want to replace it.
Cheers, Dave
Did you try pulling the data to the new machine by running the command from the new one?
-- Ken Schneider SuSe since Version 5.2, June 1998
Ken Schneider - openSUSE wrote:
On 07/22/2010 08:29 AM, Dave Howorth pecked at the keyboard and wrote:
Per Jessen wrote:
Dave Howorth wrote:
cp -uax /nfs/old-host/data /new-directory
Personally, I always use rsync for that sort of thing. Key feature being that it is restartable.
Indeed but unfortunately in this case it doesn't work. It runs out of memory on the old server before completing preparation of the file list. And I don't want to mess around with the config on the old [production] server - I want to replace it.
Cheers, Dave
Did you try pulling the data to the new machine by running the command from the new one?
Yes that's what I was doing.
Cheers, Dave
Dave Howorth wrote:
Per Jessen wrote:
Dave Howorth wrote:
cp -uax /nfs/old-host/data /new-directory
Personally, I always use rsync for that sort of thing. Key feature being that it is restartable.
Indeed but unfortunately in this case it doesn't work. It runs out of memory on the old server before completing preparation of the file list.
Interesting problem.
And I don't want to mess around with the config on the old [production] server - I want to replace it.
Understand.
-- Per Jessen, Zürich (24.1°C)
Dave Howorth wrote:
So it's beginning to sound like it's crashed, yes? But it hasn't. The disk activity lights for the drives where /new-directory lives are still flashing and running top on the old server shows me that a couple of nfsd processes are busy and that system is spending a lot of time accessing its disks (there's not much else running) so it seems like the cp processes are still running.
Hmm, and you are sure that both disksets are OK and don't have any bad sectors? What filesystems are you using?
At this point I don't see any alternative to rebooting but I thought I'd ask first to see if anybody had any other ideas on ways to gather information?
Difficult if you cannot access it anymore. I'm normally logging to a remote machine so I can always see syslog output.
These symptoms have occurred once before. When I rebooted that time, the system came up without one of the data disks - the system said the interface wasn't responding. But when I unplugged the disk and plugged it back in and rebooted, it came back. There was actually a lot else happened at that time, so I can't be sure it was as simple as it sounds. I'll see if the same happens this time. I wasn't able to spot anything useful in the logs and smart said the disk was OK, AFAICT.
Is that a RAID? We sometimes have trouble with bad connections, but that is mostly due to very low humidity in our environment. I often need to re-seat PCI cards etc because they vanish.
The new data directory is in an ext4 filesystem in an LVM volume on an mdadm RAID10 all on openSUSE 11.2 so another possibility is some kind of problem in one of those systems. The drive is in a SATA hot swap cage connected to a port on the motherboard; it's a WD RE4 1.5 TB.
Ah, there it goes. Hmm, running out of resources? If the new disks are still flashing they are still writing, so the buffer might be over-full and everything else is swapped out? Could of course also be a network card problem.... Out of ideas now
Pit
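The remote-logging idea can be sketched in a few lines. Classic syslog forwarding is plain UDP datagrams (port 514 by convention), so a receiver only has to bind a socket and append whatever arrives to a file. The Python below is a toy stand-in for a real syslog daemon on the log host; the function name `receive_logs` and its parameters are invented for the example, and the machine being debugged still has to be configured separately to forward its kernel messages.

```python
import socket


def receive_logs(sock, logfile, max_messages=None):
    """Append each datagram received on sock to logfile as one line.

    max_messages limits how many datagrams are read (None = run forever).
    Returns the number of messages written.
    """
    count = 0
    with open(logfile, "a", encoding="utf-8") as out:
        while max_messages is None or count < max_messages:
            data, (host, _port) = sock.recvfrom(8192)
            text = data.decode("utf-8", errors="replace").rstrip()
            # Prefix the sender's address so several hosts can share a file.
            out.write(f"{host} {text}\n")
            out.flush()  # keep the file current in case the sender dies
            count += 1
    return count
```

To use it, create a `socket.SOCK_DGRAM` socket on the log host, bind it to port 514 (which needs root) and pass it to `receive_logs`. Flushing after every message matters here: the whole point is that the last lines before a freeze survive on another machine.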
Peter Suetterlin wrote:
Dave Howorth wrote:
So it's beginning to sound like it's crashed, yes? But it hasn't. The disk activity lights for the drives where /new-directory lives are still flashing and running top on the old server shows me that a couple of nfsd processes are busy and that system is spending a lot of time accessing its disks (there's not much else running) so it seems like the cp processes are still running.
Hmm, and you are sure that both disksets are OK and don't have any bad sectors? What filesystems are you using?
I made a mistake when I said the new filesystem is ext4; it's actually xfs. The old server is reiserfs but I don't think there's any problem on that machine. There definitely seems to be some kind of problem on the new machine, most likely associated with a drive or motherboard port.
At this point I don't see any alternative to rebooting but I thought I'd ask first to see if anybody had any other ideas on ways to gather information?
Difficult if you cannot access it anymore. I'm normally logging to a remote machine so I can always see syslog output.
That's a good idea. I'll set that up before I run my next experiment. Thanks. At the moment suse is on another partition on the drive that fails. I might move suse to a different drive to separate things. I did reboot and found the same symptoms as before - one specific drive not responding. After a powercycle, it's OK again. I've swapped it with the drive next to it to see whether the problem follows the drive or stays with the bay/controller port. At the moment it's still rebuilding the RAID.
Ah, there it goes. Hmm, running out of resources? If the new disks are still flashing they are still writing, so the buffer might be over-full and everything else is swapped out?
Hmm, I'll be sure to watch the resource usage more carefully when I restart the transfer. Thanks again.
Could of course also be a network card problem....
Maybe, but I don't see how that could produce the temporarily failed drive symptom.
Out of ideas now
Thanks for the ones you did have!
Dave
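One way to watch for the "buffer over-full, everything else swapped out" scenario during the next transfer is to sample `/proc/meminfo` periodically and record how much dirty page cache is waiting to be written and how much swap is in use. A minimal Python sketch follows; the function names are invented for the example, and nothing here was measured on the server in this thread.

```python
def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields with integer values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_pressure(fields):
    """Pick out the figures relevant to a stalled bulk copy, in kB."""
    return {
        "dirty_kb": fields.get("Dirty", 0),          # cached data not yet on disk
        "writeback_kb": fields.get("Writeback", 0),  # data being written right now
        "swap_used_kb": fields.get("SwapTotal", 0) - fields.get("SwapFree", 0),
    }


def read_pressure(path="/proc/meminfo"):
    """Read the live values on a Linux system."""
    with open(path) as f:
        return writeback_pressure(parse_meminfo(f.read()))
```

Logging the result of `read_pressure()` every few seconds, ideally to the remote log, would show whether dirty data or swap usage climbs steadily before the machine stops responding.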
On 2010-07-22 I wrote:
There definitely seems to be some kind of problem on the new machine, most likely associated with a drive or motherboard port.
Difficult if you cannot access it anymore. I'm normally logging to a remote machine so I can always see syslog output.
That's a good idea. I'll set that up before I run my next experiment. Thanks. At the moment suse is on another partition on the drive that fails. I might move suse to a different drive to separate things.
I did reboot and found the same symptoms as before - one specific drive not responding. After a powercycle, it's OK again. I've swapped it with the drive next to it to see whether the problem follows the drive or stays with the bay/controller port.
So ... here I am again :( You may remember I had an intermittent disk failure. It's been alright since then but it finally failed again and this time I managed to capture an error log.
I had had a problem on a particular disk, which is part of a RAID. I put the disk back in a different bay in the disk cage (I swapped it with one of the other disks in the RAID) in order to see whether it was the disk or the port that had the problem. I let it reintegrate the RAID and then I set it back to the task of loading all my data onto it. It's been doing that solidly for the past two plus weeks but has now crashed again.
I also set up a network log, as Pit suggested and that has come up trumps! The problem has moved with the disk. The last bit of the log is below; does it mean anything to anyone? I'll post more details tomorrow.
Cheers, Dave
Aug 7 02:36:13 scop4 kernel: [1341328.180960] ata4.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x6 frozen
Aug 7 02:36:13 scop4 kernel: [1341328.180980] ata4.00: cmd 61/08:00:2b:00:dd/00:00:16:00:00/40 tag 0 ncq 4096 out
Aug 7 02:36:13 scop4 kernel: [1341328.180982] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 7 02:36:13 scop4 kernel: [1341328.180994] ata4.00: status: { DRDY }
Aug 7 02:36:13 scop4 kernel: [1341328.181004] ata4.00: cmd 61/08:08:e3:4b:dd/00:00:16:00:00/40 tag 1 ncq 4096 out
Aug 7 02:36:13 scop4 kernel: [1341328.181006] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 7 02:36:13 scop4 kernel: [1341328.181017] ata4.00: status: { DRDY }
Aug 7 02:36:13 scop4 kernel: [1341328.181026] ata4.00: cmd 61/08:10:cb:c0:8e/00:00:48:00:00/40 tag 2 ncq 4096 out
Aug 7 02:36:13 scop4 kernel: [1341328.181028] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 7 02:36:13 scop4 kernel: [1341328.181039] ata4.00: status: { DRDY }
Aug 7 02:36:13 scop4 kernel: [1341328.181048] ata4.00: cmd 61/08:18:e7:08:85/00:00:00:00:00/40 tag 3 ncq 4096 out
Aug 7 02:36:13 scop4 kernel: [1341328.181050] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 7 02:36:13 scop4 kernel: [1341328.181061] ata4.00: status: { DRDY }
Aug 7 02:36:13 scop4 kernel: [1341328.181070] ata4.00: cmd 61/08:20:ef:08:85/00:00:00:00:00/40 tag 4 ncq 4096 out
Aug 7 02:36:13 scop4 kernel: [1341328.181072] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 7 02:36:13 scop4 kernel: [1341328.181083] ata4.00: status: { DRDY }
Aug 7 02:36:13 scop4 kernel: [1341328.181091] ata4: hard resetting link
Aug 7 02:36:19 scop4 kernel: [1341333.536388] ata4: link is slow to respond, please be patient (ready=0)
Aug 7 02:36:23 scop4 kernel: [1341338.228803] ata4: COMRESET failed (errno=-16)
Aug 7 02:36:23 scop4 kernel: [1341338.228817] ata4: hard resetting link
Aug 7 02:36:29 scop4 kernel: [1341343.583285] ata4: link is slow to respond, please be patient (ready=0)
Aug 7 02:36:33 scop4 kernel: [1341348.275690] ata4: COMRESET failed (errno=-16)
Aug 7 02:36:33 scop4 kernel: [1341348.275704] ata4: hard resetting link
Aug 7 02:36:39 scop4 kernel: [1341353.630304] ata4: link is slow to respond, please be patient (ready=0)
Aug 7 02:37:08 scop4 kernel: [1341383.316777] ata4: COMRESET failed (errno=-16)
Aug 7 02:37:08 scop4 kernel: [1341383.316791] ata4: limiting SATA link speed to 1.5 Gbps
Aug 7 02:37:08 scop4 kernel: [1341383.316798] ata4: hard resetting link
Aug 7 02:37:13 scop4 kernel: [1341388.366208] ata4: COMRESET failed (errno=-16)
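Parts of this log can be decoded mechanically. `SAct` is a bitmask of the NCQ tags that were in flight when the port froze, and the first byte of each `cmd` line is the ATA command opcode; 0x61 is WRITE FPDMA QUEUED, i.e. an NCQ write. The Python sketch below pulls those out and counts timeouts and failed link resets. The helper names are invented for this example, and it only understands the line shapes that appear in Dave's log, so treat it as a reading aid rather than a general libata parser.

```python
import re

# Only the two NCQ opcodes are named; anything else is reported as hex.
NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct mask."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


def summarise(log_text):
    """Extract a few facts from libata error lines like those in the post."""
    sact = re.search(r"SAct (0x[0-9a-f]+)", log_text)
    opcodes = {int(m, 16) for m in re.findall(r"cmd ([0-9a-f]{2})/", log_text)}
    return {
        "tags_in_flight": active_tags(int(sact.group(1), 16)) if sact else [],
        "commands": sorted(NCQ_OPCODES.get(op, hex(op)) for op in opcodes),
        "timeouts": len(re.findall(r"Emask 0x4 \(timeout\)", log_text)),
        "failed_resets": len(re.findall(r"COMRESET failed", log_text)),
    }
```

For the log in the post, the `SAct` value 0x1f decodes to tags 0 through 4, matching the five `cmd` lines: every queued write timed out, and the link resets that followed failed with errno -16 (EBUSY), meaning the device never reported ready again.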
On 2010-08-09 18:23, Dave Howorth wrote:
I also set up a network log, as Pit suggested and that has come up trumps! The problem has moved with the disk. The last bit of the log is below; does it mean anything to anyone?
Not to me. However, you know that there are specific disc test utilities out there. For example, Seagate has one on their site, a small CD you can download and boot with (it was a floppy before). You boot and test the disk with it. It has to be good, because if it says a disk is bad, you print a form directly and send it back on warranty. Basically, it can do the same as the internal SMART tests, but from outside. Plus, the cable and interface. I suppose other manufacturers have their own utilities.
-- Cheers / Saludos, Carlos E. R. (from 11.2 x86_64 "Emerald" GM (Minas Tirith))
Hi,
On Mon, Aug 9, 2010 at 9:14 PM, Carlos E. R. <robin.listas@gmail.com> wrote:
On 2010-08-09 18:23, Dave Howorth wrote:
I also set up a network log, as Pit suggested and that has come up trumps! The problem has moved with the disk. The last bit of the log is below; does it mean anything to anyone?
Not to me.
However, you know that there are specific disc test utilities out there. For example, Seagate has one on their site, a small CD you can download and boot with (it was a floppy before). You boot and test the disk with it. It has to be good, because if it says a disk is bad, you print a form directly and send it back on warranty.
I had an interesting experience with a vendor test utility. (I'm not saying one should not trust them, just keep in mind that nothing is perfect.)
It was about 5 years ago or even more. I had a Maxtor 60 GB HD with Windows 2000 (rarely used) and SuSE. One day while in Linux I noticed some unexpected disk activity (it was an ext3 FS). The system locked up. After a reboot GRUB could not boot Linux (and could not find its menu). But using its command line I could boot Windows. A disk test indicated there were multiple defects in the Linux partition.
I then downloaded the Maxtor diagnostics disk (it was a floppy then) and their test failed twice and indicated the disk had to be replaced. Of course I was terribly upset and decided to use some tricks learned in old MS-DOS times (good old Norton Disk Editor) and fsck to recover some important data.
After I more or less succeeded and had nothing more to lose, I decided to run fsck in auto-repair mode (I knew from even pre-MS-era experience that rewriting a disk block several times sometimes recovers it). It ran about 2 days...
After that I ran the vendor test again - the disk was found OK. I did it a number of times - no problems. So I decided to try and re-install Linux. Guess what? This disk is still working and it was on my main desktop for 5 years and I've not seen any failures.
Regards,
-- Mark Goldstein
On 2010-08-09 21:00, Mark Goldstein wrote:
Hi,
On Mon, Aug 9, 2010 at 9:14 PM, Carlos E. R. wrote:
After that I ran vendor test again - disk was found OK. I did it a number of times - no problems. So I decided to try and re-install Linux. Guess what? This disk is still working and it was on my main desktop for 5 years and I've not seen any failures.
I know, I have a very similar experience. The disk is over 20Kh old now and still works. It failed at about a thousand hours. The reason is simply that the sectors were remapped, and no more bad sectors have appeared. You just use it with some caution. As everything. However, if the utility prints a return on warranty form and you want to be cautious, you can go ahead. But no, the disks I return have much more serious failures, catastrophic. Not a few harmless bad sectors :-) The OP problem doesn't seem to be of that kind. -- Cheers / Saludos, Carlos E. R. (from 11.2 x86_64 "Emerald" GM (Elessar))
Hi Dave, Dave Howorth wrote:
I also set up a network log, as Pit suggested and that has come up trumps! The problem has moved with the disk. The last bit of the log is below; does it mean anything to anyone?
Not really - but somehow it looks as if the communication with the drive is failing. Are all disks in the set the same and approximately of the same age? I recall having trouble with WD disks when combining older and newer 750GB disks. Somehow they were very different in typical response time and this could confuse the controller (or driver?). Symptoms were either drives not being recognized at all or dropouts after some days/weeks of use.... Pit
Peter Suetterlin wrote:
Not really - but somehow it looks as if the communication with the drive is failing.
Yes, that's what it looks like to me too.
Are all disks in the set the same and approximately of the same age?
No and yes. It's a new machine with new disks. Half are Seagate and half are WD just in case there is a manufacturer-specific batch-related problem.
I recall having trouble with WD disks when combining older and newer 750GB disks. Somehow they were very different in typical response time and this could confuse the controller (or driver?). Symptoms were either drives not being recognized at all or dropouts after some days/weeks of use....
I know there can be problems with TLER with WD drives because they've deliberately broken the ATA8 spec to try to get you to buy their expensive drives instead of their cheap ones (they succeeded in this case, by accident, so I don't have that issue). But the symptoms of that are the drive dropping out of the RAID whereas this is a complete freeze, which I think is different.
My problem is compounded by the inconvenience that suse is on another partition on that drive, so when it freezes the whole system goes down.
I think what I'm going to do is:
(1) run the WD diagnostic - if it says the disk is faulty then game over, otherwise:
(2) move suse to a different disk so (a) hopefully the system doesn't fall over if there's a problem and (b) it may change the symptoms.
(3) just buy another (non-WD) disk and see if the problem goes away. (if I have to pay three times as much for a WD drive, I don't expect obscure problems like this).
At present I still don't know whether the problem is a flaky disk, a flaky motherboard, or just possibly a kernel/driver bug.
Cheers, Dave
participants (6)
- Carlos E. R.
- Dave Howorth
- Ken Schneider - openSUSE
- Mark Goldstein
- Per Jessen
- Peter Suetterlin