I've got a test box running SUSE 10.0 that has hung twice in the last 24 hours.
How should I try to troubleshoot this? The first time I rebooted and found nothing in the log.
This is the second time and it is still hung. (hung == no network activity, no ping response, no keyboard response).
It is configured like a fileserver so it has ECC RAM and hardware mirrored disks (3ware 2-port cards).
I have been lightly using the machine for a few weeks, but recently a couple of us have been doing some Ruby on Rails development. As part of that we set up FreeNX.
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
On Wed, 2006-01-04 at 19:40 -0500, Greg Freemyer wrote:
I've got a test box running SUSE 10.0 that has hung twice in the last 24 hours.
How should I try to troubleshoot this? The first time I rebooted and found nothing in the log.
This is the second time and it is still hung. (hung == no network activity, no ping response, no keyboard response).
It is configured like a fileserver so it has ECC ram and hardware mirrored disks (3ware 2-port cards).
I have been lightly using the machine for a few weeks, but recently a couple of us have been doing some Ruby on Rails development. As part of that we set up FreeNX.
Next time you leave the PC, leave it on tty10 (ctrl-alt-f10); when the errors occur, maybe something will show there. Even if the machine is frozen, there should still be something on the screen. Just a thought.
--
Ken Schneider
UNIX since 1989, linux since 1994, SuSE since 1998
On 1/4/06, Ken Schneider <suse-list@bout-tyme.net> wrote:
Next time you leave the PC, leave it on tty10 (ctrl-alt-f10); when the errors occur, maybe something will show there. Even if the machine is frozen, there should still be something on the screen. Just a thought.
When it hangs, nothing new displays on tty10. OTOH, the old info is still displayed.
It is now locking up every few minutes so I'm not going to look into the magic-sysrq thing. (I've never used that, but I've seen it recommended on some of the LKML lists.)
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
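Since neither the log nor tty10 shows anything at the moment of the hang, one low-tech way to at least pin down when each freeze happens is a heartbeat file. The following is only a sketch under assumptions: the function names, the log path and the 5-second interval are made up for illustration, and on a controller with a write cache the fsync may not guarantee the line has reached the platters.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path, now=None):
    """Append one timestamp line to `path` and ask the OS to flush it to disk."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        # Push the data past the page cache so it has a chance of
        # surviving a hard hang followed by a reset.
        os.fsync(f.fileno())
    return stamp


def last_heartbeat(path):
    """Return the last timestamp recorded in `path`, or None if the file is empty."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1] if lines else None


def run(path="/var/log/heartbeat.log", interval=5):
    """Write a heartbeat every `interval` seconds until interrupted (hypothetical defaults)."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Calling run() in the background before leaving the machine, and last_heartbeat() after the reboot, gives the time of the freeze to within one interval. That can then be lined up against cron jobs, backups or FreeNX sessions.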
On Wednesday 04 January 2006 4:40 pm, Greg Freemyer wrote:
I've got a test box running SUSE 10.0 that has hung twice in the last 24 hours.
How should I try to troubleshoot this? The first time I rebooted and found nothing in the log.
This is the second time and it is still hung. (hung == no network activity, no ping response, no keyboard response).
It is configured like a fileserver so it has ECC ram and hardware mirrored disks (3ware 2-port cards).
I have been lightly using the machine for a few weeks, but recently a couple of us have been doing some Ruby on Rails development. As part of that we set up FreeNX.
First thing I'd do is run memtest86 on it. Every time I've experienced lockups like you describe, it's been a memory issue.
Scott
--
POPFile, the OpenSource EMail Classifier http://popfile.sourceforge.net/
Linux 2.6.11.4-21.10-default x86_64 SuSE Linux 9.3 (x86-64)
On Wed, 2006-01-04 at 18:28, Scott Leighton wrote:
First thing I'd do is run memtest86 on it. Every time I've experienced lockups like you describe, it's been a memory issue.
Another possibility is CPU overheating. Be sure all your fans are running properly.
--
Jim Cunning <jcunning@cunning.ods.org>
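To check for overheating beyond looking at the fans, the temperature sensors can be polled from a script. This is a rough sketch that assumes the kernel exposes sensors through the hwmon sysfs layout (temp*_input files holding millidegrees Celsius); older kernels or boards without a loaded sensor driver may expose nothing there. The function names and the 70 C limit are placeholders, not recommended values.

```python
import glob
import os


def read_temperatures(root="/sys/class/hwmon"):
    """Return {sensor file path: degrees Celsius} for each temp*_input under root."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not reporting a number
        readings[path] = millidegrees / 1000.0
    return readings


def too_hot(readings, limit_c=70.0):
    """Return the readings at or above limit_c (the limit is an arbitrary placeholder)."""
    return {path: temp for path, temp in readings.items() if temp >= limit_c}
```

Logging the output of read_temperatures() every few seconds while the box is under load would show whether the temperature climbs before a hang.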
On 1/5/06, Jim Cunning <jcunning@cunning.ods.org> wrote:
On Wed, 2006-01-04 at 18:28, Scott Leighton wrote:
I believe in fans. All 9 are running. Are there any generic hardware diagnostics other than memtest86 I can try? I have spare everything (except possibly RAM), but I hate to have to just start randomly swapping hardware. Greg -- Greg Freemyer The Norcross Group Forensics for the 21st Century
Have you tried www.memtest.org? You can make yourself a bootable CD to test your memory from bootup.
On Thursday 05 January 2006 21:13, Greg Freemyer wrote:
On 1/5/06, Jim Cunning <jcunning@cunning.ods.org> wrote:
On Wed, 2006-01-04 at 18:28, Scott Leighton wrote:
I believe in fans. All 9 are running.
Are there any generic hardware diagnostics other than memtest86 I can try?
I have spare everything (except possibly RAM), but I hate to have to just start randomly swapping hardware.
Greg -- Greg Freemyer The Norcross Group Forensics for the 21st Century
Still hanging, but just got a 30 minute run from boot to hang!!! The longest of the day.
Current status: Memtest86 from the SUSE 10 DVD shows nothing wrong (6 passes completed). I have already swapped out RAM and CPU with another machine.
I found a cheap ($10) bootable diagnostic (TuffTest V1.53) and have tried to run it. It locks up, but behaves the same on another server I have with the same MB/CPU/memory. (My tape server, which is not used during the day.)
My next thought is to swap the motherboard or the power supply. Maybe both, since I have a spare chassis with a MB and power supply in it.
Other diagnostic suggestions welcome.
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
On Thu, 2006-01-05 at 18:10 -0500, Greg Freemyer wrote:
Still hanging, but just got a 30 minute run from boot to hang!!! The longest of the day.
Current status: Memtest86 from SUSE 10 DVD shows nothing wrong. 6 passes completed Have already swapped out RAM and CPU with another machine.
I found a cheap ($10) bootable diagnostic (TuffTest V1.53) and have tried to run it. It locks up, but behaves the same on another server I have with the same MB/CPU/memory. (My tape server which is not used during the day.)
My next thought is to swap the motherboard or the power supply. Maybe both since I have a spare chassis with a MB and power supply in it.
Other diagnostic suggestions welcome.
You mentioned in an earlier email having 9 fans, IIRC. Do you have enough of a power supply to run all of them? That could be a source of problems. I had a PC whose power supply failed, and the symptoms were very close to what you describe here: things slowly starting to fail and hang.
--
Ken Schneider
UNIX since 1989, linux since 1994, SuSE since 1998
On 1/5/06, Ken Schneider <suse-list@bout-tyme.net> wrote:
You mentioned in an earlier email having 9 fans, IIRC. Do you have enough of a power supply to run all of them? That could be a source of problems. I had a PC whose power supply failed, and the symptoms were very close to what you describe here: things slowly starting to fail and hang.
That is a real possibility, i.e., the power supply was spec'ed to be big enough and it had been working, but maybe it is failing.
Once I get my devel setup stable I will go back and troubleshoot the old setup.
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
On Thursday 05 January 2006 18:10, Greg Freemyer wrote:
Other diagnostic suggestions welcome.
Hi Greg,
Seems to me you've got things pretty well covered, especially with all those spares you have available for swapping out.
One more idea: If you check the BIOS, sometimes they have a range of 'default' configurations built in like "sedate; works with junk hardware" to "stable with brand name parts" to "you're nuts if you try this" ... sometimes you can step the memory CAS setting down from 2 or 2.5 to 3 or 3.5.
Those RAM tests are great for detecting outright faults and even some marginal regions in the chips and PCBs, but they don't even come close to simulating all the noise generated when the real OS is running.
Also, I seem to recall writing a fairly lengthy explanation here some weeks ago about why you cannot skimp on memory and why you're better off, especially with high end machines, paying the premium for name brand memory carrying a no questions asked 24 hour advance replacement warranty. Those pieces have usually been properly stress tested at the factory to confirm tight adherence to the stated performance characteristics. They're also usually labeled and sold in true matched sets.
regards,
- Carl
On 1/5/06, Carl Hartung <suselinux@cehartung.com> wrote:
On Thursday 05 January 2006 18:10, Greg Freemyer wrote:
Other diagnostic suggestions welcome.
Hi Greg,
Seems to me you've got things pretty well covered, especially with all those spares you have available for swapping out.
One more idea: If you check the BIOS, sometimes they have a range of 'default' configurations built in like "sedate; works with junk hardware" to "stable with brand name parts" to "you're nuts if you try this" ... sometimes you can step the memory CAS setting down from 2 or 2.5 to 3 or 3.5.
Those RAM tests are great for detecting outright faults and even some marginal regions in the chips and PCBs, but they don't even come close to simulating all the noise generated when the real OS is running.
Also, I seem to recall writing a fairly lengthy explanation here some weeks ago about why you cannot skimp on memory and why you're better off, especially with high end machines, paying the premium for name brand memory carrying a no questions asked 24 hour advance replacement warranty. Those pieces have usually been properly stress tested at the factory to confirm tight adherence to the stated performance characteristics. They're also usually labeled and sold in true matched sets.
regards,
- Carl
Thanks,
I just finished moving my disk drives / controllers (3ware) to the new chassis. Everything else is new/different.
Unfortunately it turns out to be a desktop MB I had, not a server MB (i.e., no ECC RAM on the new MB). OTOH the spare chassis has redundant power supplies, so I should be good to go for power.
This chassis/power supply was fairly expensive so I can't use it for devel/test, but once I have a known good setup I can start swapping in the other set of components one at a time until this machine fails.
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
Greg Freemyer wrote:
I found a cheap ($10) bootable diagnostic (TuffTest V1.53) and have tried to run it. It locks up, but behaves the same on another server I have with the same MB/CPU/memory. (My tape server which is not used during the day.)
If you want to stress the CPU to spot e.g. a bad CPU, try 'mprime' - ftp://mersenne.org/gimps/mprime2414.tar.gz . /Per Jessen, Zürich
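As I understand it, the idea behind that kind of torture test is to repeat a computation whose result is known and flag any run that disagrees. The toy sketch below only illustrates the principle and is not a substitute for mprime: it loads a single core lightly, does not exercise the FPU, and the function names and round counts are invented for the example.

```python
import hashlib


def burn_once(rounds=20000, seed=b"hang-test"):
    """Chain SHA-256 over its own output `rounds` times and return the final hex digest."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def consistency_check(passes=50, rounds=20000):
    """Repeat the same computation and count the passes that disagree with the first.

    The computation is deterministic, so on healthy hardware the count is 0.
    A nonzero count would point at CPU, RAM or power trouble.
    """
    reference = burn_once(rounds)
    mismatches = 0
    for _ in range(passes):
        if burn_once(rounds) != reference:
            mismatches += 1
    return mismatches
```

Running one copy per core for an hour or more would be closer to a real load. A hard hang during the run is itself a result, even if no mismatch is ever reported.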
participants (7)
- Carl Hartung
- Greg Freemyer
- Jim Cunning
- Ken Schneider
- Michael
- Per Jessen
- Scott Leighton