On Wed, 2006-09-27 at 15:18 -0400, Peter Sjoberg wrote:
> On Wed, 2006-09-27 at 08:46 -0400, Carl Hartung wrote:
> > On Wednesday 27 September 2006 04:56, Peter Sjoberg wrote:
> > > Normally I would agree, but this happens to be my primary server, so I put some extra $$ into it and have matched pairs of Kingston DDR PC3200/ECC/REG (KVR400D8R3AK2/1G) modules.
> > I hope you didn't pay /too/ much of a premium, Peter. ;-)
> At the time I got them, no-name 512MB DDR3200 (no ECC or REG) was around $65-$75 a stick, or $260-$300 for 2GB. I got two kits (2x 2x512MB = 2GB) for $320, so it wasn't that much extra.
> > The 'VR' in that part number stands for "Value RAM", which is Kingston's "industry standard" product line... meaning essentially 'generic' designs built using chips purchased on the open market so they can sell competitively 'down market'... as compared to their more expensive premium product line.
> Didn't know the whole story, but I guessed that "Value" wasn't the same as "Premium".
> > In any event, even buying premium parts from a well known and established manufacturer only *improves* the likelihood of a successful outcome. It *doesn't* guarantee that every part will operate perfectly fresh from the factory. There are more reasons for this than I have room or time to elaborate on here.
> > > Since ECC is enabled, I would expect it to complain somewhere if it discovered ECC errors.
> > ECC only covers selected regions of a much larger spectrum of fault possibilities. I've been in the industry for almost 20 years and actually worked for a high end memory manufacturer in Silicon Valley before production moved offshore... so I know a little bit about how these things work. ;-)
> > > One possibility is that all these BIOS ECC parameters are not optimal. I don't know enough about ECC scrub, direction, 4-bit, DCACHE etc., so I left them all at their defaults and just verified that ECC was enabled. The mobo is a Tyan K8SE/S2892 and I have installed the latest BIOS version.
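Something I still want to check is whether the kernel's EDAC counters ever tick; a quick python sketch (this assumes a kernel with EDAC support for this chipset, and the sysfs layout may differ between versions):

  #!/usr/bin/env python
  # Sketch: print EDAC corrected/uncorrected ECC error counters.
  # Requires a kernel with an EDAC driver loaded for this chipset;
  # the sysfs layout below is the usual one but may vary.
  import glob

  paths = glob.glob('/sys/devices/system/edac/mc/mc*/csrow*/ce_count')
  paths += glob.glob('/sys/devices/system/edac/mc/mc*/csrow*/ue_count')
  if not paths:
      print('no EDAC counters found - driver not loaded or no EDAC support')
  for p in sorted(paths):
      f = open(p)
      print('%s: %s' % (p, f.read().strip()))
      f.close()

Nonzero ce_count with no log noise would at least prove ECC is correcting things silently.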
> > > Also, as a test, I tried to provoke the system to hang by compiling the kernel in a loop; it worked fine for 35h.
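For the record, the compile loop was nothing fancy; roughly this (the tree path and -j2 are just examples):

  #!/usr/bin/env python
  # Sketch: burn-in by rebuilding the kernel in a loop; flaky hardware
  # usually throws sig11s or failed passes long before 35h are up.
  # Assumes a configured tree in /usr/src/linux - adjust to taste.
  import subprocess, time

  TREE = '/usr/src/linux'
  n = 0
  while True:
      n += 1
      start = time.time()
      for cmd in (['make', '-C', TREE, 'clean'],
                  ['make', '-C', TREE, '-j2']):
          rc = subprocess.call(cmd)
          if rc != 0:
              raise SystemExit('pass %d: %s failed (rc=%d)'
                               % (n, ' '.join(cmd), rc))
      print('pass %d ok, %.0fs' % (n, time.time() - start))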
> > This is a compelling factor and you could be 100% right. However, given the classic nature of the symptoms, if /I/ were managing this problem I'd wholesale swap the modules out with a premium set from another established manufacturer... even a set borrowed from another machine just as a test.
> > Your time and this system's downtime *must* be costing a lot more than the delta in price.
> Nope, this is a home server. I have my mail, NFS, /home, LDAP, Samba etc. on it, but when it's down the only impact is on the rest of the family, and the mail gets queued on a different server.
> The nature of the problem is that it can go anywhere between 12h and 18 days between hangs, which makes it hard to declare it fixed. One downside is that it seems to happen more often lately; I haven't had an uptime over 3 days for a while, which of course points towards bad hardware (since the software hasn't changed). Swapping I could do, but I have nothing to swap with, and buying more memory just as a test is not an option (though if it gets approved by the wife department I might add 4x1GB to it).
> > If you try this and the problem goes away, Kingston might even credit your existing purchase towards an upgrade to their premium line so they don't lose the sale. This is particularly true if they think you'll end up returning the parts for cause... which could happen if another brand solves the problem.
> I'm planning on letting it run memtest86 for a while, but I don't expect to find anything there.
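In the meantime, a crude userspace pattern test can at least exercise some RAM while the box stays up. A sketch (the size is made up; keep it below free RAM, and note this is nowhere near what memtest86 does from bare metal, since the kernel picks the physical pages):

  #!/usr/bin/env python
  # Sketch: crude userspace RAM pattern test. Allocate chunks filled
  # with a known pattern, then read them back and compare; a bit flip
  # shows up as a mismatch. Caching can mask errors, so treat a pass
  # as weak evidence only.
  SIZE = 256 * 1024 * 1024    # bytes to test per pattern (assumed value)
  CHUNK = 1024 * 1024

  for pat in ('\x00', '\xff', '\xaa', '\x55'):
      expected = pat * CHUNK
      chunks = []
      for i in range(SIZE // CHUNK):
          chunks.append(pat * CHUNK)          # allocate and fill
      bad = [i for i, c in enumerate(chunks) if c != expected]
      if bad:
          print('pattern %r: mismatch in chunks %s' % (pat, bad))
      else:
          print('pattern %r: ok' % pat)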
> > Just some things to think about... YMMV and all that.
> Thanks anyway.
> What I'm looking for is other ways to diagnose a system hang. If it were some kind of memory or hardware error, I would expect it to go a bit differently each time, but at least the last 5 times it has been just about exactly the same: a dead hang where the only SysRq that works is "b". To me this points a little towards some OS issue. It could be the raid drivers (running sw raid5), the NIC bonding drivers, or one of the many other things that are running, and I would like to know at least what area to look in. The only thing I have excluded so far is VMware Server: I was running it, but at one point I removed it from the startup scripts, so after the next crash it didn't start (leaving the kernel untainted), and the server still died anyway.
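One cheap trick I have since picked up for narrowing down a dead hang: a heartbeat logger, so the last flushed line shows when the box died and what it was doing at the time. A sketch (the log path and interval are made up, adjust to taste):

  #!/usr/bin/env python
  # Sketch: heartbeat logger. After a freeze, the last line in the log
  # shows roughly when the machine died and what the load looked like.
  import os, time

  log = open('/var/log/heartbeat.log', 'a')
  while True:
      load = open('/proc/loadavg').read().strip()
      log.write('%s load %s\n'
                % (time.strftime('%Y-%m-%d %H:%M:%S'), load))
      log.flush()
      os.fsync(log.fileno())    # make sure it actually hits the disk
      time.sleep(60)

Logging a few lines of /proc/diskstats as well would help correlate hangs with disk activity.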
> With the latest BIOS update came a watchdog function, but I didn't find any driver for it, so for the moment I have to turn it off or it will reboot the machine after the given watchdog time.

Just an update: I found a watchdog program and installed it. I didn't have to wait long before it got tested, and it worked as designed, rebooting the system when it hung. Looking at the timing of the hangs, I started to see that they often happened during heavy disk activity (backups, some mirror scripts etc.), so I started to suspect something in the disk drivers (running sw raid1 & raid5). I was already on the latest kernel, but then a new one came out (2.6.16.21-0.25-smp) so I upgraded, especially after seeing in the changelog that some raid fixes were included. Since the kernel upgrade, the system has now been running for 7 days, whereas previously it froze 3-5 times/week. I have even pushed a few extra backups, mirror scripts etc. through it, and it all seems to stay up.
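For anyone curious, the watchdog program essentially boils down to this (a sketch; the device name and timeout depend on the driver, e.g. softdog vs. the on-board chip):

  #!/usr/bin/env python
  # Sketch of what a watchdog daemon does. Opening /dev/watchdog arms
  # the timer; any write resets it; if we stop writing (kernel hang,
  # wedged userspace), the timer expires and the hardware reboots the box.
  import os, time

  fd = os.open('/dev/watchdog', os.O_WRONLY)
  try:
      while True:
          os.write(fd, '\0')   # "pet" the dog; must happen within the timeout
          time.sleep(10)
  finally:
      os.write(fd, 'V')        # magic char: request a clean disarm on close
      os.close(fd)             # (ignored if the driver was built with nowayout)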
I'll leave the watchdog stuff running, and I've learned a bit about troubleshooting, so it's not all in vain.
hth & regards,
Carl --------------------------------------------------------------------- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org