On Wed, 2006-09-27 at 15:18 -0400, Peter Sjoberg wrote:
> On Wed, 2006-09-27 at 08:46 -0400, Carl Hartung wrote:
> > On Wednesday 27 September 2006 04:56, Peter Sjoberg wrote:
> > > Normally I would agree, but this happens to be my primary server, so I put some extra $$ into it and have matched pairs of Kingston DDR PC3200/ECC/REG (KVR400D8R3AK2/1G) modules.
> > I hope you didn't pay /too/ much of a premium, Peter. ;-)
> At the time I got them, no-name 512MB DDR3200 (no ECC or REG) was around $65-$75 a stick, or $260-$300 for 2GB. I got two kits (2x 2x512MB = 2GB) for $320, so it wasn't that much extra.
> > The 'VR' in that part number stands for "Value RAM", which is Kingston's "industry standard" product line... meaning essentially 'generic' designs built using chips purchased on the open market so they can sell competitively 'down market'... as compared to their more expensive premium product line.
> Didn't know the whole story, but I guessed that "Value" wasn't the same as "Premium".
> > In any event, even buying premium parts from a well known and established manufacturer only *improves* the likelihood of a successful outcome. It *doesn't* guarantee that every part will operate perfectly fresh from the factory. There are more reasons for this than I have room or time to elaborate on here.
> > > Since ECC is enabled, I would expect it to complain somewhere if it discovered ECC errors.
> > ECC only covers selected regions of a much larger spectrum of fault possibilities. I've been in the industry for almost 20 years and actually worked for a high end memory manufacturer in Silicon Valley before production moved offshore... so I know a little bit about how these things work. ;-)
> > > One possibility is that all these BIOS ECC parameters are not optimal. I don't know enough about ECC scrub, direction, 4-bit, DCACHE etc., so I left them all at their defaults and just verified that ECC was enabled. The mobo is a Tyan K8SE/S2892 and I have installed the latest BIOS version.
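Something I still want to check is whether the kernel's EDAC counters ever tick; a quick python sketch (this assumes a kernel with EDAC support for this chipset, and the sysfs layout may differ between versions):

  #!/usr/bin/env python
  # Sketch: print EDAC corrected/uncorrected ECC error counters.
  # Requires a kernel with an EDAC driver loaded for this chipset;
  # the sysfs layout below is the usual one but may vary.
  import glob

  paths = glob.glob('/sys/devices/system/edac/mc/mc*/csrow*/ce_count')
  paths += glob.glob('/sys/devices/system/edac/mc/mc*/csrow*/ue_count')
  if not paths:
      print('no EDAC counters found - driver not loaded or no EDAC support')
  for p in sorted(paths):
      f = open(p)
      print('%s: %s' % (p, f.read().strip()))
      f.close()

Nonzero ce_count with no log noise would at least prove ECC is correcting things silently.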
> > > Also, as a test, I tried to provoke the system to hang by compiling the kernel in a loop; it worked fine for 35h.
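For the record, the compile loop was nothing fancy; roughly this (the tree path and -j2 are just examples):

  #!/usr/bin/env python
  # Sketch: burn-in by rebuilding the kernel in a loop; flaky hardware
  # usually throws sig11s or failed passes long before 35h are up.
  # Assumes a configured tree in /usr/src/linux - adjust to taste.
  import subprocess, time

  TREE = '/usr/src/linux'
  n = 0
  while True:
      n += 1
      start = time.time()
      for cmd in (['make', '-C', TREE, 'clean'],
                  ['make', '-C', TREE, '-j2']):
          rc = subprocess.call(cmd)
          if rc != 0:
              raise SystemExit('pass %d: %s failed (rc=%d)'
                               % (n, ' '.join(cmd), rc))
      print('pass %d ok, %.0fs' % (n, time.time() - start))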
> > This is a compelling factor and you could be 100% right. However, given the classic nature of the symptoms, if /I/ were managing this problem I'd wholesale swap the modules out with a premium set from another established manufacturer... even a set borrowed from another machine just as a test.
> > Your time and this system's downtime *must* be costing a lot more than the delta in price.
> Nope, this is a home server. I have my mail, NFS, /home, LDAP, Samba etc. on it, but when it's down the only impact is on the rest of the family, and the mail gets queued on a different server.
> The nature of the problem is that it can go anywhere between 12h and 18 days between hangs, which makes it hard to declare it fixed. One downside is that it seems to happen more often lately; I haven't had an uptime over 3 days for a while, which of course points towards bad hardware (since the software hasn't changed). Swapping I could do, but I have nothing to swap with, and buying more memory just as a test is not an option (though if it gets approved by the wife department I might add 4x1GB to it).
> > If you try this and the problem goes away, Kingston might even credit your existing purchase towards an upgrade to their premium line so they don't lose the sale. This is particularly true if they think you'll end up returning the parts for cause... which could happen if another brand solves the problem.
> I'm planning on letting it run memtest86 for a while, but I don't expect to find anything there.
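In the meantime, a crude userspace pattern test can at least exercise some RAM while the box stays up. A sketch (the size is made up; keep it below free RAM, and note this is nowhere near what memtest86 does from bare metal, since the kernel picks the physical pages):

  #!/usr/bin/env python
  # Sketch: crude userspace RAM pattern test. Allocate chunks filled
  # with a known pattern, then read them back and compare; a bit flip
  # shows up as a mismatch. Caching can mask errors, so treat a pass
  # as weak evidence only.
  SIZE = 256 * 1024 * 1024    # bytes to test per pattern (assumed value)
  CHUNK = 1024 * 1024

  for pat in ('\x00', '\xff', '\xaa', '\x55'):
      expected = pat * CHUNK
      chunks = []
      for i in range(SIZE // CHUNK):
          chunks.append(pat * CHUNK)          # allocate and fill
      bad = [i for i, c in enumerate(chunks) if c != expected]
      if bad:
          print('pattern %r: mismatch in chunks %s' % (pat, bad))
      else:
          print('pattern %r: ok' % pat)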
> > Just some things to think about... YMMV and all that.
> Thanks anyway.
> What I'm looking for is other ways to diagnose a system hang. If it were some kind of memory or hardware error, I would expect it to go a bit differently each time, but at least the last 5 times it has been just about exactly the same: a dead hang where the only SysRq that works is "b". To me this points a little towards some OS issue. It could be the raid drivers (running sw raid5), the NIC bonding drivers, or one of the many other things that are running, and I would like to know at least what area to look in. The only thing I have excluded so far is VMware Server: I was running it, but at one point I removed it from the startup scripts, so after the next crash it didn't start (leaving the kernel untainted), and the server still died anyway.
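One cheap trick I have since picked up for narrowing down a dead hang: a heartbeat logger, so the last flushed line shows when the box died and what it was doing at the time. A sketch (the log path and interval are made up, adjust to taste):

  #!/usr/bin/env python
  # Sketch: heartbeat logger. After a freeze, the last line in the log
  # shows roughly when the machine died and what the load looked like.
  import os, time

  log = open('/var/log/heartbeat.log', 'a')
  while True:
      load = open('/proc/loadavg').read().strip()
      log.write('%s load %s\n'
                % (time.strftime('%Y-%m-%d %H:%M:%S'), load))
      log.flush()
      os.fsync(log.fileno())    # make sure it actually hits the disk
      time.sleep(60)

Logging a few lines of /proc/diskstats as well would help correlate hangs with disk activity.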
> With the latest BIOS update came a watchdog function, but I didn't find any driver for it, so for the moment I have to turn it off or it will reboot the machine after the given watchdog time.

Just an update: I found a watchdog program and installed it. I didn't have to wait long before it got tested, and it worked as designed, rebooting the system when it hung. Looking at the timing of the hangs, I started to see that they often happened during heavy disk activity (backups, some mirror scripts etc.), so I started to suspect something in the disk drivers (running sw raid1 & raid5). I was already on the latest kernel, but then a new one came out (2.6.16.21-0.25-smp) so I upgraded, especially after seeing in the changelog that some raid fixes were included. Since the kernel upgrade, the system has now been running for 7 days, whereas previously it froze 3-5 times/week. I have even pushed a few extra backups, mirror scripts etc. through it, and it all seems to stay up.
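For anyone curious, the watchdog program essentially boils down to this (a sketch; the device name and timeout depend on the driver, e.g. softdog vs. the on-board chip):

  #!/usr/bin/env python
  # Sketch of what a watchdog daemon does. Opening /dev/watchdog arms
  # the timer; any write resets it; if we stop writing (kernel hang,
  # wedged userspace), the timer expires and the hardware reboots the box.
  import os, time

  fd = os.open('/dev/watchdog', os.O_WRONLY)
  try:
      while True:
          os.write(fd, '\0')   # "pet" the dog; must happen within the timeout
          time.sleep(10)
  finally:
      os.write(fd, 'V')        # magic char: request a clean disarm on close
      os.close(fd)             # (ignored if the driver was built with nowayout)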
I'll leave the watchdog stuff running, and I've learned a bit about troubleshooting, so it's not all in vain.
hth & regards,
Carl --------------------------------------------------------------------- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org