Joop Beris wrote:
Hello listmates,
I hope some of you can help me by shedding some light on the current situation. In short, I'd like to know if my machine is dying or if there is a software issue. The machine in question is about 7 years old and has recently started experiencing lock-ups, kernel errors and kernel oopses.
In /var/log/messages, I find the following things:
----------------------
Oct 23 00:00:04 magrathea kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000f60
<lots of nastiness snipped>
Oct 23 00:00:04 magrathea kernel: Call Trace:
Oct 23 00:00:04 magrathea kernel: [<c0158205>] __alloc_pages+0x60/0x2d6
Oct 23 00:00:04 magrathea kernel: [<c017e2b9>] get_locks_status+0x50/0xf0
Oct 23 00:00:04 magrathea kernel: [<c01a2a00>] locks_read_proc+0x10/0x25
Oct 23 00:00:04 magrathea kernel: [<c01a29f0>] locks_read_proc+0x0/0x25
Oct 23 00:00:04 magrathea kernel: [<c01a134f>] proc_file_read+0x10b/0x245
Oct 23 00:00:04 magrathea kernel: [<c01a1244>] proc_file_read+0x0/0x245
Oct 23 00:00:04 magrathea kernel: [<c017150c>] vfs_read+0xa6/0x12e
Oct 23 00:00:04 magrathea kernel: [<c01718ec>] sys_read+0x41/0x67
Oct 23 00:00:04 magrathea kernel: [<c0104e22>] sysenter_past_esp+0x6b/0xa9
Oct 23 00:00:04 magrathea kernel: =======================
Oct 23 00:00:04 magrathea kernel: Code: b8 31 c0 a8 01 75 05 ba cc b8 31 c0 89 d0 89 1c 24 89 44 24 08 c7 44 24 04 65 3c 31 c0 e8 a7 3d 05 00 8b 4e 18 01 c3 85 ff 74 38 <8b> 87 9c 00 00 00 8b 50 08 8b 47 20 89 4c 24 08 89 1c 24 c7 44
Oct 23 00:00:04 magrathea kernel: EIP: [<c017d37f>] lock_get_status+0x172/0x239 SS:ESP 0068:d73bdee4
<snip>
I'd be happy if someone can help me find out if this is a software problem, a hardware issue or perhaps both. I should mention that this machine is running openSUSE 10.3, with all the latest updates. It has been running flawlessly for several years, without any lock-ups or anything.
Kind regards,
Joop
Joop,

I've been through this twice in 8 years. Both times it was hardware: once RAM, and once a motherboard. To help diagnose, load mcelog. It is on your install DVD and will help identify any machine check exceptions (MCEs) you are dealing with. If it catches any, you are 99.9% assured your issue is hardware. (There are very rare instances where code can trigger an MCE. Possible, but so is McCain becoming the 44th president.)

The nvidia card/driver most likely isn't the problem, but you can easily eliminate the binary driver as a cause by editing your xorg.conf, commenting out the nvidia driver line and replacing it with the nv driver:

    # Driver "nvidia"
    Driver "nv"

Then, as root, drop to runlevel 3, remove the nvidia kernel module and install the nv kernel module:

    telinit 3    # or logout and choose console login (ctrl+N), then
    rmmod nvidia && modprobe -v nv

If you get any errors, make a copy of your current xorg.conf (do it now; save it before sax borks it) and use sax2 to generate a new one with the basic driver:

    cp /etc/X11/xorg.conf /etc/X11/xorg.conf.nvidia
    sax2

Then restart your window manager with "telinit 5", or just logout if you used console login and wait 5 seconds; it should restart on its own. That way you can completely eliminate any "taint" on the kernel, and you may get more help with the issue on the list. (If it is a software bug, you will have to do it anyway.)

After you're satisfied the video card and nvidia driver aren't the problem, just reverse the steps above and re-enable the nvidia driver. I have 3 boxes running the driver without issue.

Additionally, I don't know where I got it, but I have a SuSE "Machine Check Handling on Linux" document by Andi Kleen (2004) that helps explain MCEs a little further. You're welcome to it at:

http://www.3111skyline.com/download/linux/kernel/mce.pdf

Good luck.

-- 
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
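
P.S. For anyone who would rather script the xorg.conf edit than do it by hand, here is a rough sketch of the backup-and-swap step. It is only an illustration under a few assumptions: the function name swap_driver is made up, the Driver line is assumed to look like `Driver "nvidia"` with optional leading whitespace, and the backup is written next to the original with a .nvidia suffix, as in the cp command in the reply. It does not touch runlevels or kernel modules.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original post): back up an
# xorg.conf-style file, then comment out the Driver "nvidia" line and
# add a Driver "nv" line directly below it.
swap_driver() {
    conf="$1"
    backup="$conf.nvidia"

    # Save a copy first so the nvidia setup can be restored later.
    cp "$conf" "$backup" || return 1

    # Rewrite the file from the backup: the matching line is printed
    # once commented out, and once with "nvidia" replaced by "nv".
    awk '
        /^[ \t]*Driver[ \t]+"nvidia"/ {
            print "#" $0
            sub(/"nvidia"/, "\"nv\"")
            print
            next
        }
        { print }
    ' "$backup" > "$conf"
}
```

Run as root, it would be called as `swap_driver /etc/X11/xorg.conf`. To go back to the binary driver afterwards, copy /etc/X11/xorg.conf.nvidia over /etc/X11/xorg.conf.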