Re: [S.u.S.E. Linux] General Protection error
Going from horrible death, to lingering coma.
Laszlo
On Sun, 31 May 1998, Wolfgang Weisselberg wrote:
Hi!
Trying to kill the keyboard, laszlo@idt.net produced:
I'm getting a very regular General Protection error on the following system (set up to be a mail and www server running X). I've reformatted and reinstalled twice, with a more limited package selection each time. No difference was noted.
[...]
After every few hours, the system gets semi-locked. It responds to <alt><Function key> commands, but not to <ctrl><alt><del>. Pressing <enter> at any screen except the login prompt gives the following.
general protection error: 0000 cpu : 0 eip : 0010:[<00124c207>]
<snip>
Try it with an open chassis (and some extra fans, probably household ones ... you *must* ensure good circulation!); if that helps, it's a heat problem. Also, at the time of the crash find out how hot the CPU is (the cooler may not be attached properly). There are strips available that change colour with temperature; they often give good indications.
Worked on the overheating scenario first, before mucking with the hardware. Removed the case; the Pentium II CPU is running coolish to the touch, all cooling fans running smoothly. Increased space on the rack between towers for better ventilation.

I set up a cron job to send a mail at regular intervals to indicate when the system becomes non-responsive. Outside temperature was in the low 50s Fahrenheit. The air conditioner was turned off to remove potential brown-out conditions and power spikes. Room temperature remained cool. In spite of all this, the system became non-responsive between 4:00 AM and 4:15 AM.

The system behavior was a bit different. Consoles allow you to switch between them with <alt><F?> and allow you to work through the login sequence, reporting past failed login attempts and new mail notification. However, it seems that the shell (bash) never gets started, since there is no prompt and no keyboard/mouse response in that console. All network access (telnet, httpd, etc.) is dead; however, the server still responds to pings.

/var/log/messages shows the faxclean queue being checked (there is no queue) without fail. After my first console login attempt the following error messages were generated every couple of seconds:

Jun 2 11:13:14 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68
Jun 2 11:13:16 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68
Jun 2 11:13:17 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68
Jun 2 11:13:20 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68

Prior to all this there was an error message
Are you 110% sure your memory chip(s) is/are OK (no, "works under Win" does not count; Windows does not use the memory to its full potential)? Try with one half of the chips if possible, or with (a) completely different one(s).
Playing with the hardware is my next step.
Thanks,
Laszlo
--
To get out of this list, please send email to majordomo@suse.com with this text in its body: unsubscribe suse-linux-e
Hi! Trying to kill the keyboard, laszlo@idt.net produced:
On Sun, 31 May 1998, Wolfgang Weisselberg wrote:
[crash of system, regularly, even after reinstalls]
general protection error: 0000 cpu : 0 eip : 0010:[<00124c207>] [...]
Worked on the overheating scenario first, before mucking with the hardware. Removed the case, Pentium II cpu is running coolish to the touch, all cooling fans running smoothly. Increased space on rack between towers for better ventilation.
Coolish directly after switching on or after a few hours?
I set up a cron job to send a mail at regular intervals to indicate when the system becomes non-responsive. Outside temperature was in the low 50s Fahrenheit. The air conditioner was turned off to remove potential brown-out conditions and power spikes. Room temperature remained cool.
Ok, it ain't the A/C. Hmmm ... How reliable is your electrical power (over here you can easily get many months of uptime without a UPS), or are you using a UPS?
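The heartbeat idea quoted above (a cron job that reports in at regular intervals) can be sketched roughly as follows. This is only an illustration, not the cron job actually used: it appends timestamps to a local file instead of sending mail, and the 15-minute interval, the one-minute slack, and the names `record_heartbeat`, `load_heartbeats` and `find_gaps` are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def record_heartbeat(path, now=None):
    """Append one ISO timestamp line to `path`; meant to be run from cron."""
    now = now or datetime.now()
    with open(path, "a") as fh:
        fh.write(now.isoformat() + "\n")


def load_heartbeats(path):
    """Read back every recorded timestamp, skipping blank lines."""
    with open(path) as fh:
        return [datetime.fromisoformat(line.strip()) for line in fh if line.strip()]


def find_gaps(stamps, interval=timedelta(minutes=15), slack=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where two consecutive heartbeats
    are further apart than the assumed interval plus some slack."""
    ordered = sorted(stamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval + slack:
            gaps.append((prev, cur))
    return gaps
```

Each gap brackets the window in which the machine stopped responding, which gives a list of crash times that can be compared against the logs.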
In spite of all this, the system became non-responsive between 4:00 AM and 4:15 AM.
Have you any info on the times the system crashed previously? Is it connected to a LAN (where others could access it) or the Internet? What do the logs show around the crash time --- anything strange, any connections (apart from what you already wrote)?
The system behavior was a bit different. Consoles allow you to switch between them with <alt><F?> and allow you to work through the login sequence, reporting past failed login attempts and new mail notification. However, it seems that the shell (bash) never gets started, since there is no prompt and no keyboard/mouse response in that console. All network access (telnet, httpd, etc.) is dead; however, the server still responds to pings.
That means the kernel is somehow still working (at least partially) ... but inetd is dead.
/var/log/messages shows the faxclean queue being checked ( There is no queue ) without fail.
So that is still running? Hmmm ...
After my first console login attempt the following error messages were generated every couple of seconds:
Jun 2 11:13:14 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68
[every 2-3 secs]
That sounds bad; it means that the scheduler is unhappy. Something strange's going on. Have a look into /usr/src/linux/kernel/sched.c to see where these messages are generated ...
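To see whether it is always the same wait queue complaining, the messages can be tallied by their eip, q and *q values. A minimal sketch, assuming the log lines look exactly like the ones quoted in this thread; the function name `count_wait_queue` and the regular expression are made up for this example and are not part of any kernel tool:

```python
import re
from collections import Counter

# Matches lines such as:
#   Jun 2 11:13:14 www kernel: wait_queue is bad ( eip=0018948b ) q= 03fb8934 *q= 03f7cf68
WAIT_QUEUE_RE = re.compile(
    r"wait_queue is bad \(\s*eip=([0-9a-fA-F]+)\s*\)\s*q=\s*([0-9a-fA-F]+)\s*\*q=\s*([0-9a-fA-F]+)"
)


def count_wait_queue(lines):
    """Count 'wait_queue is bad' messages per (eip, q, *q) triple.

    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = WAIT_QUEUE_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Feeding it the lines of /var/log/messages (for example `count_wait_queue(open("/var/log/messages"))`) shows at a glance whether one address keeps repeating, as in the four lines quoted earlier, or whether several different ones turn up.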
It could be a problem with hardware ... actually, I think it is. If you have not done so already, you might want to recompile your kernel, configured exclusively for your machine.
Prior to all this there was an error message
Which one? (Come on, don't let me hang on that cliff ... :-)
Are you 110% sure your memory chip(s) is/are OK (no, works under Win does
Playing with the hardware is my next step.
Ok. -Wolfgang -- PGP 2 welcome: Mail me, subject "send PGP-key". If you've nothing at all to hide, you must be boring. Unsolicited Bulk E-Mails: *You* pay for ads you never wanted. Is our economy _so_ weak we have to tolerate SPAMMERS? I guess not.
I did just about everything I could think of. Hardware reseated, kernel recompiled, took the case off the computer. Whatever it is, it seems to be working. The system is stable. I'm going to wait and see if it remains so. Besides the actual work on the server, the main environmental difference has been the outside temperature. It has been very cool outside for the past two days. This may be giving me a 'cleaner' electrical feed. In any case, I am giving serious consideration to having another 20amp circuit installed to another UPS. Especially if the server tends to die when the outside temperature goes up. Thanks for all your help, Laszlo
participants (2)
- laszlo@idt.net
- weissel@jupiter.ph-cip.uni-koeln.de