On Tue, 29 Feb 2000, john@vogue.demon.co.uk wrote:
==========================================================================
What follows is a complete history of the problems I've had over the last 2 days, as accurately as I can remember them. The Aiee messages have been clipped from /var/log/warn, and the Oops screen dump has been faithfully copied down to the last digit. /var/log/boot.msg is included at the end.
==========================================================================
Hello all,
Can someone please help me work this out? I believe I have a hardware problem, either in memory or the CPU, but am not sure which. If someone more experienced at diagnosing this type of problem on Linux can give the following a once-over, I'd be very grateful.
This is a brand new system, 1 week old. It has an AMD Athlon 550MHz CPU, 128MB of memory, two UDMA66 disks (8GB for /boot, swap, / and 20GB 7200rpm for other data) and two 3Com 3C905B-TX network cards. The system is intended to be an internet server for my home network. It's running SuSE 6.3 (2.2.13 kernel).
The first indication something was amiss was a system freeze during the SuSE install. It just stopped in the middle of installing the contents of CD1. I could get nothing from the install screen, but was able to change virtual consoles to look at the install log. Something in the install had fallen over with a segmentation fault. Restarting the install seemed to go through ok.
The next thing I spotted was these strange errors appearing in /var/log/warn. I've put the last two days' worth together below:
kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0
What am I looking at here? Are these semaphores thread resources, or System V semaphores? What does this message mean?
I've also been getting quite a number of these messages:
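To see at a glance which inodes are affected, the log lines can be tallied per device and inode. The following is only a rough sketch: it assumes the "03:03" field is the device's major:minor number printed in hex (which on a Linux box of this kind would conventionally be the third partition of the first IDE disk), and the names AIEEE, SAMPLE and summarize are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
# Assumption: the device field is major:minor in hexadecimal.
AIEEE = re.compile(
    r"iput: Aieee, (?P<kind>.*?semaphore) in use inode "
    r"(?P<major>[0-9a-fA-F]{2}):(?P<minor>[0-9a-fA-F]{2})/(?P<inode>\d+), "
    r"count=(?P<count>-?\d+)"
)

# The six messages quoted in the post.
SAMPLE = [
    "kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0",
]


def summarize(lines):
    """Count Aieee reports per (major, minor, inode); other lines are ignored."""
    seen = Counter()
    for line in lines:
        m = AIEEE.search(line)
        if m:
            key = (int(m["major"], 16), int(m["minor"], 16), int(m["inode"]))
            seen[key] += 1
    return seen


if __name__ == "__main__":
    for (major, minor, inode), n in sorted(summarize(SAMPLE).items()):
        print(f"dev {major}:{minor} inode {inode}: {n} report(s)")
```

Feeding it the whole of /var/log/warn instead of SAMPLE would show whether the same few inodes keep coming up or whether the reports are scattered.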
modprobe: modprobe: Can't locate module char-major-15
During this time I noticed more strange segmentation faults. A couple of times they happened while SuSEConfig was running various scripts after committing changes made in Yast (a really odd place to segfault). But the worst was while trying to build my own kernel for the first time. I was trying to clean up some stuff (like RAID, etc) and include support for the various firewall options. make menuconfig ran ok, but 'make depend' got about half way through and then started to segfault while processing various files.
I couldn't continue. I have another system here, my older PC that I'd just upgraded from SuSE 6.1 to 6.3, so I NFS mounted the kernel source tree and copied it over to a new directory, remade the linux softlink and this time the kernel built ok.
At this point I tried experimenting with ip_forward and ipchains. I couldn't get it to work. What struck me as odd is that I could ping the public network interface card from a PC on the private network, even when cat'ing the /proc ip_forward file clearly showed IP forwarding to be turned off in the kernel (i.e. it returned 0). I can't believe that is correct.
Anyway, I gave up on this for a while (too tired with it) and switched to setting up squid for proxy http access. This worked ok.
Then, this evening, as I was preparing to configure sendmail for the first time, I fired up Yast from a telnet session and it segfaulted. On the second attempt it hung the session. I tried to run a ps(1) to find the pid from a local kvt and that also hung. So I switched to a virtual console and tried logging in; that hung also. Back in X-windows I tried a shutdown -r, which segfaulted, and so did reboot. By this point I was VERY worried. I had no choice but to power-cycle.
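For reference, here is a small sketch of checking the forwarding flag from a script. The path /proc/sys/net/ipv4/ip_forward is an assumption (the text above only says "the /proc ip_forward file"), and forwarding_enabled is a name invented for this example. Bear in mind that a ping addressed to one of the machine's own IP addresses is normally answered by the machine itself, whichever interface the packet arrives on, so it may not exercise forwarding at all; pinging a host beyond the public interface would be a more telling test.

```python
from pathlib import Path

# Assumed location of the flag; the post does not give the full path.
IP_FORWARD = Path("/proc/sys/net/ipv4/ip_forward")


def forwarding_enabled(path=IP_FORWARD):
    """Return True if the file at `path` holds a non-zero integer."""
    return int(path.read_text().strip()) != 0


if __name__ == "__main__":
    try:
        state = "on" if forwarding_enabled() else "off"
    except OSError as exc:  # e.g. the file does not exist on this system
        state = f"unknown ({exc})"
    print("IPv4 forwarding:", state)
```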
After forcing a reboot the system crashed while fsck'ing hdb2, an 18GB partition. This is the first time it's had to fsck all the system partitions in one go, as they are usually in a clean state. It crashed with the following Oops message:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03ff3000, %cr3 = 03ff3000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0120409>]
EFLAGS: 00010006
eax: c1efffc0 ebx: c1efffc0 ecx: 00000000 edx: c1fbd660
esi: 00000000 edi: c40ef740 ebp: 00000282 esp: c2a8bc9c
ds: 0018 es: 0018 ss: 0018
Process fsck.ext2 (pid: 17, process nr: 10, stackpage=c2a8b000)
Stack: 00000000 00000400 c01266b9 c40ef740 00000003 c1fbd660 00000000 c0126746
       00000000 00000400 00000400 c2147000 00000342 c2a8bcdc c2a8bcdc c2a8a000
       c2a8a000 00000000 c01272b5 c2147000 00000400 00000000 00000000 00000400
Call Trace: [<c01266b9>] [<c0126746>] [<c01272b5>] [c01260ea>] [<c0126296>] [<c0129a81>] [<c01cbfa7>]
       1cc6a2>] [<c01d38fc>] [<c01cf97e>] [<c01d37ac>] [<c01247f5>] [<c01249ae>] [<c010900c>]
Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b
In the short time I've had the system it's not done this before. What I believe has caused it is that I've changed the pattern of memory usage for the system. In setting up squid I gave it 16MB of memory cache to play with. It's now running into more serious trouble than it was before.
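The bracketed addresses in an Oops only become meaningful once they are matched against the symbol table of the kernel that produced them. A minimal sketch of that lookup follows, assuming a System.map in the usual "address type name" layout. The three symbols in EXAMPLE_MAP are invented purely for illustration and are not taken from any real kernel; load_system_map and resolve are likewise hypothetical helper names. A dedicated tool such as ksymoops does this job far more thoroughly.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the last symbol at or below `address`, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, address - base


# Hypothetical System.map excerpt: names and addresses are made up.
EXAMPLE_MAP = [
    "c0120000 T example_func_a",
    "c0120400 T example_func_b",
    "c0126000 T example_func_c",
]

if __name__ == "__main__":
    symbols = load_system_map(EXAMPLE_MAP)
    # EIP value from the Oops above, resolved against the made-up map.
    name, offset = resolve(symbols, 0xC0120409)
    print(f"{name}+0x{offset:x}")
```

With the real System.map from the running kernel in place of EXAMPLE_MAP, the same lookup could be applied to the EIP and to each complete entry in the call trace.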
In my experience, random segfaults like this occur due to one of three causes:
1) a portion of memory is bad
2) a CPU or CPU cache is bad
3) bad blocks in swap cause corruption when swapped out pages are read in again
I'm not swapping at the moment so it's got to be one of the first two. Thing is, how do I tell? I'm used to dealing with Tru64 Alpha systems, where the binary event log traps and logs CPU Exceptions and bad memory reads/writes. Here, I'm blind. My gut tells me it's memory, but I can't confirm it.
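One cheap first check is to fill a large buffer with known bit patterns and read them back. The sketch below does that from user space; pattern_test and its parameters are names chosen for this example. It is a weak test at best: a user process cannot choose which physical pages it gets, and the CPU caches may satisfy the reads, so a clean run proves very little, while any reported mismatch is a strong hint of bad hardware. A standalone tester booted outside the OS (memtest86, for instance) is likely to be far more thorough.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA), passes=1):
    """Fill a buffer with each pattern, read it back, and return any mismatches.

    Each mismatch is reported as (pass_number, pattern, offset, value_read).
    Note that roughly 2 * size_bytes of memory is used, because the expected
    pattern is held alongside the buffer under test.
    """
    buf = bytearray(size_bytes)
    errors = []
    for p in range(passes):
        for pat in patterns:
            expected = bytes([pat]) * size_bytes
            buf[:] = expected          # write the pattern
            if buf != expected:        # read back and compare
                for off, (got, want) in enumerate(zip(buf, expected)):
                    if got != want:
                        errors.append((p, pat, off, got))
    return errors


if __name__ == "__main__":
    # 16MB, the same size as the squid memory cache mentioned above.
    bad = pattern_test(16 * 1024 * 1024, passes=2)
    if bad:
        for p, pat, off, got in bad[:20]:
            print(f"pass {p}: wrote 0x{pat:02x}, read 0x{got:02x} at offset {off}")
    else:
        print("no mismatches found (which does not prove the memory is good)")
```

Running it repeatedly with a buffer close to the amount of free memory raises the odds of touching a bad region, at the cost of pushing the system towards swap.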
Is there anything I can do other than start taking bits back to the PC supplier?
I had some problems of unknown origin and finally tracked it down to the 3com driver (module). Are you using the 3com driver? I would try not loading any of the 3com stuff and see what happens. I did download the most recent driver from the NASA? (I think) website, compiled it and that seemed to take care of the random problems. It may have been due to a problem with SMP which you should not have. Worth a try anyway.
--
Bob F
EMail FBob@wt.net
A Truly Wise Man Never Plays Leapfrog With A Unicorn...
--
To unsubscribe send e-mail to suse-linux-e-unsubscribe@suse.com
For additional commands send e-mail to suse-linux-e-help@suse.com
Also check the FAQ at http://www.suse.com/Support/Doku/FAQ/