Re: [SLE] Segmentation fault + Aiee's + Oops !!

29 Feb 2000

      On Tue, 29 Feb 2000, john@vogue.demon.co.uk wrote:
...
==========================================================================
What follows is a complete history of the problems I've had over the last
2 days, as accurately as I can remember them.  The Aiee messages have been
clipped from /var/log/warn, and the Oops screen dump has been faithfully
copied down to the last digit.  /var/log/boot.msg is included at the end.
==========================================================================
Hello all,
Can someone please help me work this out.  I believe I have a hardware
problem, either in memory or the CPU, but am not sure which.  If someone
more experienced at diagnosing this type of problem on Linux can give the
following a once over, I'd be very grateful.
This is a brand new system, 1 week old.  It has an AMD Athlon 550MHz CPU,
128MB of memory, two UDMA66 disks (8GB for /boot, swap, / and 20GB 7200rpm
for other data) and two 3Com 3C905B-TX network cards.  The system is
intended to be an internet server for my home network.  It's running SuSE
6.3 (2.2.13 kernel).
The first indication something was amiss was a system freeze up during the
SuSE install.  It just stopped in the middle of installing the contents of
CD1.  I could get nothing from the install screen, but was able to change 
virtual consoles to look at the install log.  Something in the install had
fallen over with a Segmentation fault.  Restarting the install seemed
to go through ok.
The next thing I spotted was these strange errors appearing in /var/log/warn.
I've put the last two days' worth together below:
kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0
What am I looking at here?  Are these semaphores thread resources, or System
V semaphores?  What does this message mean?
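
The "03:03/360840" field in each message appears to be the device's
major:minor pair (in hex) followed by the inode number.  Major 3 is the
first IDE controller, so 03:03 would be /dev/hda3 under the usual device
numbering.  A minimal Python sketch for tallying such lines per inode,
assuming the message format shown above (the name summarize_iput is made up
for illustration):

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
IPUT_RE = re.compile(
    r"iput: Aieee, (?P<kind>.*?semaphore) in use inode "
    r"(?P<major>[0-9a-fA-F]{2}):(?P<minor>[0-9a-fA-F]{2})/"
    r"(?P<inode>\d+), count=(?P<count>-?\d+)"
)

def summarize_iput(lines):
    """Count iput warnings per (major, minor, inode) triple."""
    tally = Counter()
    for line in lines:
        m = IPUT_RE.search(line)
        if m:
            key = (
                int(m.group("major"), 16),  # device numbers are assumed hex
                int(m.group("minor"), 16),
                int(m.group("inode")),      # inode number is decimal
            )
            tally[key] += 1
    return tally

sample = [
    "kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0",
]
result = summarize_iput(sample)
```

The file behind a given inode can then be looked up on the filesystem
mounted from that device, for example with find's -inum test.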
I've also been getting quite a number of these messages too:
modprobe: modprobe: Can't locate module char-major-15
During this time I noticed more strange Segmentation faults.  A couple of
times they happened while SuSEConfig was running various scripts after
committing changes made in Yast (a really odd place to segfault).  But the
worst was while trying to build my own kernel for the first time.  I was
trying to clean up some stuff (like RAID, etc) and include support for the
various firewall options.  make menuconfig ran ok, but 'make depend' got
about half way through and then started to Segmentation fault while
processing various files.
I couldn't continue.  I have another system here, my older PC that I'd just
upgraded from SuSE 6.1 to 6.3, so I NFS mounted the kernel source tree and
copied it over to a new directory, remade the linux softlink and this time
the kernel built ok.
At this point I tried experimenting with ip_forward and ipchains. I couldn't
get it to work.  What struck me as odd is that I could ping the public
network interface card from a PC on the private network, even when
cat'ing the /proc ip_forward file clearly showed ip_forwarding to be
turned off in the kernel (i.e. it returned 0).  I can't believe that is
correct.
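
One point worth keeping in mind: a ping addressed to one of the machine's
own interfaces is answered by the machine itself and is never forwarded, so
it can succeed with ip_forward set to 0.  Only traffic destined for hosts
beyond the machine depends on the flag.  A small Python sketch for reading
the flag, assuming the usual /proc/sys/net/ipv4/ip_forward location (the
function name forwarding_enabled is illustrative only):

```python
def forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the kernel reports IPv4 forwarding as enabled.

    The path argument exists only so the function can be pointed at a
    different file for testing; the default is the standard proc entry.
    """
    with open(path) as f:
        return int(f.read().strip()) != 0
```

A more telling test than pinging the public card is to ping a host on the
far side of it from the private network, once with the flag at 0 and once
at 1.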
Anyway, I gave up on this for a while (too tired with it) and switched to
setting up squid for proxy http access.  This worked ok.
Then, this evening, as I was preparing to configure sendmail for the first
time, I fired up Yast from a telnet session and it Segmentation faulted.
On the second attempt it hung the session.  I tried to run a ps(1) to 
find the pid from a local kvt and that also hung.  So I switched to
a virtual console and tried logging in, that hung also.  Back in X-windows
I tried a shutdown -r, that segfaulted, so did reboot.  By this point I
was VERY worried.  I had no choice but to power-cycle.
After forcing a reboot the system crashed while fsck'ing hdb2, an 18GB
partition.  This is the first time it's ever had to fsck all the system
partitions in one go, as they are usually in a clean state.  It crashed with
the following Oops message:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03ff3000, %cr3 = 03ff3000
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0120409>]
EFLAGS: 00010006
eax: c1efffc0   ebx: c1efffc0   ecx: 00000000   edx: c1fbd660
esi: 00000000   edi: c40ef740   ebp: 00000282   esp: c2a8bc9c
ds: 0018   es: 0018   ss: 0018
Process fsck.ext2 (pid: 17, process nr: 10, stackpage=c2a8b000)
Stack: 00000000 00000400 c01266b9 c40ef740 00000003 c1fbd660 00000000 c0126746
       00000000 00000400 00000400 c2147000 00000342 c2a8bcdc c2a8bcdc c2a8a000
       c2a8a000 00000000 c01272b5 c2147000 00000400 00000000 00000000 00000400
Call Trace: [<c01266b9>] [<c0126746>] [<c01272b5>] [<c01260ea>] [<c0126296>] [<c0129a81>] [<c01cbfa7>] [<c01cc6a2>] [<c01d38fc>] [<c01cf97e>] [<c01d37ac>]
       [<c01247f5>] [<c01249ae>] [<c010900c>]
Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b
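
The first two bytes of the Code line can be decoded by hand.  Assuming the
dump starts at the faulting EIP (which is believed to be the case for 2.2
kernels), "8b 01" is a 32-bit load through the address held in ecx, and the
register dump shows ecx = 00000000, which is consistent with the reported
fault address.  A minimal Python sketch of that decoding; it handles only
the simple register-indirect form of opcode 0x8b, and the helper name
decode_mov_load is made up for illustration:

```python
# x86 32-bit register numbering used in ModRM bytes.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_load(code_bytes):
    """Decode 'mov r32, [r32]' (opcode 0x8b, ModRM mod=00).

    Returns (destination_register, address_register).
    Raises ValueError for any other instruction form.
    """
    opcode, modrm = code_bytes[0], code_bytes[1]
    if opcode != 0x8B:
        raise ValueError("not opcode 0x8b")
    mod = modrm >> 6          # addressing mode
    reg = (modrm >> 3) & 7    # destination register
    rm = modrm & 7            # register holding the address
    if mod != 0 or rm in (4, 5):  # 4 = SIB byte follows, 5 = 32-bit displacement
        raise ValueError("not a simple register-indirect load")
    return REG32[reg], REG32[rm]

# Code bytes and two register values copied from the Oops above.
code = bytes.fromhex("8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b")
regs = {"eax": 0xC1EFFFC0, "ecx": 0x00000000}

dest, addr_reg = decode_mov_load(code)
fault_address = regs[addr_reg]
```

Turning the EIP and Call Trace addresses into function names needs the
System.map that matches the running kernel, for instance via ksymoops.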
In the short time I've had the system it's not done this before.  What I 
believe has caused it is that I've changed the pattern of memory usage for
the system.  In setting up squid I gave it 16MB of memory cache to play 
with.  It's now running into more serious trouble than it was before.
In my experience, random segfaults like this occur due to one of three causes:
1) a portion of memory is bad.   
2) a CPU or CPU cache is bad
3) bad blocks in swap cause corruption when swapped out pages are read in
   again.
I'm not swapping at the moment so it's got to be one of the first two.
Thing is, how do I tell?  I'm used to dealing with Tru64 Alpha systems,
where the binary event log traps and logs CPU Exceptions and bad memory
reads/writes.  Here, I'm blind.  My gut tells me it's memory, but I can't
confirm it.
Is there anything I can do other than start taking bits back to the PC
supplier?
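
One rough way to provoke this kind of fault on demand is to repeat the same
deterministic computation many times and watch for differing results.  The
Python sketch below hashes one buffer repeatedly and counts mismatches.  It
is only an illustration (the name repeated_hash_check and its defaults are
arbitrary), it exercises just the memory the process happens to get, and a
clean run proves nothing; a dedicated tester such as memtest86 covers far
more of physical memory.

```python
import hashlib
import os

def repeated_hash_check(size_mb=8, rounds=20):
    """Hash the same random buffer repeatedly.

    Returns the number of rounds whose digest differs from the first
    one.  On healthy hardware this should always be 0; a non-zero count
    suggests memory, cache, or CPU trouble.
    """
    buf = os.urandom(size_mb * 1024 * 1024)
    reference = hashlib.sha256(buf).hexdigest()
    mismatches = 0
    for _ in range(rounds):
        if hashlib.sha256(buf).hexdigest() != reference:
            mismatches += 1
    return mismatches
```

Raising size_mb towards the amount of free memory and running several
copies at once makes a hit more likely.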

I had some problems of unknown origin and finally tracked them down to the
3com driver (module).
Are you using the 3com driver?
I would try not loading any of the 3com stuff and see what happens.
I downloaded the most recent driver from the NASA? (I think) website,
compiled it, and that seemed to take care of the random problems.
In my case it may have been due to an SMP problem, which you should not have.
Worth a try anyway.

--  Bob F

EMail  FBob@wt.net

A Truly Wise Man Never Plays 
 Leapfrog With A Unicorn...  

-- 
To unsubscribe send e-mail to suse-linux-e-unsubscribe@suse.com
For additional commands send e-mail to suse-linux-e-help@suse.com             
Also check the FAQ at http://www.suse.com/Support/Doku/FAQ/
