Mailinglist Archive: opensuse (3222 mails)
Re: [SLE] Segmentation fault + Aiee's + Oops !!
- From: FBob@xxxxxx (BobF)
- Date: Tue, 29 Feb 2000 17:45:08 -0600
- Message-id: <0002291752090E.01725@desk1>
On Tue, 29 Feb 2000, john@xxxxxxxxxxxxxxxxx wrote:
> What follows is a complete history of the problems I've had over the last
> 2 days, as accurately as I can remember them. The Aiee messages have been
> clipped from /var/log/warn, and the Oops screen dump has been faithfully
> copied down to the last digit. /var/log/boot.msg is included at the end.
> Hello all,
> Can someone please help me work this out. I believe I have a hardware
> problem, either in memory or the CPU, but am not sure which. If someone
> more experienced at diagnosing this type of problem on Linux can give the
> following a once over, I'd be very grateful.
> This is a brand new system, 1 week old. It has an AMD Athlon 550Mhz CPU,
> 128MB of memory, two UDMA66 disks (8GB for /boot, swap, / and 20GB 7200rpm
> for other data) and two 3Com 3C905B-TX network cards. The system is
> intended to be an internet server for my home network. It's running SuSE
> 6.3 (2.2.13 kernel).
> The first indication something was amiss was a system freeze up during the
> SuSE install. It just stopped in the middle of installing the contents of
> CD1. I could get nothing from the install screen, but was able to change
> virtual consoles to look at the install log. Something in the install had
> fallen over with a Segmentation fault. Restarting the install again seemed
> to go through ok.
> The next thing I spotted was these strange errors appearing in /var/log/warn.
> I've put the last two days worth together below:
> kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
> kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0
> kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0
> kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0
> kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0
> kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0
> What am I looking at here? Are these semaphores thread resources, or System
> V semaphores? What does this message mean?
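A rough way to see how many distinct inodes are involved is to group the warnings by the device and inode number they report. The sketch below is only an illustration: it assumes the `MAJOR:MINOR/INODE` field is printed with hexadecimal device numbers, and the helper name `group_by_inode` is made up for this example. If that assumption holds, major 3, minor 3 would conventionally be the third partition on the first IDE disk (/dev/hda3).

```python
import re
from collections import defaultdict

# Matches the "inode MAJOR:MINOR/NUMBER" part of the iput warnings quoted above.
INODE_RE = re.compile(r"inode ([0-9a-fA-F]{2}):([0-9a-fA-F]{2})/(\d+)")


def group_by_inode(lines):
    """Map (major, minor, inode) -> number of warnings that mention it."""
    counts = defaultdict(int)
    for line in lines:
        match = INODE_RE.search(line)
        if match:
            # Assumption: the device numbers are printed in hexadecimal.
            major = int(match.group(1), 16)
            minor = int(match.group(2), 16)
            counts[(major, minor, int(match.group(3)))] += 1
    return dict(counts)


# The six warnings quoted in the message.
SAMPLE = [
    "kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0",
    "kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0",
    "kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0",
]

if __name__ == "__main__":
    for (major, minor, inode), n in sorted(group_by_inode(SAMPLE).items()):
        print(f"device {major}:{minor} inode {inode}: {n} warnings")
```

All six warnings refer to the same device and to only three inodes, each reported once per semaphore type.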
> I've also been getting quite a number of these messages too:
> modprobe: modprobe: Can't locate module char-major-15
> During this time I noticed more strange Segmentation faults. A couple of
> times while SuSEConfig runs various scripts after committing changes made
> in Yast (a really odd place to segfault). But the worst was while trying
> to build my own kernel for the first time. I was trying to clean up some
> stuff (like RAID, etc) and include support for the various firewall
> options. make menuconfig ran ok, but 'make depend' got about half
> way through and then started to Segmentation fault processing various files.
> I couldn't continue. I have another system here, my older PC that I'd just
> upgraded from SuSE 6.1 to 6.3, so I NFS mounted the kernel source tree and
> copied it over to a new directory, remade the linux softlink and this time
> the kernel built ok.
> At this point I tried experimenting with ip_forward and ipchains. I couldn't
> get it to work. What struck me as odd is that I could ping the public
> network interface card from a PC on the private network, even when
> cat'ing the /proc ip_forward file clearly showed ip_forwarding to be
> turned off in the kernel (i.e. it returned 0). I can't believe that is
> Anyway, I gave up on this for a while (too tired with it) and switched to
> setting up squid for proxy http access. This worked ok.
> Then, this evening, as I was preparing to configure sendmail for the first
> time, I fired up Yast from a telnet session and it Segmentation faulted.
> On the second attempt it hung the session. I tried to run a ps(1) to
> find the pid from a local kvt and that also hung. So I switched to
> a virtual console and tried logging in, that hung also. Back in X-windows
> I tried a shutdown -r, that segfaulted, so did reboot. By this point I
> was VERY worried. I had no choice but to power-cycle.
> After forcing a reboot the system crashed while fsck'ing hdb2, an 18GB
> partition. This is the first time it's ever had to fsck all system partitions
> contiguously before as they are usually in a clean state. It crashed with
> the following Oops message:
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> current->tss.cr3 = 03ff3000, %cr3 = 03ff3000
> *pde = 00000000
> Oops: 0000
> CPU: 0
> EIP: 0010:[<c0120409>]
> EFLAGS: 00010006
> eax: c1efffc0 ebx: c1efffc0 ecx: 00000000 edx: c1fbd660
> esi: 00000000 edi: c40ef740 ebp: 00000282 esp: c2a8bc9c
> ds: 0018 es: 0018 ss: 0018
> Process fsck.ext2 (pid: 17, process nr: 10, stackpage=c2a8b000)
> Stack: 00000000 00000400 c01266b9 c40ef740 00000003 c1fbd660 00000000 c0126746
> 00000000 00000400 00000400 c2147000 00000342 c2a8bcdc c2a8bcdc c2a8a000
> c2a8a000 00000000 c01272b5 c2147000 00000400 00000000 00000000 00000400
> Call Trace: [<c01266b9>] [<c0126746>] [<c01272b5>] [<c01260ea>] [<c0126296>] [<c0129a81>] [<c01cbfa7>] 1cc6a2>] [<c01d38fc>] [<c01cf97e>] [<c01d37ac>]
> [<c01247f5>] [<c01249ae>] [<c010900c>]
> Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b
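The addresses in the call trace only become readable once they are matched against the symbol table of the kernel that produced them. The sketch below shows the basic idea, which is roughly what tools such as ksymoops automate: for each address, find the nearest symbol at or below it in a System.map-style listing. The symbol names in the usage example are invented placeholders, not the real 2.2.13 symbols, and the function names are hypothetical.

```python
import bisect
import re

# Well-formed trace entries look like [<c01266b9>]; damaged ones are skipped.
ADDR_RE = re.compile(r"\[<([0-9a-f]{8})>\]")

# The start of the call trace from the Oops quoted above.
SAMPLE_TRACE = "Call Trace: [<c01266b9>] [<c0126746>] [<c01272b5>]"


def trace_addresses(text):
    """Extract the call-trace addresses from Oops text as integers."""
    return [int(addr, 16) for addr in ADDR_RE.findall(text)]


def load_symbols(lines):
    """Parse System.map-style lines ('ADDRESS TYPE NAME') into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(addr, symbols):
    """Return (name, offset) for the nearest symbol at or below addr, else None."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return name, addr - start


if __name__ == "__main__":
    # Placeholder symbols for illustration only; use the System.map that
    # belongs to the running kernel in practice.
    fake_map = ["c0126600 T placeholder_a", "c0126700 T placeholder_b"]
    symbols = load_symbols(fake_map)
    for addr in trace_addresses(SAMPLE_TRACE):
        print(f"{addr:08x} -> {resolve(addr, symbols)}")
```

The lookup is only meaningful against the System.map of the exact kernel build that crashed; a map from a different build would give misleading names.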
> In the short time I've had the system it's not done this before. What I
> believe has caused it is that I've changed the pattern of memory usage for
> the system. In setting up squid I gave it 16MB of memory cache to play
> with. It's now running into more serious trouble than it was before.
> In my experience, random segfaults like this occur due to one of three causes:
> 1) a portion of memory is bad.
> 2) a CPU or CPU cache is bad
> 3) bad blocks in swap cause corruption when swapped out pages are read in
> I'm not swapping at the moment so it's got to be one of the first two.
> Thing is, how do I tell? I'm used to dealing with Tru64 Alpha systems,
> where the binary event log traps and logs CPU Exceptions and bad memory
> reads/writes. Here, I'm blind. My gut tells me it's memory, but I can't
> confirm it.
> Is there anything I can do other than start taking bits back to the PC
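One crude way to probe the "bad memory" theory from user space is to fill a large buffer with known bit patterns and read it back. The sketch below is only an illustration of that idea, with a made-up function name. A user-space program cannot choose which physical pages it gets, and CPU caches may hide a fault, so a clean run proves very little; a mismatch, on the other hand, would be a strong hint that hardware is at fault.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern, read it back, and list any mismatches.

    Returns a list of (offset, expected, actual) tuples; an empty list means
    every byte read back as written.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected  # write the pattern over the whole buffer
        if buf != expected:  # fast whole-buffer comparison first
            for offset, (got, want) in enumerate(zip(buf, expected)):
                if got != want:
                    errors.append((offset, want, got))
    return errors


if __name__ == "__main__":
    # 64 MB is an arbitrary choice, about half of the 128 MB in the machine.
    bad = pattern_test(64 * 1024 * 1024)
    print(f"{len(bad)} mismatching bytes")
```

Running it several times in a row, while the machine is otherwise busy, gives it a better chance of landing on different physical pages.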
I had some problems of unknown origin and finally tracked it down to the 3com
Are you using the 3com driver?
I would try not loading any of the 3com stuff and see what happens.
I did download the most recent driver from the NASA? (I think) website,
compiled it and that seemed to take care of the random problems.
It may have been due to a problem with SMP which you should not have.
Worth a try anyway.
-- Bob F
A Truly Wise Man Never Plays
Leapfrog With A Unicorn...