[SLE] Segmentation fault + Aiee's + Oops !!
==========================================================================
What follows is a complete history of the problems I've had over the last 2 days, as accurately as I can remember them. The Aiee messages have been clipped from /var/log/warn, and the Oops screen dump has been faithfully copied down to the last digit. /var/log/boot.msg is included at the end.
==========================================================================

Hello all,

Can someone please help me work this out? I believe I have a hardware problem, either in memory or the CPU, but am not sure which. If someone more experienced at diagnosing this type of problem on Linux can give the following a once-over, I'd be very grateful.

This is a brand new system, 1 week old. It has an AMD Athlon 550MHz CPU, 128MB of memory, two UDMA66 disks (8GB for /boot, swap, / and 20GB 7200rpm for other data) and two 3Com 3C905B-TX network cards. The system is intended to be an internet server for my home network. It's running SuSE 6.3 (2.2.13 kernel).

The first indication something was amiss was a system freeze during the SuSE install. It just stopped in the middle of installing the contents of CD1. I could get nothing from the install screen, but was able to change virtual consoles to look at the install log. Something in the install had fallen over with a Segmentation fault. Restarting the install seemed to go through ok.

The next thing I spotted was these strange errors appearing in /var/log/warn. I've put the last two days' worth together below:

kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/360840, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/228759, count=0
kernel: iput: Aieee, semaphore in use inode 03:03/555542, count=0
kernel: iput: Aieee, atomic write semaphore in use inode 03:03/555542, count=0

What am I looking at here? Are these semaphores thread resources, or System V semaphores? What does this message mean?

I've also been getting quite a number of these messages:

modprobe: modprobe: Can't locate module char-major-15

During this time I noticed more strange Segmentation faults. A couple of times they happened while SuSEConfig was running various scripts after committing changes made in YaST (a really odd place to segfault). But the worst was while trying to build my own kernel for the first time. I was trying to clean up some stuff (like RAID, etc) and include support for the various firewall options. make menuconfig ran ok, but 'make depend' got about half way through and then started to Segmentation fault processing various files.

I couldn't continue. I have another system here, my older PC that I'd just upgraded from SuSE 6.1 to 6.3, so I NFS mounted the kernel source tree and copied it over to a new directory, remade the linux softlink and this time the kernel built ok.

At this point I tried experimenting with ip_forward and ipchains. I couldn't get it to work. What struck me as odd is that I could ping the public network interface card from a PC on the private network, even when cat'ing the /proc ip_forward file clearly showed ip_forwarding to be turned off in the kernel (i.e. it returned 0). I can't believe that is correct.

Anyway, I gave up on this for a while (too tired with it) and switched to setting up squid for proxy http access. This worked ok.

Then, this evening, as I was preparing to configure sendmail for the first time, I fired up YaST from a telnet session and it Segmentation faulted. On the second attempt it hung the session. I tried to run a ps(1) to find the pid from a local kvt and that also hung. So I switched to a virtual console and tried logging in; that hung also. Back in X-windows I tried a shutdown -r, which segfaulted, and so did reboot. By this point I was VERY worried. I had no choice but to power-cycle.

After forcing a reboot the system crashed while fsck'ing hdb2, an 18GB partition. This is the first time it has had to fsck all the system partitions in one go, as they are usually in a clean state. It crashed with the following Oops message:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 03ff3000, %cr3 = 03ff3000
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0120409>]
EFLAGS: 00010006
eax: c1efffc0   ebx: c1efffc0   ecx: 00000000   edx: c1fbd660
esi: 00000000   edi: c40ef740   ebp: 00000282   esp: c2a8bc9c
ds: 0018   es: 0018   ss: 0018
Process fsck.ext2 (pid: 17, process nr: 10, stackpage=c2a8b000)
Stack: 00000000 00000400 c01266b9 c40ef740 00000003 c1fbd660 00000000 c0126746
       00000000 00000400 00000400 c2147000 00000342 c2a8bcdc c2a8bcdc c2a8a000
       c2a8a000 00000000 c01272b5 c2147000 00000400 00000000 00000000 00000400
Call Trace: [<c01266b9>] [<c0126746>] [<c01272b5>] [c01260ea>] [<c0126296>] [<c0129a81>] [<c01cbfa7>]
       1cc6a2>] [<c01d38fc>] [<c01cf97e>] [<c01d37ac>] [<c01247f5>] [<c01249ae>] [<c010900c>]
Code: 8b 01 89 03 85 c0 74 2b 8b 73 04 85 f6 75 10 89 19 89 c8 2b

In the short time I've had the system it's not done this before. What I believe has caused it is that I've changed the pattern of memory usage for the system. In setting up squid I gave it 16MB of memory cache to play with. It's now running into more serious trouble than it was before.

In my experience, random segfaults like this occur due to one of three causes:

1) a portion of memory is bad.
2) a CPU or CPU cache is bad.
3) bad blocks in swap cause corruption when swapped-out pages are read in again.

I'm not swapping at the moment so it's got to be one of the first two. Thing is, how do I tell? I'm used to dealing with Tru64 Alpha systems, where the binary event log traps and logs CPU exceptions and bad memory reads/writes. Here, I'm blind. My gut tells me it's memory, but I can't confirm it.

Is there anything I can do other than start taking bits back to the PC supplier?

Regards,
John

--
To unsubscribe send e-mail to suse-linux-e-unsubscribe@suse.com
For additional commands send e-mail to suse-linux-e-help@suse.com
Also check the FAQ at http://www.suse.com/Support/Doku/FAQ/
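A note on reading the iput warnings quoted in this message: the field after "inode" appears to be the device number (major:minor, printed in hex) followed by the inode number. Here is a minimal Python sketch of that decoding, assuming that format and the standard IDE numbering (block major 3 for the first IDE channel, 64 minors per drive). The function names are made up for illustration and are not part of any kernel tool.

```python
import re

# Matches the device and inode fields in messages such as
# "iput: Aieee, semaphore in use inode 03:03/360840, count=0".
INODE_RE = re.compile(r"inode ([0-9a-fA-F]{2}):([0-9a-fA-F]{2})/(\d+)")


def decode_inode_ref(line):
    """Return (major, minor, inode_number) from an iput warning, or None."""
    match = INODE_RE.search(line)
    if match is None:
        return None
    major = int(match.group(1), 16)  # assumption: both fields are hex
    minor = int(match.group(2), 16)
    return major, minor, int(match.group(3))


def ide_device_name(major, minor):
    """Map block major 3 (first IDE channel) to a /dev name; None otherwise."""
    if major != 3:
        return None
    drive = "hda" if minor < 64 else "hdb"  # 64 minors per drive
    partition = minor % 64                  # 0 means the whole disk
    return "/dev/" + drive + (str(partition) if partition else "")


line = "kernel: iput: Aieee, semaphore in use inode 03:03/360840, count=0"
major, minor, inode = decode_inode_ref(line)
print(ide_device_name(major, minor), inode)
```

Under those assumptions, all six warnings point at the same partition on the first disk, which says where the affected inodes live but not why the kernel complained about them.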
On Tue, 29 Feb 2000, john@vogue.demon.co.uk wrote:
[...]
I had some problems of unknown origin and finally tracked it down to the 3com driver (module). Are you using the 3com driver? I would try not loading any of the 3com stuff and see what happens. I did download the most recent driver from the NASA? (I think) website, compiled it and that seemed to take care of the random problems. It may have been due to a problem with SMP which you should not have. Worth a try anyway.

--
Bob F  EMail FBob@wt.net
A Truly Wise Man Never Plays Leapfrog With A Unicorn...
Whoops, forgot the boot.msg

Inspecting /boot/System.map-2.2.13
Loaded 8436 symbols from /boot/System.map-2.2.13.
Symbols match kernel version 2.2.13.
No module symbols loaded.
klogd 1.3-3, log source = ksyslog started.
<4>Linux version 2.2.13 (root@wireless-065-099) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Mon Feb 28 23:25:25 GMT 2000
<4>Detected 548944745 Hz processor.
<4>Console: colour VGA+ 80x25
<4>Calibrating delay loop... 547.23 BogoMIPS
<4>Memory: 63776k/66496k available (1224k kernel code, 416k reserved, 1032k data, 48k init, 0k bigmem)
<4>DENTRY hash table entries: 16384 (order: 5, 131072 bytes)
<4>Buffer-cache hash table entries: 65536 (order: 6, 262144 bytes)
<4>Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
<5>VFS: Diskquotas version dquot_6.4.0 initialized
<4>L1 I Cache: 64K  L1 D Cache: 64K
<4>L2 Cache: 512K
<4>CPU: AMD AMD Athlon(tm) Processor stepping 01
<6>Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
<6>Checking 'hlt' instruction... OK.
<4>POSIX conformance testing by UNIFIX
<4>mtrr: v1.35a (19990819) Richard Gooch (rgooch@atnf.csiro.au)
<4>PCI: PCI BIOS revision 2.10 entry at 0xfb480
<4>PCI: Using configuration type 1
<4>PCI: Probing PCI hardware
<4>PCI: Enabling I/O for device 00:00
<6>Linux NET4.0 for Linux 2.2
<6>Based upon Swansea University Computer Society NET3.039
<6>NET4: Unix domain sockets 1.0 for Linux NET4.0.
<6>NET4: Linux TCP/IP 1.0 for NET4.0
<6>IP Protocols: ICMP, UDP, TCP, IGMP
<4>Initializing RT netlink socket
<4>Starting kswapd v 1.5
<6>Detected PS/2 Mouse Port.
<4>pty: 256 Unix98 ptys configured
<6>Real Time Clock Driver v1.09
<4>RAM disk driver initialized: 16 RAM disks of 64000K size
<6>loop: registered device at major 7
<6>Uniform Multi-Platform E-IDE driver Revision: 6.20
<4>AMD7409: IDE controller on PCI bus 00 dev 39
<4>AMD7409: not 100% native mode: will probe irqs later
<4>    ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:DMA, hdb:DMA
<4>    ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:DMA, hdd:pio
<4>hda: Maxtor 90871U2, ATA DISK drive
<4>hdb: Maxtor 92049U6, ATA DISK drive
<4>hdc: DELTA OPC-K101/ST1 F/W by OIPD, ATAPI CDROM drive
<4>ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
<4>ide1 at 0x170-0x177,0x376 on irq 15
<6>hda: Maxtor 90871U2, 8297MB w/512kB Cache, CHS=1057/255/63
<6>hdb: Maxtor 92049U6, 19544MB w/2048kB Cache, CHS=2491/255/63
<4>hdc: ATAPI 40X CD-ROM drive, 128kB Cache, UDMA(33)
<6>Uniform CDROM driver Revision: 2.56
<6>Floppy drive(s): fd0 is 1.44M
<6>FDC 0 is a post-1991 82077
<4>md driver 0.36.6 MAX_MD_DEV=4, MAX_REAL=8
<4>scsi : 0 hosts.
<4>scsi : detected total.
<6>3Com 3c90x Version 1.0.0 1999 <linux_driver@3com.com>
<4>Partition check:
<4> hda: hda1 hda2 hda3
<4> hdb: hdb1 hdb2
<4>VFS: Mounted root (ext2 filesystem) readonly.
<4>Freeing unused kernel memory: 48k freed
<6>Adding Swap: 136544k swap-space (priority -1)
Kernel logging (ksyslog) stopped.
Kernel log daemon terminating.
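The bracketed addresses in an Oops can be matched against the System.map file that klogd reports loading in the log above. Below is a rough Python sketch of that lookup, assuming the usual "address type name" layout of System.map. The symbol names in the sample are invented placeholders, not real 2.2.13 symbols, and a dedicated tool such as ksymoops does this job far more thoroughly.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a list of (address, name), sorted."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the closest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first symbol
    base, name = symbols[index]
    return name, address - base


# Hypothetical entries, only to show the mechanics of the lookup.
SAMPLE_MAP = [
    "c0120000 T example_func_a",
    "c0120400 T example_func_b",
    "c0126000 T example_func_c",
]

symbols = load_system_map(SAMPLE_MAP)
print(resolve(symbols, 0xC0120409))  # EIP value taken from the Oops
```

On the real machine one would pass an open /boot/System.map-2.2.13 file object to load_system_map instead of the sample list. The result is only meaningful if the map belongs to the kernel that produced the Oops.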
On Tue, 29 Feb 2000, BobF wrote:
I had some problems of unknown origin and finally tracked it down to the 3com driver (module). Are you using the 3com driver? I would try not loading any of the 3com stuff and see what happens. I did download the most recent driver from the NASA? (I think) website, compiled it and that seemed to take care of the random problems. It may have been due to a problem with SMP which you should not have. Worth a try anyway.
I'll look into it and see if I can get a more recent driver. I'm not convinced this is the problem though, as:

1) The first segfault I ever saw on this box occurred while installing off CD1.
2) The Oops crash occurred during fsck.ext2, before the system even hit single user mode. The network cards would not have been active at that point.

Thanks for the suggestion.

Regards,
John
Well, I'm encouraged. Thanks to BobF for pointing me in the right direction.

I found the home page for Memtest86, pulled it off, built it, installed it in LILO and booted it. So far it's found 89 errors in just two passes. Both times the errors came from test 6. I'm going to leave it running overnight to see how many it clocks up.

In any case, I think I've got some solid evidence to take the memory back to the supplier, who fortunately is just 5 minutes' drive away.

I'll write again in a few days' time when I have more news.

Regards,
John
* BobF (FBob@wt.net) [20000301 00:53]:

Bob, while I appreciate your post, could you please reduce the quote in the future? There was no need to quote 132 lines of mail for the answer you gave.

Please remember that there are people who pay quite dearly for their internet access (here in Germany e.g. the cheapest is about 2 cents per minute, flat rates are about $70 per month).

cheerio
Philipp

--
Philipp Thomas <pthomas@suse.de>
SuSE GmbH, Deutschherrenstrasse 15-29, 90429 Nuremberg

The only difference between a bug and a feature is you can turn a feature off.
* john@vogue.demon.co.uk (john@vogue.demon.co.uk) [20000301 08:51]:
1) The first segfault I ever saw on this box occurred while installing off CD1.
2) The Oops crash occurred during fsck.ext2, before the system even hit single user mode. The network cards would not have been active at that point.
John, what disks are you using? There are known issues with some UDMA disks (notably some WD drives). Which kernel are you using?

On a general note, it's normally a good idea to be as specific on the hardware involved as one can be (processor, mobo, type of disks etc.). This helps people trying to spot the possible culprit.

Philipp

--
Philipp Thomas <pthomas@suse.de>
SuSE GmbH, Deutschherrenstrasse 15-29, 90429 Nuremberg

The only difference between a bug and a feature is you can turn a feature off.
* john@vogue.demon.co.uk (john@vogue.demon.co.uk) [20000301 00:28]:
modprobe: modprobe: Can't locate module char-major-15
To find out which device such messages relate to, have a look at /usr/src/linux/Documentation/devices.txt (if you've got the kernel sources installed). If you don't have the sources installed, you'll find the same text in /usr/doc/packages/lx_docu (or something similar, I don't remember the exact location right now).

Regarding the memory: is this one 128 meg module or two 64 meg ones? If the latter, make sure they're of the exact same type and manufacturer. Also check memory timing settings in the BIOS. Some BIOSes fail to set it up correctly, resulting in just the errors you observed.

Philipp

--
Philipp Thomas <pthomas@suse.de>
SuSE GmbH, Deutschherrenstrasse 15-29, 90429 Nuremberg

The only difference between a bug and a feature is you can turn a feature off.
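The devices.txt lookup described here can also be scripted. What follows is a small Python sketch, assuming entries in that file start with a major number followed by "char" or "block" and a description. The sample text only imitates that layout and should not be taken as a quotation of the real file, and the helper names are hypothetical.

```python
import re

# Assumed entry layout: "<major> <char|block> <description>".
ENTRY_RE = re.compile(r"^\s*(\d+)\s+(char|block)\s+(\S.*)$")


def alias_to_major(alias):
    """Split a module alias such as 'char-major-15' into ('char', 15)."""
    kind, _, number = alias.partition("-major-")
    return kind, int(number)


def find_major(text, kind, major):
    """Return the description for the given kind and major, or None."""
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match and match.group(2) == kind and int(match.group(1)) == major:
            return match.group(3).strip()
    return None


# Illustrative sample only; check the devices.txt shipped with your kernel.
SAMPLE = (
    " 14 char\tOpen Sound System (OSS)\n"
    " 15 char\tJoystick\n"
    "\t\t  0 = /dev/js0\tFirst analog joystick\n"
)

kind, major = alias_to_major("char-major-15")
print(find_major(SAMPLE, kind, major))
```

To run it against the real file, read /usr/src/linux/Documentation/devices.txt into a string and pass that in place of SAMPLE.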
On Thu, 02 Mar 2000, Philipp Thomas wrote:
* BobF (FBob@wt.net) [20000301 00:53]:
Bob, while I appreciate your post, could you please reduce the quote in the future? There was no need to quote 132 lines of mail for the answer you gave.
Please remember that there are people who pay quite dearly for their internet access (here in Germany e.g. the cheapest is about 2 cents per minute, flat rates are about $70 per month).
How can I cash in on this racket???

On a more serious note, I try to quote relevant parts of an email so that someone reading one of many messages in an archive will be able to get a full picture of the problem and proposed solutions and see if the thread is pertinent to whatever they are researching. I would encourage others to do so too. I have found in dealing with computers over the past 31 years that there always seems to be too little information available to troubleshoot a problem rather than too much. Quite a few of the posts to this list are lacking in this respect. I seem to recall that in the next message of this thread there was a request for additional info.

My apologies to those who have had to unwillingly pay for my $.02 worth.

Bob F  EMail FBob@wt.net
A Truly Wise Man Never Plays Leapfrog With A Unicorn...
On Thu, 2 Mar 2000, Philipp Thomas wrote:
what disks are you using? There are known issues with some UDMA disks (notably some WD drives). Which kernel are you using?
They are Maxtor drives. It's a moot point now though.

The result from running Memtest86 overnight was 16 passes (complete runs), with 748 failures generated by test 6. I got the supplier to swap the memory (a single 128MB simm) yesterday, reran the tests (no faults this time) and now all my problems appear to have gone away. No more segmentation faults rebuilding the kernel, or anywhere else so far for that matter.

Thanks for the advice, folks.

Cheers,
John
[Oops! I just stumbled on a bunch of apparently unsent messages. Sorry if this is a duplicate.]

BobF <FBob@wt.net> writes:
On Thu, 02 Mar 2000, Philipp Thomas wrote:
while I appreciate your post, could you please reduce the quote in the future? [...]
I try to quote relevant parts of an email so that someone reading one of many messages in an archive will be able to get a full picture of the problem [...] I would encourage others to do so too.
I'm not taking a position in this mini-debate, except to say that proper quoting is an art, and a difficult one.

I just hate it when someone repeats the full original message, quoted, in a reply. Of course, one may argue for relevance (this is the message to which one replies, isn't it? :-), yet pure laziness is usually the genuine, real source of the phenomenon.

On the other hand, giving no context at all is not a good thing either. When someone writes to me, without any kind of quote: "Thanks for your reply." or "I agree with you.", I usually do not have much idea of what it is about. Luckily enough, some mail user agents insert References in the mail header while building a reply, which may be used to recover the original message, but many do not. The full original message is usually too much anyway.

Best is to learn how to quote parsimoniously.

--
François Pinard   http://www.iro.umontreal.ca/~pinard