[Bug 231220] New: NFS related (?) kernel oops (Highmem machine)
https://bugzilla.novell.com/show_bug.cgi?id=231220
Summary: NFS related (?) kernel oops (Highmem machine)
Product: openSUSE 10.2
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: seife@novell.com
QAContact: qa@suse.de
OtherBugsDependingOnThis: 215208
I got this oops when copying a large file from a SLES10 NFS export to my local hard drive (all fs are ext3 on this machine). The machine has 2GB of RAM.
--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #1 from seife@novell.com 2007-01-01 10:23 MST -------
Created an attachment (id=111234) --> (https://bugzilla.novell.com/attachment.cgi?id=111234&action=view)
oops
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #2 from seife@novell.com 2007-01-01 10:26 MST -------
The machine also locks up hard (no sysrq, nothing) quite often when transferring large amounts of data via NFS. This did not happen before the machine was upgraded to 2GB (it had 768MB before).
https://bugzilla.novell.com/show_bug.cgi?id=231220
kkeil@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |seife@novell.com
------- Comment #3 from kkeil@novell.com 2007-01-02 09:03 MST -------
Hmm 768 MB -> 2GB maybe only make a difference with not full 32 bit DMA and wrong DMA settings for the network card. Or bad memory.
Some more info about the HW are needed for further investigation. Please also run memtest for a night to rule out bad memory (or wrong memory timing).
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #4 from seife@novell.com 2007-01-02 10:00 MST -------
(In reply to comment #3)
> Hmm 768 MB -> 2GB maybe only make a difference with not full 32 bit DMA and wrong DMA settings for the network card.
> Some more info about the HW are needed for further investigation.
I will attach hwinfo --all.
The machine is a Panasonic Toughbook CF-51, Pentium M 1600MHz, i855 chipset with Realtek 8169 NIC.
> Or bad memory. Please also run memtest for a night to rule out bad memory (or wrong memory timing).
Hehe, i expected that. I ran memtest last night already, it ran 10 hours, 9 passes, without running into any trouble. The machine was rock solid before the memory upgrade and if i do not do huge NFS transfers, it is still rock solid.
Note that i only got an oops once; most of the time this happens, it just locks up hard.
https://bugzilla.novell.com/show_bug.cgi?id=231220
seife@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|seife@novell.com            |
------- Comment #5 from seife@novell.com 2007-01-02 10:01 MST -------
Created an attachment (id=111297) --> (https://bugzilla.novell.com/attachment.cgi?id=111297&action=view)
hwinfo --all --log hwinfo.log
https://bugzilla.novell.com/show_bug.cgi?id=231220
kkeil@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |seife@novell.com
------- Comment #6 from kkeil@novell.com 2007-01-02 10:25 MST -------
Did you use the wired (r8169) or wireless (ipw2200) network?
https://bugzilla.novell.com/show_bug.cgi?id=231220
seife@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|seife@novell.com            |
------- Comment #7 from seife@novell.com 2007-01-02 11:09 MST -------
wired. It is not possible to create considerable network traffic via a wireless interface ;-)) I only have a 100mbit network, though.
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #8 from kkeil@novell.com 2007-01-02 11:54 MST -------
There are no known DMA problems with the r8169 and I do not see anything suspicious in the driver source. Since these chips are used in many places I do not think it is the r8169 driver.
To rule out NFS, can you try to reproduce with other high traffic (maybe copy some CD/DVD images via ftp or scp)?
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #9 from seife@novell.com 2007-01-03 06:11 MST -------
I only got the oops once. I can not prove it is NFS-related, i just guessed that since i am seeing hard hangs quite frequently when copying via NFS and writing to the local ext3 drive. I speculated that this might have to do with highmem, buffering and actually running out of lowmem.
I will try to wget some large DVD images via http or ftp and see if the machine hangs up there, too.
https://bugzilla.novell.com/show_bug.cgi?id=231220
gregkh@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |seife@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=231220
seife@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|seife@novell.com            |
------- Comment #10 from seife@novell.com 2007-01-14 05:19 MST -------
I have rebuilt my filesystems, they were all with 1k blocksize due to the yast2 bug until close-to-goldmaster-10.2. I can no longer get the machine to hang now (but it was not too clearly reproducible before...).
My _uneducated_ guess what happened is:
- write performance was abysmal due to ext3 with 1k blocksize
- nfs reads with 10mb/second into (highmem?) fs cache
- ext3 tries to write out, performance is bad
- something in ext3 needs lowmem (inode cache maybe? something that needs much more space due to the 1k blocksize?)
- now somewhere in a driver (with interrupts turned off, because it does not even react to sysrq) memory needs to be allocated, but it does not succeed. Maybe the lowmem accounting is wrong somewhere, so that there _should_ be memory free to use, but isn't?
- With the 4k blocksize it does not happen, because the disk simply is much faster writing out than the network is filling the buffers?
With the 1k blocksize, the machine was also very "jumpy" before locking up, means that i had latencies of 10's of seconds in the console when doing anything (ls for example), which would IMO support the "memory pressure" theory.
This could of course be totally off topic :-)
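One way to test the "memory pressure" theory from comment #10 would be to watch the lowmem counters in /proc/meminfo while the NFS copy runs. The Python sketch below is not part of the bug report: the function names and the sample values are made up for illustration, and the LowTotal/LowFree fields may be absent on kernels without highmem (for example 64-bit ones), in which case the helper returns None.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lowmem_free_fraction(fields):
    """Return LowFree / LowTotal, or None if the kernel does not report them."""
    total = fields.get("LowTotal")
    free = fields.get("LowFree")
    if not total or free is None:
        return None
    return free / total


# Hypothetical sample, NOT taken from the machine in this report.
SAMPLE = """MemTotal:      2074000 kB
HighTotal:     1179000 kB
HighFree:       600000 kB
LowTotal:       895000 kB
LowFree:          8950 kB
"""

sample_fraction = lowmem_free_fraction(parse_meminfo(SAMPLE))
```

On a live system one could feed the contents of /proc/meminfo to parse_meminfo() once per second during the transfer; a LowFree fraction that collapses shortly before a hang would be consistent with the theory, although it would not prove it.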
https://bugzilla.novell.com/show_bug.cgi?id=231220
seife@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |npiggin@novell.com
------- Comment #11 from seife@novell.com 2007-01-15 04:49 MST -------
can this: http://permalink.gmane.org/gmane.linux.file-systems/12636 ([patch 0/10] buffered write deadlock fix) have something to do with this bug?
https://bugzilla.novell.com/show_bug.cgi?id=231220
lmb@novell.com changed:
           What    |Removed                                    |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com  |npiggin@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #12 from npiggin@novell.com 2007-01-21 21:44 MST -------
What's happened in the oops is that something invalid has been inserted into the pagecache radix tree. This would not completely surprise me if it were an NFS problem, but it could very well also be a single-bit flip if the hardware is flakey (ie. where a NULL turned into 0x800). If the oops cannot be reproduced, then it will be difficult to solve.
Regarding the hangs -- 1k block size will use more "buffer heads" and will exercise some code paths that 4k block size does not. However it is interesting and unexpected that performance is so much worse in this (not terribly difficult) workload. Does the problem still occur on 1K filesystems?
Finally, the deadlock problems are quite rare, and the result should not be a hung system in most cases (just a few hung programs), so I doubt that it is the cause of your hangs.
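The remark about buffer heads can be illustrated with some simple arithmetic. The Python sketch below is an illustration added here, not something from the report: it assumes the 4 KiB page size of i386, assumes one buffer head per filesystem block of a cached page, and uses made-up helper names. It says nothing about how many bytes each buffer head costs.

```python
PAGE_SIZE = 4096  # assumption: i386 page size in bytes


def buffer_heads_per_page(block_size, page_size=PAGE_SIZE):
    """Number of block-sized buffers needed to cover one page."""
    if block_size <= 0 or page_size % block_size != 0:
        raise ValueError("block size must evenly divide the page size")
    return page_size // block_size


def buffer_heads_for_cache(cache_bytes, block_size, page_size=PAGE_SIZE):
    """Buffer heads needed if cache_bytes of file data all carry buffers."""
    pages = -(-cache_bytes // page_size)  # ceiling division
    return pages * buffer_heads_per_page(block_size, page_size)


# Example: 1 GiB of cached file data (a hypothetical figure).
one_gib = 1 << 30
heads_1k = buffer_heads_for_cache(one_gib, 1024)
heads_4k = buffer_heads_for_cache(one_gib, 4096)
```

Under these assumptions a 1k filesystem needs four times as many buffer heads as a 4k filesystem for the same amount of cached data. Since buffer heads are kernel allocations that come from lowmem, this would fit the direction of the guess in comment #10, without confirming it.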
https://bugzilla.novell.com/show_bug.cgi?id=231220
------- Comment #13 from seife@novell.com 2007-01-22 01:09 MST -------
I have no 1k filesystems anymore (which really improved the performance of this machine), and it has not hung anymore.
The hangs could theoretically also be caused by a rogue driver (i am using the nozomi driver for my 3g datacard, but i only experienced hangs with it when i actually used the card), but actually the only thing that has changed in the last two weeks was that i changed the blocksize of all my filesystems. Oh - in case this might matter - i am using cryptoloop for my home, which also had 1k blocksize before...
Well, maybe this bug report is an intertwingled mess up of unrelated issues, i am even not sure at all that the oops has anything to do with the former hard lockups, and since i changed the block size, i have not had any problems.
As i understand it, the only way a memory problem could lock the machine hard would be a failing memory allocation with interrupts turned off... Maybe this is my call to fame to audit all the paths where this could happen in the kernel... once i have learned enough to actually read more than the most trivial kernel code.
Executive summary: feel free to close this one as "worksforme". I will try to reproduce it, but other stuff has surely higher priority for now. Maybe somewhen we will stumble about something that will explain why this happened :-)
https://bugzilla.novell.com/show_bug.cgi?id=231220
npiggin@novell.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Major                       |Minor
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME
------- Comment #14 from npiggin@novell.com 2007-01-22 17:09 MST -------
OK, if you can reproduce the hangs or bad performance then please open a new one. Thanks!