[opensuse-kernel] x86_86 kernel lock-ups with recent (post Beta1) kernels
I have been encountering intermittent lockups while running recent 2.6.22.3-*-default kernels on an HP dv6400 laptop (Turion64 X2 TL-56 processor). These kernels were sync'd out on Factory some time after Friday (17 Aug); I recall a 2.6.22.3-2, and the current one is kernel-default-2.6.22.3-5. I did not have this problem with the previous -default kernels from Factory.

I am now running kernel-vanilla-2.6.22.3-20070818105014 (from /pub/projects/kotd/HEAD) without encountering any lockups.

# rpm -q kernel-vanilla --changelog | head -3
* Sat Aug 18 2007 - trenn@suse.de
- patches.arch/acpi_autoloading_ia64_hp_fix.patch: Use acpi_device_id for IA64 HP driver, otherwise those fail to boot.

I have tried "nmi_watchdog=2 crashkernel=..." to try to get a dump that would shed some light on the issue, but no luck.

I have noticed some boot issues stemming from boot.clock, so I added a file /etc/modprobe.d/rtc which contains:

blacklist rtc_cmos
blacklist rtc_core
blacklist rtc_lib

Even with the above blacklisted, the system still has intermittent lockups with the -default kernel, sometimes shortly after I log in. Other times it takes over an hour before a lockup occurs.

Any tips or suggestions?

# rpm -q kernel-default-2.6.22.3-5 --changelog | head -3
* Fri Aug 17 2007 - teheo@suse.de
- patches.drivers/scsi-throttle-SG_DXFER_TO_FROM_DEV-warning-better: SCSI: throttle SG_DXFER_TO_FROM_DEV warning message better

# rpm -q kernel-vanilla-2.6.22.3-20070818105014 --changelog | head -7
* Sat Aug 18 2007 - trenn@suse.de
- patches.arch/acpi_autoloading_ia64_hp_fix.patch: Use acpi_device_id for IA64 HP driver, otherwise those fail to boot.

* Fri Aug 17 2007 - teheo@suse.de
- patches.drivers/scsi-throttle-SG_DXFER_TO_FROM_DEV-warning-better: SCSI: throttle SG_DXFER_TO_FROM_DEV warning message better

--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-kernel+help@opensuse.org
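The blacklist file described in the message above can be generated rather than typed by hand. The following is only a minimal sketch: make_blacklist is a hypothetical helper name, and the module names and target path are the ones quoted in the message.

```shell
#!/bin/sh
# Sketch: emit one modprobe "blacklist" line per module name given as argument.
# make_blacklist is a hypothetical helper, not part of any package.
make_blacklist() {
    for module in "$@"; do
        printf 'blacklist %s\n' "$module"
    done
}

# Example (as root), using the path from the message above:
#   make_blacklist rtc_cmos rtc_core rtc_lib > /etc/modprobe.d/rtc
```

Printing to standard output keeps the helper free of side effects; the redirect decides where the lines end up.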
* Warren Stockton <wns@comcast.net> [2007-08-21 21:52]:
I have tried "nmi_watchdog=2 crashkernel=..." to try to get a dump that will shed some light on the issue but no luck.
Did you also install the kdump userspace programs (kexec-tools) and enable kdump via "chkconfig kdump on"?

Also, please file a bug report in Bugzilla (http://bugzilla.novell.com).

Thanks,
Bernhard
On Tuesday 21 August 2007 13:54:40 Bernhard Walle wrote:
> Did you also install kdump userspace programs (kexec-tools) and enabled kdump via "chkconfig kdump on"?

Yes... If it would just Oops, I would be all set.

# /etc/init.d/kdump status
kdump kernel loaded

# crash
...
KERNEL: /boot/vmlinux-2.6.22.3-5-default
DUMPFILE: /dev/mem
...
crash> nmi_watchdog
nmi_watchdog = $1 = 2
> Also, please file a bugreport in Bugzilla (http://bugzilla.novell.com).

Will do...
-- Warren Stockton mailto:wns@comcast.net
On Aug 21 2007 13:52, Warren Stockton wrote:
Subject: [opensuse-kernel] x86_86 kernel lock-ups with recent (post Beta1) kernels
I really want to have an x86_86 too! It's got 34% more bits per register, it must be good. SCNR, the other time I saw x86_86 was on the git usage survey...
Even with the above blacklisted, the system still has intermittent lockups with the -default kernel, sometimes shortly after I get logged in. Other times it will take over an hour before a lockup occurs.
Any tips or suggestions?
Trying without NO_HZ, without VOLUNTARY_PREEMPT, well, and perhaps trying to catch a kernel oops, should there be one on the console.

Jan
On Tuesday 21 August 2007 13:57:33 Jan Engelhardt wrote:
> Trying without NO_HZ, without VOLUNTARY_PREEMPT, well, and perhaps trying to catch a kernel oops, should there be one on the console.

# egrep "NO_HZ|PREEMPT" /boot/config-2.6.22.3-*
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT_NONE is not set
/boot/config-2.6.22.3-20070818105014-vanilla:CONFIG_PREEMPT_VOLUNTARY=y
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT is not set
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT_BKL is not set
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT_NONE is not set
/boot/config-2.6.22.3-5-default:CONFIG_PREEMPT_VOLUNTARY=y
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT is not set
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT_BKL is not set

NO_HZ is not present in either of the kernels.

I just checked another system running Beta1 packages (from DVD.iso):

# egrep "NO_HZ|PREEMPT" /boot/config-2.6.22.1-16-default
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

So CONFIG_PREEMPT_VOLUNTARY=y never presented a problem in the previous kernels.

I have yet to see an Oops in the log files. I have tried flipping to F10 to see if I get a hint of an Oops during boot, but nothing. Most of the time it is while working in KDE that the system just locks up.

I will try to build and boot a 2.6.22.3-default kernel with CONFIG_PREEMPT_NONE=y later tonight and see if the problem disappears.
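The egrep comparison above can be automated when several config files need checking. This is a minimal sketch only; same_preempt_config is a hypothetical helper name, and the option pattern is the one used in the message.

```shell
#!/bin/sh
# Sketch: report whether two kernel config files agree on their NO_HZ and
# PREEMPT options. same_preempt_config is a hypothetical helper name.
same_preempt_config() {
    # $1, $2: paths to config files, e.g. /boot/config-<version>
    a=$(grep -E 'NO_HZ|PREEMPT' "$1" | sort)
    b=$(grep -E 'NO_HZ|PREEMPT' "$2" | sort)
    if [ "$a" = "$b" ]; then
        echo "identical"
    else
        echo "different"
        # Show both option sets so the difference can be read off directly.
        printf '%s:\n%s\n' "$1" "$a"
        printf '%s:\n%s\n' "$2" "$b"
    fi
}
```

For example, same_preempt_config /boot/config-2.6.22.3-5-default /boot/config-2.6.22.3-20070818105014-vanilla would be expected to print "identical" for the two files listed above, since their matching lines are the same.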
On Tuesday 21 August 2007 22:41:45 Warren Stockton wrote:
On Tuesday 21 August 2007 13:57:33 Jan Engelhardt wrote:
Trying without NO_HZ, without VOLUNTARY_PREEMPT, well, and perhaps trying to catch a kernel oops, should there be one on the console.

# egrep "NO_HZ|PREEMPT" /boot/config-2.6.22.3-*
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT_NONE is not set
/boot/config-2.6.22.3-20070818105014-vanilla:CONFIG_PREEMPT_VOLUNTARY=y
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT is not set
/boot/config-2.6.22.3-20070818105014-vanilla:# CONFIG_PREEMPT_BKL is not set
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT_NONE is not set
/boot/config-2.6.22.3-5-default:CONFIG_PREEMPT_VOLUNTARY=y
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT is not set
/boot/config-2.6.22.3-5-default:# CONFIG_PREEMPT_BKL is not set
NO_HZ is not present in either of the kernels
It is currently an i386-only feature.

-Joachim
I have yet to see an Oops in the log files. I have tried flipping to F10 to see if I get a hint of an Oops during the boot but nothing. Most of the time it is while working in KDE, the system just locks up.
If you have a second box around, you could configure netconsole (see /usr/src/linux/Documentation/netconsole*) and see if that catches an oops. Or use a serial console.

-Andi
On Wednesday 22 August 2007 05:45:03 Andi Kleen wrote:
> When you have a second box around you could configure netconsole (see /usr/src/linux/Documentation/netconsole*) and see if that catches an oops. Or use a serial console.

I will try netconsole. A serial console is not an option since I have no serial port on this laptop. Please correct me if I am wrong, but even with a USB/serial adapter, there is too much USB stack in the way to configure a USB serial console.

Last night I installed kernel-default-2.6.22.3-7 from Factory (as well as all the updated packages). This kernel ran for about 30 minutes before a lockup. I then added "irqpoll" and tried again. This time the -default kernel ran for 5.5 hours until I shut it down.

This morning I tried again without irqpoll and once again had a lockup after about 30 minutes. I am currently running with irqpoll again and will see if it runs all day without a lockup.

Assuming irqpoll works around the issue, why does a -vanilla kernel work without irqpoll while -default needs it? (I have already considered ndiswrapper, and I removed the kmp when I found that -vanilla worked.)

I will try to get a list of the additional drivers that get loaded when running the -default vs. the -vanilla kernel.
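When alternating between boots with and without irqpoll, it helps to confirm which parameters the running kernel was actually started with. The sketch below assumes the command line is passed in as a string (on a live system it can be read from /proc/cmdline); has_boot_param is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if a kernel command line contains the given parameter,
# either bare (irqpoll) or with a value (nmi_watchdog=2).
# has_boot_param is a hypothetical helper name.
has_boot_param() (
    # $1: parameter name, $2: kernel command line as one string
    set -f                      # no filename expansion while splitting words
    for word in $2; do
        case "$word" in
            "$1" | "$1"=*) exit 0 ;;
        esac
    done
    exit 1
)

# Example on a running system:
#   has_boot_param irqpoll "$(cat /proc/cmdline)" && echo "booted with irqpoll"
```

The function body runs in a subshell, so the "set -f" and the exit calls do not affect the calling shell.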
On Wed, Aug 22, 2007 at 09:08:40AM -0600, Warren Stockton wrote:
On Wednesday 22 August 2007 05:45:03 Andi Kleen wrote:
When you have a second box around you could configure netconsole (see /usr/src/linux/Documentation/netconsole*) and see if that catches an oops. Or use a serial console. I will try a netconsole. A serial console is not an option since I have no serial port on this laptop. Please correct me if I am wrong, but even with a USB/serial adapter, there is too much USB stack in the way to configure a USB serial console.
That's correct. There is an EHCI debug port for USB, but the drivers for it are not currently included and it needs a hard-to-get and expensive cable. firescope (console over FireWire) can also be used as a cheap alternative, but let's try netconsole first.
Last night I installed kernel-default-2.6.22.3-7 from Factory (as well as all the updated packages.) This kernel ran about 30 minutes before a lockup. I then added "irqpoll" and tried again. This time the -default kernel ran for 5.5 hours until I shut it down.
Ok that narrows it down somewhat.
This morning I tried again without irqpoll and once again had a lockup after about 30 minutes. I am currently running with irqpoll again and will see if it runs all day without a lockup.
Assuming irqpoll works around the issue, why does a -vanilla kernel work without irqpoll but -default needs irqpoll? (I have already considered
Hmm, that would point to one of the patches included in the SUSE kernel being to blame.

Possible culprits are libata and ACPI, and possibly ALSA, I would say.

One way to track it down would be to build custom kernels with these patch blocks removed and see which one helps. That would be somewhat time-consuming, though.

Having some kind of console output would narrow it down.

-Andi
On Wednesday 22 August 2007 10:42:11 Andi Kleen wrote:
> Hmm, that would point to that one of the patches included in the
> suse kernel is to blame.
>
> Possible culprits are libata and ACPI and possibly alsa I would say.

I don't know if the following is of much help, but when using kernel-default-2.6.22.3-7 w/ irqpoll, I did notice that ACPI events are not working perfectly...

- Power button does not initiate suspend-to-disk
- suspend-to-disk from the kpowersave context menu did work, but on resume I had a popup stating suspend-to-disk failed instead of the screenlock...
- Lid switch still does the screenlock
...

> One way to track it down would be to build custom kernels
> with these patch blocks removed and see which one helps.
> Would be somewhat time consuming though

I am in the process of downloading kernel-source-2.6.22.3-7.src.rpm ... Which patch would you like me to try disabling first?

> Having some kind of console output would narrow it down.

No luck here... I used an "insmod .../netconsole.ko netconsole=... " to enable netconsole after the system booted. The menu.lst entry did not work, and then it dawned on me that I was dealing with a module. I used "echo '?' > /proc/sysrq-trigger" to make sure it was working...

About 3 minutes after the insmod, the "Disabling IRQ" message showed up. The system was still working fine and I was starting to wonder if netconsole was also doing some irq polling that was masking the problem. About 40 minutes after the "Disabling IRQ" message, the system locked up with no additional netconsole output. I waited another 10 minutes hoping nmi_watchdog would kick loose, but no luck. A sysrq key-in is of no use because pressing the caps lock key won't even toggle the CAPS LED.

# cat netconsole.log
nohup: ignoring input
SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
Disabling IRQ #7

The IRQ #7 corresponds to sdhci.

# cat /proc/interrupts | grep 7:
  7:      23858     393125   IO-APIC-fasteoi   sdhci:slot0

I have not had any troubles using the SD card reader on previous kernels (beta1 or earlier), and right now, using kernel-default-2.6.22.3-7 w/ irqpoll, the SD card reader is working fine.
On Wed, Aug 22, 2007 at 01:16:17PM -0600, Warren Stockton wrote:
One way to track it down would be to build custom kernels with these patch blocks removed and see which one helps. Would be somewhat time consuming though I am in the process of downloading kernel-source-2.6.22.3-7.src.rpm ...
I built you a couple of test RPMs (see other mail).
The IRQ #7 corresponds to sdhci.

# cat /proc/interrupts | grep 7:
  7:      23858     393125   IO-APIC-fasteoi   sdhci:slot0
I have not had any troubles using the SD card reader on previous kernels (beta1 or earlier) and right now using kernel-default-2.6.22.3-7 w/ irqpoll, the SD card reader is working fine.
That's driven by libata, right?

-Andi
On Wednesday 22 August 2007 14:36:03 Andi Kleen wrote:
> I built you a couple of test rpms (see other mail) to test.

I have downloaded them and am just about to test them in order 0..4. I will try to use a boot.local to insmod netconsole. (Pity /etc/init.d/boot.local is not valid.)

> The IRQ #7 corresponds to sdhci.
> # cat /proc/interrupts | grep 7:
>   7:      23858     393125   IO-APIC-fasteoi   sdhci:slot0
I have not had any troubles using the SD card reader on previous kernels (beta1 or earlier) and right now using kernel-default-2.6.22.3-7 w/ irqpoll, the SD card reader is working fine.
> That's driven by libata right?

I don't believe that's the case.

# lsmod | grep sdhci
sdhci                  34828  0
mmc_core               46856  2 mmc_block,sdhci
For testing I built you a couple of kernel RPMs with various groups of patches disabled.

ftp://ftp.suse.com/pub/people/ak/test2/0 ... 4
[sync still running, might be up in a few minutes]

0 is the kernel with all patches for a control run; 1-4 are without ACPI, libata, ALSA, and ieee1394, respectively.

Can you please test which kernel doesn't show the problem? The netconsole output would also still be useful.

Thanks,
-Andi
On Wednesday 22 August 2007 14:00:39 Andi Kleen wrote:
For testing I built you a couple of kernel RPMS with various groups of patches disabled.
ftp://ftp.suse.com/pub/people/ak/test2/0 ... 4 [sync still running, might be up in a few minutes)
0 is the kernel with all patches for a control run, 1-4 are without ACPI, libata, ALSA, ieee1394 respectively.
> Can you please test which kernel doesn't show the problem?

63df85b3deb2e413cd22d761f0e29dc2  0/kernel-default-2.6.22.4-20070822145631.x86_64.rpm
locked up after 7 minutes

8001b2044c30759eaa1cd0f156bf179b  1/kernel-default-2.6.22.4-7.x86_64.rpm
locked up after 14 minutes

3c2cc20694224819c32fb007af3fad83  2/kernel-default-2.6.22.4-7.x86_64.rpm
locked up after 15 minutes

d758e90c49b033b79fa6a9a7eb479d7c  3/kernel-default-2.6.22.4-7.x86_64.rpm
locked up after 19 minutes

827eb27d1529ba70969ae63dbc91fd98  4/kernel-default-2.6.22.4-7.x86_64.rpm
locked up after 26 minutes

I don't think there is much value in how long each one ran before the lockup occurred. I have seen the lockups occur while executing boot scripts (which also happened with the '3 - no ALSA patch' kernel).
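Since four of the five test packages share the same file name, checking each download against its MD5 sum is a cheap way to be sure which build was installed. A minimal sketch, assuming md5sum from coreutils is available; verify_md5 is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only if a file's MD5 digest matches the expected value.
# verify_md5 is a hypothetical helper name; md5sum comes from coreutils.
verify_md5() {
    # $1: expected MD5 hex digest, $2: path to the file
    actual=$(md5sum "$2" | awk '{print $1}')
    [ "$actual" = "$1" ]
}

# Example, using a sum and path from the list above:
#   verify_md5 8001b2044c30759eaa1cd0f156bf179b \
#       1/kernel-default-2.6.22.4-7.x86_64.rpm && echo "package 1 OK"
```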
> The netconsole output would be also still useful.

I did the "insmod ...netconsole..." from /etc/init.d/boot.local. (I wasted some time until I realized it did not need an insserv.) The only netconsole output captured for each of these kernels was the netconsole "device eth0 not up yet" and "Disabling IRQ #7" as below:

netconsole: local port 4444
netconsole: local IP 192.168....
netconsole: interface eth0
netconsole: remote port 9353
netconsole: remote IP 192.168....
netconsole: remote ethernet address 00:03:47:23:e3:df
netconsole: device eth0 not up yet, forcing it
netconsole: carrier detect appears untrustworthy, waiting 4 seconds
netconsole: network logging started
Disabling IRQ #7
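The settings echoed in the log above map onto the module's netconsole= parameter, whose documented layout is [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The sketch below only assembles that string; netconsole_param is a hypothetical helper name, and the IP addresses in the example are placeholders, because the real ones are elided in the thread.

```shell
#!/bin/sh
# Sketch: build a netconsole= module parameter from its six components.
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    # $1 src-port  $2 src-ip  $3 device  $4 tgt-port  $5 tgt-ip  $6 tgt-mac
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example: ports, interface and MAC are taken from the log above; the two
# 192.0.2.x addresses are placeholders, not the poster's real addresses.
#   netconsole_param 4444 192.0.2.10 eth0 9353 192.0.2.20 00:03:47:23:e3:df
```

The resulting string is what would follow the module path on the insmod command line.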
The "Disabling IRQ #7" message always took a couple of minutes to appear and always preceded the lockup by several minutes. I also noticed that the "Disabling IRQ" message never occurs when I boot a -default kernel with irqpoll.

While these kernels were running, I looked at /proc/interrupts a couple of times and noted that there were always more IRQ 7 interrupts than for any other device, which is surprising since this device is not being actively used. (There is not even an SD card in the reader, and there are nearly always as many interrupts as for the timer.) This pattern appeared no matter which of the above kernels was running.

# cat /proc/interrupts
           CPU0       CPU1
  0:     631500      42345    XT-PIC-XT        timer
  1:         45        348   IO-APIC-edge      i8042
  5:          0          2   IO-APIC-fasteoi   ohci1394
  7:      41585     628689   IO-APIC-fasteoi   sdhci:slot0
  8:          0          1   IO-APIC-edge      rtc
  9:         84       1112   IO-APIC-fasteoi   acpi
 12:       7624       2493   IO-APIC-edge      i8042
 14:          1         24   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
 20:         58     272987   IO-APIC-fasteoi   eth0
 21:        728        866   IO-APIC-fasteoi   HDA Intel
 22:          0          0   IO-APIC-fasteoi   ohci_hcd:usb1, ehci_hcd:usb2
 23:      39853      11734   IO-APIC-fasteoi   sata_nv
NMI:        838        734
LOC:     673356     673614
ERR:          0

I will blacklist sdhci and then see if a -default kernel will run without irqpoll.
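Summing the per-CPU columns makes it easier to compare interrupt sources in a listing like the one above. This is a rough sketch, assuming the usual /proc/interrupts layout with a CPU header line; irq_totals is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print "total irq name" for every numbered IRQ, busiest first.
# irq_totals is a hypothetical helper name.
irq_totals() {
    # $1: file in /proc/interrupts format (defaults to /proc/interrupts)
    awk '
        NR == 1 { ncpu = NF; next }        # header line: CPU0 CPU1 ...
        $1 ~ /^[0-9]+:$/ {                 # numbered IRQs only, skips NMI/LOC/ERR
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            irq = $1
            sub(/:$/, "", irq)
            print total, irq, $NF          # $NF is only the last word of the name
        }
    ' "${1:-/proc/interrupts}" | sort -rn
}
```

Applied to the figures above, the timer line sums to 631500 + 42345 = 673845 and the sdhci line to 41585 + 628689 = 670274, which matches the observation that the idle card reader generates nearly as many interrupts as the timer.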
On Wednesday 22 August 2007 19:34:40 Warren Stockton wrote:
I will blacklist sdhci and then see if a -default kernel will run without irqpoll.
Blacklisting sdhci eliminates the 40%-90% of time that one of the cores spends in hard interrupts (as per the 3-second snapshots in top). This increases %idle and decreases %hi with both kernel-vanilla-2.6.22.3-7 and kernel-default-2.6.22.3-7, but kernel-default-2.6.22.3-7 will still lock up even when irqpoll is used.

Apart from missing support for AppArmor, splash screens, etc., kernel-vanilla-2.6.22.3-7 is working just fine without having to specify any extra boot parameters.
without VOLUNTARY_PREEMPT,
VOLUNTARY_PREEMPT is quite unlikely to cause problems. In fact, it only preempts when the kernel could be preempted anyway.

-Andi
participants (5)
- Andi Kleen
- Bernhard Walle
- Jan Engelhardt
- Joachim Deguara
- Warren Stockton