On Wednesday 22 August 2007 14:00:39 Andi Kleen wrote:
> For testing I built you a couple of kernel RPMS with various groups of patches disabled.
> ftp://ftp.suse.com/pub/people/ak/test2/0 ... 4 (sync still running, might be up in a few minutes)
> 0 is the kernel with all patches for a control run, 1-4 are without ACPI, libata, ALSA, ieee1394 respectively.
> Can you please test which kernel doesn't show the problem?

63df85b3deb2e413cd22d761f0e29dc2  0/kernel-default-2.6.22.4-20070822145631.x86_64.rpm
    locked up after 7 minutes
8001b2044c30759eaa1cd0f156bf179b  1/kernel-default-2.6.22.4-7.x86_64.rpm
    locked up after 14 minutes
3c2cc20694224819c32fb007af3fad83  2/kernel-default-2.6.22.4-7.x86_64.rpm
    locked up after 15 minutes
d758e90c49b033b79fa6a9a7eb479d7c  3/kernel-default-2.6.22.4-7.x86_64.rpm
    locked up after 19 minutes
827eb27d1529ba70969ae63dbc91fd98  4/kernel-default-2.6.22.4-7.x86_64.rpm
    locked up after 26 minutes

I don't think there is much value in how long each one ran before the lockup occurred. I have also seen the lockups occur while the boot scripts were still executing (this happened with the '3 - no ALSA patch' kernel, among others).

> The netconsole output would be also still useful.

I did the "insmod ...netconsole..." from /etc/init.d/boot.local. (Wasted some time until I realized it did not need an insserv.) The only netconsole output captured for each of these kernels was the netconsole "device eth0 not up yet" and "Disabling IRQ #7", as below:

netconsole: local port 4444
netconsole: local IP 192.168....
netconsole: interface eth0
netconsole: remote port 9353
netconsole: remote IP 192.168....
netconsole: remote ethernet address 00:03:47:23:e3:df
netconsole: device eth0 not up yet, forcing it
netconsole: carrier detect appears untrustworthy, waiting 4 seconds
netconsole: network logging started
Disabling IRQ #7

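For anyone wanting to capture the same messages: netconsole sends its log lines as UDP datagrams to the configured remote IP and port, so any UDP listener on the receiving machine will do. Below is a minimal Python sketch of such a listener, not the tool I actually used. The function names (open_listener, read_messages) are made up for illustration, and port 9353 is simply the remote port shown in the output above.

```python
import socket


def open_listener(port, host="0.0.0.0"):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_messages(sock, count, timeout=5.0):
    """Read up to `count` datagrams, stopping early if `timeout` seconds
    pass without any traffic. Returns the decoded messages as a list."""
    sock.settimeout(timeout)
    messages = []
    try:
        for _ in range(count):
            data, _addr = sock.recvfrom(4096)
            messages.append(data.decode("utf-8", errors="replace"))
    except socket.timeout:
        pass  # no more traffic within the timeout; return what we have
    return messages
```

On the receiving machine this would be used roughly as read_messages(open_listener(9353), count=100, timeout=600.0), printing or saving whatever comes back.
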
The "Disabling IRQ #7" message always took a couple of minutes to appear and always preceded the lockup by several minutes. I also noticed that the "Disabling IRQ" message never occurred when I booted a -default kernel with irqpoll.

While these kernels were running, I looked at /proc/interrupts a couple of times and noted that IRQ 7 always had more interrupts than any other device. That is odd, since this device is not being actively used. (There is not even an SD card in the reader, yet there are nearly always as many interrupts as for the timer.) This pattern appeared no matter which of the above kernels was running.

# cat /proc/interrupts
           CPU0       CPU1
  0:     631500      42345    XT-PIC-XT        timer
  1:         45        348   IO-APIC-edge      i8042
  5:          0          2   IO-APIC-fasteoi   ohci1394
  7:      41585     628689   IO-APIC-fasteoi   sdhci:slot0
  8:          0          1   IO-APIC-edge      rtc
  9:         84       1112   IO-APIC-fasteoi   acpi
 12:       7624       2493   IO-APIC-edge      i8042
 14:          1         24   IO-APIC-edge      libata
 15:          0          0   IO-APIC-edge      libata
 20:         58     272987   IO-APIC-fasteoi   eth0
 21:        728        866   IO-APIC-fasteoi   HDA Intel
 22:          0          0   IO-APIC-fasteoi   ohci_hcd:usb1, ehci_hcd:usb2
 23:      39853      11734   IO-APIC-fasteoi   sata_nv
NMI:        838        734
LOC:     673356     673614
ERR:          0

I will blacklist sdhci and then see if a -default kernel will run without irqpoll.
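
The /proc/interrupts comparison above was done by eye. A small Python sketch of the same check sums the per-CPU counts for each IRQ and ranks them. The helper names (parse_interrupts, busiest) are hypothetical, and SAMPLE is just a few lines copied from the listing in this mail; on a live system one would pass the contents of /proc/interrupts instead.

```python
# A few lines taken from the /proc/interrupts listing in this mail.
SAMPLE = """\
           CPU0       CPU1
  0:     631500      42345    XT-PIC-XT        timer
  7:      41585     628689   IO-APIC-fasteoi   sdhci:slot0
 20:         58     272987   IO-APIC-fasteoi   eth0
 23:      39853      11734   IO-APIC-fasteoi   sata_nv
NMI:        838        734
"""


def parse_interrupts(text):
    """Map each row label to (total count over all CPUs, description)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # rows such as ERR may carry fewer counters
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(totals, top=3):
    """Return the `top` numbered IRQs sorted by total count, highest first."""
    numbered = [
        (label, total, desc)
        for label, (total, desc) in totals.items()
        if label.isdigit()  # skip NMI, LOC, ERR and similar rows
    ]
    return sorted(numbered, key=lambda row: row[1], reverse=True)[:top]


for label, total, desc in busiest(parse_interrupts(SAMPLE)):
    print(f"IRQ {label:>3}: {total:>8}  {desc}")
```

With the numbers above, IRQ 7 (sdhci:slot0) totals 670274 against 673845 for the timer, which is the "nearly as many interrupts as the timer" pattern described earlier.
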