[Bug 647029] New: frequent oops at boot kernel 2.6.34.7-0.3
https://bugzilla.novell.com/show_bug.cgi?id=647029
https://bugzilla.novell.com/show_bug.cgi?id=647029#c0

           Summary: frequent oops at boot kernel 2.6.34.7-0.3
    Classification: openSUSE
           Product: openSUSE 11.3
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.3
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: richard.coe@med.ge.com
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=395206)
 --> (http://bugzilla.novell.com/attachment.cgi?id=395206)
z400 failure #1

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.10) Gecko/20100914 SUSE/3.6.10-0.3.1 Firefox/3.6.10 GTB7.1

I have two failing systems running openSUSE 11.3 with kernel update 2.6.34.7-0.3-desktop. I would appreciate any tips or suggestions for finding and eliminating this failure.

About 1 out of every 10 or so reboots, I get an OOPS or BUG 12 seconds into the boot. Both systems have completed over 100 hours of stress testing. I ran memtest86+ 4.00 and 4.10 and found no errors.

Initially, the text mode console didn't allow us to view the complete bug traceback. Enabling fbcon just made the lower half of the screen unreadable when the oops occurred. After enabling the tty console, we noticed this message just prior to the failure:

[ 8.987388] PM: Starting manual resume from disk

I disabled the resume= parameter, changing it to noresume, with no effect on the frequency of the issue.

Here is a summary of the BUG/OOPS for each system. I will attach the full boot messages for each failure.
z400/inst.1:[ 12.452040] general protection fault: 0000 [#1] PREEMPT SMP
z400/inst.2:[ 16.595201] BUG: Bad page map in process udevd pte:894c0000039cc581 pmd:1bbe79067
z400/inst.3:[ 11.972733] Oops: 0002 [#1] PREEMPT SMP
z400/inst.3:[ 11.972822] BUG: scheduling while atomic: udevd/456/0x00000002
z400/inst.3:[ 11.972974] BUG: Bad page state in process udevd pfn:1b8de6
z400/inst.4:[ 11.912988] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
z400/inst.4:[ 11.912999] Oops: 0000 [#1] PREEMPT SMP
z600/inst.1:[ 10.225249] kernel BUG at /usr/src/packages/BUILD/kernel-desktop-2.6.34.7/linux-2.6.34/kernel/timer.c:643!
z600/inst.1:[ 10.254149] invalid opcode: 0000 [#1] PREEMPT SMP
z600/inst.2:[ 11.318113] general protection fault: 0000 [#1] PREEMPT SMP
z600/inst.3:[ 10.131683] kernel BUG at /usr/src/packages/BUILD/kernel-desktop-2.6.34.7/linux-2.6.34/kernel/timer.c:643!
z600/inst.3:[ 10.160581] invalid opcode: 0000 [#1] PREEMPT SMP
z600/inst.4:[ 11.332447] general protection fault: 0000 [#1] PREEMPT SMP
z600/inst.5:[ 11.262224] general protection fault: 0000 [#1] PREEMPT SMP

Reproducible: Sometimes

Steps to Reproduce:
1. reboot.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=647029#c1
--- Comment #1 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c2
--- Comment #2 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c3
--- Comment #3 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c4
--- Comment #4 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c5
--- Comment #5 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c6
--- Comment #6 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c7
--- Comment #7 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c8
--- Comment #8 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c9
--- Comment #9 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c10
--- Comment #10 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c11
Jiri Slaby
> We suspect that this is due to some timing problem rather than a code issue.

Hmm, it looks like a nice example of a heisenbug. Somebody heavily overwrites some memory. I don't see anything suspicious in the diff.

Is it reproducible with kernel-vanilla-2.6.34.7? If yes, could you bisect it?
https://bugzilla.novell.com/show_bug.cgi?id=647029#c12
--- Comment #12 from Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=647029#c13
--- Comment #13 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c14
--- Comment #14 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c15
Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c16
--- Comment #16 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c17
--- Comment #17 from Rich Coe
recorded panic count intervals
z400   = 63, 70, 55, 151, 219, 291, 42 (C2 W3520 2.66GHz, 3GB/6GB, SATA, BIOS 3.07)
z600   = 41, 36, 15, 48, 2, 91, 6 (B3 1xX5550 2.66GHz, 3GB/6GB, SATA, BIOS 3.10)
z400-2 = 6018 reboots AOK (B3 W3520 2.66GHz, 4GB, SATA, BIOS 3.07)
z800   = 785 reboots AOK (B3, 2xE5530 2.40GHz, 12GB, SAS, BIOS 3.07)
Chris has also generated boot panics or hangs on a Sun Ultra 27 (Nehalem B3), an Intel Westmere server (C2), and 8 VirtualBox VMs on the Z800.
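As a rough reading of the interval list above, the following sketch computes the mean number of reboots between panics for each failing machine. It assumes that each number is the count of reboots between two consecutive panics (the report does not spell this out) and, for the last function, that every boot fails independently at a fixed rate. The function names are illustrative only.

```python
from statistics import mean

# Reboots between consecutive panics, copied from the list above.
intervals = {
    "z400": [63, 70, 55, 151, 219, 291, 42],
    "z600": [41, 36, 15, 48, 2, 91, 6],
}


def summarize(counts):
    """Return (total reboots, mean reboots per panic, per-boot panic rate)."""
    total = sum(counts)
    return total, mean(counts), len(counts) / total


def p_all_clean(rate, boots):
    """Chance of `boots` clean boots in a row, ASSUMING independent boots."""
    return (1.0 - rate) ** boots


for host, counts in intervals.items():
    total, avg, rate = summarize(counts)
    print(f"{host}: {len(counts)} panics in {total} reboots, "
          f"mean interval {avg:.1f}, about {rate:.2%} of boots")

# How plausible would 6018 clean reboots (z400-2) be at the z400 rate?
z400_rate = summarize(intervals["z400"])[2]
print(f"P(6018 clean boots at z400 rate) = {p_all_clean(z400_rate, 6018):.1e}")
```

Under those assumptions z400 shows 7 panics in 891 reboots and z600 shows 7 in 239, and a run of 6018 clean reboots would be extremely unlikely at the z400 rate, which suggests that z400-2 really behaves differently rather than just being lucky.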
https://bugzilla.novell.com/show_bug.cgi?id=647029#c18
--- Comment #18 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c19
--- Comment #19 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c20
--- Comment #20 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c21
--- Comment #21 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c22
--- Comment #22 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c23
--- Comment #23 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c24
--- Comment #24 from Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=647029#c25
--- Comment #25 from Rich Coe
> CONFIG_ACPI_AC=y
> CONFIG_ACPI_BATTERY=y
> CONFIG_ACPI_BUTTON=y
> CONFIG_ACPI_FAN=y
> CONFIG_ACPI_PROCESSOR=y
> CONFIG_ACPI_THERMAL=y
473d480
< CONFIG_ACPI_CUSTOM_DSDT_FILE=""
< CONFIG_ACPI_CUSTOM_OVERRIDE_INITRAMFS=y
477,479c483,484
< CONFIG_ACPI_DEBUG=y
< # CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
< CONFIG_ACPI_PCI_SLOT=m
---
> # CONFIG_ACPI_DEBUG is not set
> CONFIG_ACPI_PCI_SLOT=y
481c486
< CONFIG_ACPI_CONTAINER=m
---
> CONFIG_ACPI_CONTAINER=y
496,497c501,502
< CONFIG_CPU_FREQ_TABLE=y
< # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
< CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
< CONFIG_CPU_FREQ_GOV_USERSPACE=m
< CONFIG_CPU_FREQ_GOV_ONDEMAND=y
---
> CONFIG_CPU_FREQ_TABLE=m
> CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
> CONFIG_CPU_FREQ_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_ONDEMAND=m
508c513
< # CONFIG_X86_PCC_CPUFREQ is not set
---
> CONFIG_X86_PCC_CPUFREQ=m
512c517
< # CONFIG_X86_P4_CLOCKMOD is not set
---
> CONFIG_X86_P4_CLOCKMOD=m
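A comparison like the one above can also be done by option name instead of by line position, which keeps the result stable when the two config files are ordered differently. The following is a minimal sketch, not the method used in this report; `parse_config` and `diff_configs` are hypothetical helper names, and the sample values are taken from the diff above.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ option to its value; 'n' stands for 'is not set'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(left, right):
    """Return {option: (left, right)} for options whose values differ.

    None means the option does not appear in that file at all.
    """
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n))
            for n in names if left.get(n) != right.get(n)}


suse = parse_config("""\
CONFIG_ACPI_DEBUG=y
CONFIG_ACPI_PCI_SLOT=m
CONFIG_ACPI_CONTAINER=m
# CONFIG_X86_P4_CLOCKMOD is not set
""")
working = parse_config("""\
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_ACPI_CONTAINER=y
CONFIG_X86_P4_CLOCKMOD=m
""")

for name, (old, new) in diff_configs(suse, working).items():
    print(f"{name}: {old} -> {new}")
```

In real use the two strings would be read from the config files of the two kernels being compared.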
https://bugzilla.novell.com/show_bug.cgi?id=647029#c26
Jiri Slaby
> I've bisected the kernel params down to these few: Left is original suse, Right is the working version.

Thanks for doing that. Could you also attach lsmod output from both kernels, so we can rule out the modules which are not loaded at all?
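The filtering step requested here can be sketched as follows: take the candidate modules from the config comparison and keep only those that appear in the lsmod output. The candidate module names and the lsmod sample below are placeholders for illustration (the sizes are made up and are not taken from the affected systems), and the mapping from config option to module name is an assumption.

```python
def loaded_modules(lsmod_text):
    """Return the set of module names in `lsmod` output (header skipped)."""
    names = set()
    for line in lsmod_text.splitlines()[1:]:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def still_suspect(candidates, lsmod_text):
    """Keep only the candidate modules that are actually loaded."""
    loaded = loaded_modules(lsmod_text)
    # lsmod prints module names with underscores, so normalize dashes.
    return [m for m in candidates if m.replace("-", "_") in loaded]


# Placeholder input: NOT real output from the z400/z600 machines.
sample_lsmod = """\
Module                  Size  Used by
container               1000  0
button                  2000  0
"""

# Assumed module names for the options built as modules (=m) on one side.
candidates = ["container", "pci_slot", "cpufreq_userspace"]
print(still_suspect(candidates, sample_lsmod))
```

Any candidate that drops out of the list is never loaded on that kernel and so can be excluded from further bisection.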
https://bugzilla.novell.com/show_bug.cgi?id=647029#c27
Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c28
--- Comment #28 from Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=647029#c29
--- Comment #29 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=647029#c
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=647029#c30
--- Comment #30 from Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c31
Jiri Slaby
> I am able to create a gpf with this simple script:

The question is whether loading those modules merely exposes memory that was already overwritten, or whether they overwrite the memory themselves. If you blacklist/remove those modules and run the reboot test, does it still appear? Or what happens if you try the loop only with a single module? And with totally different modules?
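The script quoted above is not included in this thread. As a stand-in, here is a hypothetical sketch of a load/unload loop that can be restricted to a single module or pointed at a different set, as suggested. The function name, the module list, and the round count are assumptions; the `run` parameter makes it possible to do a dry run without root.

```python
import subprocess


def cycle_modules(modules, rounds, run=subprocess.check_call):
    """Load and unload each module `rounds` times.

    Returns the number of commands issued. Pass a different `run`
    callable to log the commands instead of executing them.
    """
    issued = 0
    for _ in range(rounds):
        for mod in modules:
            run(["modprobe", mod])        # load the module
            run(["modprobe", "-r", mod])  # unload it again
            issued += 2
    return issued


# Dry run: print the commands instead of executing them.
cycle_modules(["container"], rounds=2, run=lambda cmd: print(" ".join(cmd)))
```

Running it for real needs root and the default `run`; changing the `modules` list gives the "single module" and "totally different modules" variants of the test.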
https://bugzilla.novell.com/show_bug.cgi?id=647029#c32
Rich Coe
https://bugzilla.novell.com/show_bug.cgi?id=647029#c33
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=647029#c34
--- Comment #34 from Thomas Renninger
If you get a hit, it would be interesting whether blacklisting the container driver... it's not that.
https://bugzilla.novell.com/show_bug.cgi?id=647029#c35
--- Comment #35 from Bjorn Helgaas
https://bugzilla.novell.com/show_bug.cgi?id=647029#c36

Thomas Renninger changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|nehalem: memory corruption  |nehalem: memory corruption
                   |2.6.34.7-0.3                |2.6.34.7-0.3 - may be
                   |                            |related to ACPI button.ko
                   |                            |driver

--- Comment #36 from Thomas Renninger 2011-01-12 22:16:07 UTC ---
> Every failure correlates very closely with loading the ACPI button driver

Not every failure. It looks as if we have two issues. I do not have the button.ko issue. In fact, button.ko is not even loaded yet on my system. The same goes for the segfaults shown in comment #6 and comment #9 (search for "button" in there and you won't get a hit).

Unfortunately Rich's backtraces are somewhat cut off, but they are very similar to the one I posted in comment #33:
- bad process: "comm: stapio"
- "Bug on" triggered in kernel/timer.c (line 681 for me on 2.6.37, line 643 for Rich on 2.6.34)

I expect my issue is related to preloadtrace.ko. I'll open another bug and will point to it.
https://bugzilla.novell.com/show_bug.cgi?id=647029#c37
--- Comment #37 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=647029#c38
--- Comment #38 from Youquan Song
https://bugzilla.novell.com/show_bug.cgi?id=647029#c39
--- Comment #39 from Bjorn Helgaas
> So my guess is that bug 664105 is the main problem, but there seems to be something else going wrong too.

I fear it's not bug 664105. This one took care of a NULL pointer dereference caused by a race in kernel code produced by SystemTap (at least that is what the changelog of the patch I added said, which perfectly hit my and some of
https://bugzilla.novell.com/show_bug.cgi?id=647029#c40
Thomas Renninger