[Bug 675161] New: Kernel boot Panic / Crash & NULL pointer dereference
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c0

Summary: Kernel boot Panic / Crash & NULL pointer dereference
Classification: openSUSE
Product: openSUSE 11.4
Version: RC 2
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: graham@andtech.eu
QAContact: qa@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=416329) --> (http://bugzilla.novell.com/attachment.cgi?id=416329)
Point of failure of installation kernel (x86_64 RC2 DVD)

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.21 SUSE/11.0.674.0 (KHTML, like Gecko) Chrome/11.0.674.0 Safari/534.21

The 11.4 RC2 installation kernel crashes at the same point on the two different machines I have tested it on. These machines have the same CPU and mainboard, as detailed in the following smolt profile, but different RAM modules:
http://www.smolts.org/client/show/pub_1a849a4e-cf37-4b09-92a6-304e1f8d9968

Additionally, on one machine I performed a zypper dup to RC1 a couple of days ago, and when I try to boot the 11.4 kernels I experience crashes and hangs at multiple different stages of the boot process. I was able to fully boot to runlevel 3 only once, but the machine hung very quickly after I tried to log in, before login was complete.

I am unable to attach a serial console, but I have taken a series of pictures of the various kernel failures and also saved /var/log/messages from the single instance where I was able to boot to runlevel 3.

I will attach the photo of the installation kernel failure to this initial report. I will add the additional photos of boot failures and the messages log in subsequent comments.

Reproducible: Always

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c1
--- Comment #1 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c2
--- Comment #2 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c3
--- Comment #3 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c4
--- Comment #4 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c5
--- Comment #5 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c6
--- Comment #6 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c7
--- Comment #7 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c8
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c9
Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c10
--- Comment #10 from Jiri Slaby
Using kernel 2.6.37.2-1-default, I checked the initrd and the microcode module is not there. I blacklisted microcode in modprobe.d and also manually moved the microcode.ko file out of the modules tree.
I see the same recurring boot crashes/hangs as per the previous attachments.
Well, so it's some other module or kernel part causing this. It might be worth trying to feed the output through mcelog.
Additionally, when I previously saw the MCE panics, I made sure to boot memtest on both machines; I quit out of memtest after 4 pass cycles.
I don't think it's a HW problem as I saw some of those already. But who knows. Did it happen for example with 11.3 or other distros (if you tried it)?
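For anyone following the mcelog suggestion above once a serial console log is available, here is a minimal Python sketch of one way to triage the capture. It assumes the `mcelog` tool is installed and relies on its `--ascii` mode, which decodes machine-check text read from standard input. The function names and the keyword pattern are illustrative assumptions, not something taken from this report.

```python
import re
import subprocess

# Assumption: these keywords mark machine-check lines in a captured log.
# Adjust the pattern to whatever the console capture actually contains.
MCE_PATTERN = re.compile(r"machine check|\[hardware error\]", re.IGNORECASE)


def find_mce_lines(log_text):
    """Return (line_number, line) pairs that look like machine-check output."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if MCE_PATTERN.search(line)
    ]


def decode_with_mcelog(log_text):
    """Feed the whole captured log to `mcelog --ascii` and return its output.

    The full text is passed rather than only the matching lines, because
    the decoder may need neighbouring lines that the pattern above misses.
    """
    result = subprocess.run(
        ["mcelog", "--ascii"],
        input=log_text,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

A typical use would be to read the saved console log into a string, call `find_mce_lines` to confirm it contains a machine check at all, and only then run `decode_with_mcelog` on the same text.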
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c
Jiri Slaby
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c11
Graham Anderson
Well, so it's some other module or kernel part causing this. It might be worth trying to feed the output through mcelog.
I will attempt to do this; I'm currently waiting for an RS-232 cable to arrive so I can boot with a serial console. Curiously, I recall seeing "failed" messages beside mcelog when trying to boot with the 2.6.37-desktop and 2.6.37.1-desktop kernels in RC1/2.
Did it happen for example with 11.3 or other distros (if you tried it)?
I didn't have any issues with the 2.6.34 kernels from 11.3. I've not had time or cause to install another distro, but will maybe try to find time. I'm not sure which other recently released distros would be using 2.6.37.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c12
--- Comment #12 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c13
--- Comment #13 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c14
--- Comment #14 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c15
--- Comment #15 from Jiri Slaby
I have the following results for 2.6.37.1
pass only: intel_idle.max_cstate=0 result: System boots normally and behaves as expected with no further issues
Ok, so revert of:
commit 0f212b87548cc4598fb7c77d92bfef23d5ee4d1a
Author: Shaohua Li
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c16
Jiri Slaby
build a kernel with those reverted to test.
Could you test: http://labs.suse.cz/jslaby/bug-675161/ ?
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c17
Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c18
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c19
--- Comment #19 from Thomas Renninger
To get the initial value of lapic_timer_reliable_states on your machine, you can boot with these params: Best to attach the whole dmesg from a boot with these params.
Do you remember which kernel was the last one working?
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c20
--- Comment #20 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c21
--- Comment #21 from Len Brown
Point of failure of installation kernel (x86_64 RC2 DVD)
That one sure looks like a hardware error... Hmm, two systems failed with the same Fatal Machine Check? There are three, apparently independent, crashes in this bug report...
intel_idle.max_cstate=0
How about intel_idle.max_cstate=1, intel_idle.max_cstate=2, or intel_idle.max_cstate=3? Please attach the complete dmesg for a boot with the intel_idle driver loaded.
pass only: hpet=disable result: system sometimes boots, but completely locks up soon after
Please attach the complete dmesg from one of the boots that makes it.
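When collecting one dmesg per intel_idle.max_cstate setting as requested above, it helps to record which limit was actually in effect for each boot. The following is a minimal Python sketch that reads it back from the kernel command line. The function names are hypothetical, and the parser deliberately ignores quoted values that contain spaces.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' map to None. Quoted values containing spaces
    are not handled; that is a simplification of this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def max_cstate(cmdline):
    """Return intel_idle.max_cstate as an int, or None if it was not passed."""
    value = kernel_params(cmdline).get("intel_idle.max_cstate")
    return int(value) if value else None


def current_max_cstate(path="/proc/cmdline"):
    """Report the limit in effect for the running kernel."""
    return max_cstate(Path(path).read_text())
```

Calling `current_max_cstate()` right before saving dmesg gives a value that can go into the attachment name, so the logs for limits 0 through 3 are not mixed up.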
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c22
--- Comment #22 from Thomas Renninger
Hmm, two systems failed with the same Fatal Machine Check?
Wrong TLB entries can result in machine check exceptions; I've seen this at least on an AMD machine already. Therefore my current guess is that it may be related to that. It's a guess only, but giving the test described in comment #20 a try would be great for verification.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c23
--- Comment #23 from Graham Anderson
Do you remember which kernel was the last one working?
From 11.3 so I guess 2.6.34.7
I've been out of town, I'll make some time before the weekend to follow your suggestions/kernel build from comment #18 and comment #20
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c24
--- Comment #24 from Jiri Slaby
I've been out of town, I'll make some time before the weekend to follow your suggestions/kernel build from comment #18 and comment #20
Any updates?
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c25
--- Comment #25 from Tim Manchester
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c26
--- Comment #26 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c27
--- Comment #27 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c28
--- Comment #28 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c29
--- Comment #29 from Graham Anderson
Did someone ever try a desktop instead of a default kernel? This could be related to: Bug 672008 - [i915, mtrr] Complete system freeze at start. While there it's a real deadlock, the wrong page states could be related to wrong mtrr settings. And the fact that the -default kernel is used very much sounds like the above bug.
Kernels 2.6.37.6-0.5-desktop and 3.1.0-rc6-2-desktop exhibit the same problems.
If desktop kernel works, default should also work again with a recent kernel from here: ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/x86_64
I am unable to browse/cd anonymously to that location to try a KOTD.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c30
--- Comment #30 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c31
--- Comment #31 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c32
Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c33
Thomas Renninger
boot with: intel_idle.max_cstate=0 OR boot with: intel_idle.max_cstate=1 result: success always, see attached dmesg
boot only: intel_idle.max_cstate=2 or boot with: intel_idle.max_cstate=3 result: panic
With intel_idle.max_cstate=0 the acpi idle driver should get used. If this one works, the reason might be that we should ignore a specific idle state on purpose? Can you attach acpidump and dmidecode and run: cpupower idle-info (with intel_idle.max_cstate=0 boot param set) May need zypper install cpupower, package did not exist on 11.4 yet. Have you also already tried Nvidia's binary graphics driver? If not, it would be great if you could give it a try. Memory accesses which cause MCEs could likely be graphics driver related.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c34
--- Comment #34 from Len Brown
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c35
--- Comment #35 from Graham Anderson
Can you attach acpidump and dmidecode and run: cpupower idle-info (with intel_idle.max_cstate=0 boot param set) May need zypper install cpupower, package did not exist on 11.4 yet.
mercury:/ # grep . /sys/devices/system/cpu/cpuidle/*
/sys/devices/system/cpu/cpuidle/current_driver:acpi_idle
/sys/devices/system/cpu/cpuidle/current_governor_ro:menu
mercury:/ # cpupower idle-info
CPUidle driver: acpi_idle
CPUidle governor: menu
Analyzing CPU 0:
CPU 0: No idle states
Have you also already tried Nvidia's binary graphics driver? If not, it would be great if you could give it a try. Memory accesses which cause MCEs could likely be graphics driver related.
With nouveau blacklisted (and removed from initrd) and using the Nvidia blobs, MCE panics still appear in all previously problematic boot configurations, both with kernels 2.6.37.6 for 11.4 and 3.x kernels for 11.4 and 12.1.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c36
--- Comment #36 from Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c37
Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c38
--- Comment #38 from Graham Anderson
boot with: intel_idle.max_cstate=0 OR boot with: intel_idle.max_cstate=1 result: success always, see attached dmesg
boot only: intel_idle.max_cstate=2 or boot with: intel_idle.max_cstate=3 result: panic
With intel_idle.max_cstate=0 the acpi idle driver should get used. If this one works, the reason might be that we should ignore a specific idle state on purpose?
Can you attach acpidump and dmidecode and run: cpupower idle-info (with intel_idle.max_cstate=0 boot param set) May need zypper install cpupower, package did not exist on 11.4 yet.
FYI successful boot with intel_idle.max_cstate=1
cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
Analyzing CPU 0:
Number of idle states: 2
Available idle states: C1-NHM
C1-NHM:
Flags/Description: MWAIT 0x00
Latency: 3
Usage: 107632
Duration: 1466411496
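The cpupower output above can be cross-checked against sysfs directly, which is useful on installs where the cpupower package is not available. Below is a minimal Python sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu. The function names and the returned dictionary shape are illustrative choices, not part of any tool mentioned in this report.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def idle_info(cpu=0, root=CPU_ROOT):
    """Collect the cpuidle driver, governor and per-CPU idle states."""
    root = Path(root)
    info = {
        "driver": read_text(root / "cpuidle" / "current_driver"),
        "governor": read_text(root / "cpuidle" / "current_governor_ro"),
        "states": [],
    }
    state_dir = root / f"cpu{cpu}" / "cpuidle"
    if state_dir.is_dir():
        # Sort numerically so that state10 follows state9.
        states = sorted(state_dir.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
        for state in states:
            info["states"].append({
                "name": read_text(state / "name"),
                "desc": read_text(state / "desc"),
                "latency_us": read_text(state / "latency"),
                "usage": read_text(state / "usage"),
                "time_us": read_text(state / "time"),
            })
    return info
```

An empty `states` list for a CPU corresponds to cpupower reporting "No idle states", since the per-CPU cpuidle directory is then absent or empty. The `root` parameter exists so the function can be pointed at a copied sysfs tree instead of the live system.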
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c39
--- Comment #39 from Len Brown
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c40
--- Comment #40 from Graham Anderson
so with intel_idle.max_cstate=0, there is nothing under /sys/devices/system/cpu/cpu0/cpuidle ?
Yes nothing, with intel_idle.max_cstate=0 cpuidle only appears under /sys/devices/system/cpu/cpuidle

mercury:~ # tree /sys/devices/system/cpu/cpuidle
/sys/devices/system/cpu/cpuidle
├── current_driver
└── current_governor_ro
mercury:~ # tree /sys/devices/system/cpu/cpu0
/sys/devices/system/cpu/cpu0
├── cache
│   ├── index0
│   │   ├── coherency_line_size
│   │   ├── level
│   │   ├── number_of_sets
│   │   ├── physical_line_partition
│   │   ├── shared_cpu_list
│   │   ├── shared_cpu_map
│   │   ├── size
│   │   ├── type
│   │   └── ways_of_associativity
│   ├── index1
│   │   ├── coherency_line_size
│   │   ├── level
│   │   ├── number_of_sets
│   │   ├── physical_line_partition
│   │   ├── shared_cpu_list
│   │   ├── shared_cpu_map
│   │   ├── size
│   │   ├── type
│   │   └── ways_of_associativity
│   ├── index2
│   │   ├── coherency_line_size
│   │   ├── level
│   │   ├── number_of_sets
│   │   ├── physical_line_partition
│   │   ├── shared_cpu_list
│   │   ├── shared_cpu_map
│   │   ├── size
│   │   ├── type
│   │   └── ways_of_associativity
│   └── index3
│       ├── coherency_line_size
│       ├── level
│       ├── number_of_sets
│       ├── physical_line_partition
│       ├── shared_cpu_list
│       ├── shared_cpu_map
│       ├── size
│       ├── type
│       └── ways_of_associativity
├── crash_notes
├── microcode
│   ├── processor_flags
│   ├── reload
│   └── version
├── node0 -> ../../node/node0
├── thermal_throttle
│   └── core_throttle_count
└── topology
    ├── core_id
    ├── core_siblings
    ├── core_siblings_list
    ├── physical_package_id
    ├── thread_siblings
    └── thread_siblings_list

However, if I successfully boot with intel_idle.max_cstate=1 then info _does_ appear under /sys/devices/system/cpu/cpu0/cpuidle
Is the default BIOS SETUP configuration being used?
No, memory profile is manually configured. CPU, QPI, BCLK & PCI bus are all set to Auto. No voltage tweaks.
Are there any BIOS SETUP options for processor power management related to C-states?
The following options are available and are set as follows:
CPU Enhanced Halt(C1E): Auto
C3/C6/C7 State Support: Auto
CPU EIST Function: Auto
Other options in the same section are:
Intel(R) Turbo Boost Tech.: Auto
CPU Cores Enabled: All
CPU Multi-Threading: Enable
Bi-Directional PROCHOT: Enable
I will reset BIOS to default and/or fail-safe and try again.
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c41
Graham Anderson
https://bugzilla.novell.com/show_bug.cgi?id=675161
https://bugzilla.novell.com/show_bug.cgi?id=675161#c42
Graham Anderson