[Bug 890702] New: pae kernel doesn't boot - appears to cause NMI when unpacking initramfs
https://bugzilla.novell.com/show_bug.cgi?id=890702
https://bugzilla.novell.com/show_bug.cgi?id=890702#c0

Summary: pae kernel doesn't boot - appears to cause NMI when unpacking initramfs
Classification: openSUSE
Product: openSUSE 13.1
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: per@computer.org
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:11.0) Gecko/20100101 Firefox/11.0

Hardware: Proliant DL580G2, HT, 4 CPUs, 32bit, 12Gb RAM. ("hamburg")
Software: openSUSE 13.1+updates
Process: PXE+ssh install from download.opensuse.org

When trying to install, the initial boot kept failing. I hooked up a serial console, which showed the system generated an NMI apparently when unpacking initramfs. I then tried an installation with the -default kernel which worked fine.

Reproducible: Always

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c1 Michal Hocko <mhocko@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mhocko@suse.com --- Comment #1 from Michal Hocko <mhocko@suse.com> 2014-08-07 09:42:38 CEST --- Do you have a full serial log?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c2 --- Comment #2 from Per Jessen <per@computer.org> 2014-08-07 07:50:42 UTC --- Created an attachment (id=601441) --> (http://bugzilla.novell.com/attachment.cgi?id=601441) serial log capture This is from a boot-up with maxcpus=0.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c3 --- Comment #3 from Michal Hocko <mhocko@suse.com> 2014-08-07 10:17:52 CEST ---
[ 0.000000] Kernel command line: BOOT_IMAGE=openSUSE root=/dev/disk/by-id/cciss-3600508b100184155435050384d320013-part1 noresume maxcpus=0 console=ttyS0,115200,8n1
[...]
[ 1.459732] Failed to execute /init
[ 1.463297] Kernel panic - not syncing: No init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.
I do not see initrd in the command line. Are you sure you don't need it? Also have you tried to follow recommendations from Documentation/init.txt (in the kernel source tree)?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c4 --- Comment #4 from Per Jessen <per@computer.org> 2014-08-07 09:18:22 UTC --- (In reply to comment #3)
> [ 0.000000] Kernel command line: BOOT_IMAGE=openSUSE root=/dev/disk/by-id/cciss-3600508b100184155435050384d320013-part1 noresume maxcpus=0 console=ttyS0,115200,8n1
> [...]
> [ 1.459732] Failed to execute /init
> [ 1.463297] Kernel panic - not syncing: No init found. Try passing init= option to kernel. See Linux Documentation/init.txt for guidance.
>
> I do not see initrd in the command line. Are you sure you don't need it?
With the kernel that works, the initrd isn't mentioned in the command line either: Kernel command line: BOOT_IMAGE=openSUSE root=/dev/disk/by-id/cciss-3600508b100184155435050384d320013-part1 noresume Looking at other systems, there is also no initrd mentioned in the command line arguments.
> Also have you tried to follow recommendations from Documentation/init.txt (in the kernel source tree)?
Uh no - when one kernel works, and the other doesn't, it seems quite clear. Is there anything specific in Documentation/init.txt you believe will help?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c5 --- Comment #5 from Michal Hocko <mhocko@suse.com> 2014-08-07 11:44:39 CEST --- (In reply to comment #4) [...]
> Uh no - when one kernel works, and the other doesn't, it seems quite clear.
I do not see why pae should make any difference that early during the boot. So it doesn't sound entirely clear to me.
> Is there anything specific in Documentation/init.txt you believe will help?
At least debug cmd option might tell us more.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c6 --- Comment #6 from Michal Hocko <mhocko@suse.com> 2014-08-07 12:04:10 CEST --- Ohh, wait a second. I have completely overlooked this:
[ 0.952231] Unpacking initramfs...
[ 0.958647] NMI: PCI system error (SERR) for reason b1 on CPU 0.
[ 0.960017] Dazed and confused, but trying to continue
Which means that there was a critical error reported by a PCI device. It seems that the error was fatal because recoverable errors are usually reported by MCE. I am not familiar enough with PCI internals to tell you the details, though. I have no idea what b1 as a reason means. I also have no idea why the issue is seen only with the pae kernel (maybe the memory layout is slightly different), or maybe there is a BIOS bug which results in overlapping areas or a weird PCI setup when PAE is enabled. Will try to look deeper into it.
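On the open question of what "reason b1" means: on x86 Linux the value printed in the SERR message is, as far as I know, the byte read from I/O port 0x61 (the NMI status and control register). The sketch below decodes it under that assumption. The bit names follow the conventional PC/AT layout of that port; neither the layout nor the helper name decode_nmi_reason comes from this bug report, so treat it as an illustration rather than a diagnosis.

```python
from typing import List

# ASSUMPTION: conventional PC/AT layout of I/O port 0x61; not taken from the bug report.
PORT_61_BITS = {
    0x80: "SERR# NMI status (PCI system error)",
    0x40: "IOCHK# NMI status (I/O channel check)",
    0x20: "timer 2 output",
    0x10: "refresh cycle toggle",
    0x08: "IOCHK# NMI disabled",
    0x04: "SERR# NMI disabled",
    0x02: "speaker data enable",
    0x01: "timer 2 gate enable",
}


def decode_nmi_reason(reason: int) -> List[str]:
    """Return the names of all bits set in the reason byte, highest bit first."""
    return [name for mask, name in PORT_61_BITS.items() if reason & mask]


if __name__ == "__main__":
    # 0xb1 is the reason value from the serial log quoted above.
    for name in decode_nmi_reason(0xB1):
        print(name)
```

Read this way, b1 only says that SERR# was asserted and IOCHK# was not; the remaining set bits are timer and refresh state. It does not identify which PCI device raised the error.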
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c7 --- Comment #7 from Per Jessen <per@computer.org> 2014-08-07 11:26:12 UTC --- (In reply to comment #5)
> (In reply to comment #4) [...]
>> Uh no - when one kernel works, and the other doesn't, it seems quite clear.
> I do not see why pae should make any difference that early during the boot. So it doesn't sound entirely clear to me.
I meant it is clearly a kernel issue, not related to the command line and not the initrd.
>> Is there anything specific in Documentation/init.txt you believe will help?
> At least debug cmd option might tell us more.
Will do. Although doesn't that only affect the init processing? (which I never get to).
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c8 --- Comment #8 from Michal Hocko <mhocko@suse.com> 2014-08-07 16:13:23 CEST --- In the initial comment you've said that the -default kernel boots just fine. Was that a 32b -default kernel? I suppose so but wanted to be sure. Also have you tried to boot 64b kernel on that machine? Finally have you ever tried to install different PAE kernels on that machine? E.g. the current upstream vanilla? There is a HEAD repository in build service where you can find it.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c9 --- Comment #9 from Per Jessen <per@computer.org> 2014-08-07 14:58:20 UTC --- (In reply to comment #8)
> In the initial comment you've said that the -default kernel boots just fine. Was that a 32b -default kernel? I suppose so but wanted to be sure.
Sorry, yes, that was the 32bit kernel-default package.
> Also have you tried to boot 64b kernel on that machine?
No, the machine doesn't support 64bit.
> Finally have you ever tried to install different PAE kernels on that machine? E.g. the current upstream vanilla? There is a HEAD repository in build service where you can find it.
I am not certain, but I am pretty certain I have had 12.1 running with -pae on this machine earlier. I'll try out some older kernels and see what happens.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c10 --- Comment #10 from Per Jessen <per@computer.org> 2014-08-07 15:43:33 UTC --- (In reply to comment #9)
> I am not certain, but I am pretty certain I have had 12.1 running with -pae on this machine earlier. I'll try out some older kernels and see what happens.
Have just installed and booted with 3.1.0-1.2-pae, works fine.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c11 --- Comment #11 from Borislav Petkov <bpetkov@suse.com> 2014-08-07 16:27:41 UTC --- Ok, can you please boot both kernels - failing and working - with log_buf_len=16M ignore_loglevel debug initcall_debug bootmem_debug debug_objects early_ioremap_debug on the command line, catch output on serial and upload it? We need to somehow pinpoint in which direction we should be looking at. Thanks.
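Once both serial captures exist, one way to narrow down where the two boots diverge is to strip the kernel timestamps and diff what remains. The following is a minimal sketch of that idea, not a procedure anyone in this thread used; the function names and the sample lines in the usage part are illustrative assumptions, with the message text taken from the log excerpt quoted in comment #6.

```python
import difflib
import re

# Matches a leading printk timestamp such as "[    0.952231] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Drop the leading timestamp and trailing newline from each log line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(working, failing):
    """Return a unified diff (no context lines) of two timestamp-free logs."""
    return list(
        difflib.unified_diff(
            strip_timestamps(working),
            strip_timestamps(failing),
            fromfile="working",
            tofile="failing",
            lineterm="",
            n=0,
        )
    )


if __name__ == "__main__":
    # Hypothetical sample input; real use would read the two captured files.
    working = ["[    0.940000] Unpacking initramfs...\n"]
    failing = [
        "[    0.952231] Unpacking initramfs...\n",
        "[    0.958647] NMI: PCI system error (SERR) for reason b1 on CPU 0.\n",
        "[    0.960017] Dazed and confused, but trying to continue\n",
    ]
    for line in diff_logs(working, failing):
        print(line)
```

With initcall_debug enabled the logs are long and contain per-call durations, so a real comparison would likely need further normalization than timestamps alone.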
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c12 --- Comment #12 from Per Jessen <per@computer.org> 2014-08-07 16:39:07 UTC --- Have now booted with 3.7.10-1.1-pae, also works fine. I'll get the console logs to you later today.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c13 Borislav Petkov <bpetkov@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |NEEDINFO InfoProvider| |per@computer.org --- Comment #13 from Borislav Petkov <bpetkov@suse.com> 2014-08-07 16:44:11 UTC --- Sounds like you didn't get the hw error this time, assuming it is a hw error this NMI reports. How reproducible is this issue?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c14 --- Comment #14 from Per Jessen <per@computer.org> 2014-08-07 16:50:19 UTC --- (In reply to comment #13)
> Sounds like you didn't get the hw error this time, assuming it is a hw error this NMI reports. How reproducible is this issue?
I doubt if the NMI is a hardware error, but that's just my gut feeling. The issue is easily reproducible, except right now when I booted 3.11.10-17-pae with all the debug options you requested - this time it worked, no NMI. I'll upload both console logs in a minute, then try 3.11.10 again without the debug options.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c15 --- Comment #15 from Per Jessen <per@computer.org> 2014-08-07 16:54:41 UTC --- Created an attachment (id=601569) --> (http://bugzilla.novell.com/attachment.cgi?id=601569) console log from booting with 3.7.10-pae and debug options.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c16 --- Comment #16 from Per Jessen <per@computer.org> 2014-08-07 16:55:27 UTC --- Created an attachment (id=601570) --> (http://bugzilla.novell.com/attachment.cgi?id=601570) console log from booting with 3.11.10-pae and debug options.
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c17 Per Jessen <per@computer.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |NEW InfoProvider|per@computer.org | --- Comment #17 from Per Jessen <per@computer.org> 2014-08-07 16:58:02 UTC --- Okay, booting with 3.11.10-pae now appears to have started working. Suspicion - when I installed kernel-default, I noticed it automagically pulled in kernel-firmware. Presumably this means kernel-pae did not. Packaging issue?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c18 --- Comment #18 from Borislav Petkov <bpetkov@suse.com> 2014-08-07 19:02:44 UTC ---
> I doubt if the NMI is a hardware error, but that's just my gut feeling.
The NMI is used to report a hw error.
> I noticed it automagically pulled in kernel-firmware. Presumably this means kernel-pae did not. Packaging issue?
Do you start getting the error again if you forcibly remove kernel-firmware and reboot the PAE kernel?
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c Borislav Petkov <bpetkov@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |NEEDINFO InfoProvider| |per@computer.org
https://bugzilla.novell.com/show_bug.cgi?id=890702 https://bugzilla.novell.com/show_bug.cgi?id=890702#c19 Per Jessen <per@computer.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |CLOSED InfoProvider|per@computer.org | Resolution| |WORKSFORME --- Comment #19 from Per Jessen <per@computer.org> 2014-08-10 08:54:57 UTC --- (In reply to comment #18)
>> I doubt if the NMI is a hardware error, but that's just my gut feeling.
> The NMI is used to report a hw error.
>> I noticed it automagically pulled in kernel-firmware. Presumably this means kernel-pae did not. Packaging issue?
> Do you start getting the error again if you forcibly remove kernel-firmware and reboot the PAE kernel?
Removed kernel-firmware, rebooted, no problem. I mistook kernel-firmware for being the microcode updates, so I also removed ucode-intel, and rebooted. Again no problem. I am unable to reproduce the problem, so I am closing for now.