[Bug 413842] New: booting xen hangs the system (cpu IERR)
https://bugzilla.novell.com/show_bug.cgi?id=413842

User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c389944

Summary: booting xen hangs the system (cpu IERR)
Product: openSUSE 11.0
Version: Final
Platform: x86-64
OS/Version: openSUSE 11.0
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Xen
AssignedTo: cgriffin@novell.com
ReportedBy: schueffler@softgarden.de
QAContact: qa@suse.de
Found By: ---

Hi,

after upgrading from 10.3 to 11.0, booting with Xen no longer works on my Dell 1950-III x86_64 server.

I found two bugs that seem related, #389944 and #396236, but they do not fit 100%: they describe problems with a network card or its driver, and I'm not sure that matches my error profile.

There is also a forum post describing my problem:
http://forums.opensuse.org/install-boot-login/390004-opensuse-11-x86_64-xen-...

The hardware is a Dell 1950-III with 16 GB RAM and two quad-core Xeon CPUs. As stated before, Xen worked fine on the same hardware with 10.3 in dom0, and with 10.3 and Windows XP in domUs.

After upgrading dom0 to 11.0, booting the normal kernel is fine, but the system hangs while booting Xen. The front-panel LCD of the Dell starts flashing orange and reports that CPU 2 crashed unrecoverably with IERR code E1420 and/or E1422. Only a cold reboot (unplugging all power cables) recovers from this error state.

I tried several boot parameters for Xen, resulting in different crash flavours:

  mem=1G (seems not to change anything)
  runlevel 1 or 3 or 5 (seems not to change anything)
  noirqbalance (the crash comes some lines of output later than without this parameter)

I do have a serial cable and a second PC, but I do not know exactly which information is relevant or how to obtain it. So any guidance in this matter would be very helpful (in particular: which program should I use on my second PC to gather and save the information? Its OS is openSUSE. Which information is of interest to you?).

Regards, and thank you for your help
Stefan Schueffler

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
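On the question of how to save the serial output on the second PC: a terminal program with logging is the usual answer, but the idea can also be sketched in a few lines of Python. This is only an illustration under assumptions that are not stated in the report: the cable is on `/dev/ttyS0` of the capturing PC, the line runs at 115200 baud 8N1, and the Xen host has already been told to send its console to the serial port. The function names are made up for this sketch.

```python
import os
import termios
import time


def open_serial(path="/dev/ttyS0", baud=termios.B115200):
    """Open a serial device read-only in raw 8N1 mode (path and speed are assumptions)."""
    fd = os.open(path, os.O_RDONLY | os.O_NOCTTY)
    attrs = termios.tcgetattr(fd)
    # attrs layout: [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs[0] = 0                                   # no input translation
    attrs[1] = 0                                   # no output processing
    attrs[2] = termios.CS8 | termios.CREAD | termios.CLOCAL
    attrs[3] = 0                                   # non-canonical mode, no echo
    attrs[4] = baud
    attrs[5] = baud
    attrs[6][termios.VMIN] = 1                     # block until at least one byte arrives
    attrs[6][termios.VTIME] = 0
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return os.fdopen(fd, "rb", buffering=0)


def capture(stream, log, clock=time.time):
    """Copy lines from a binary stream to a text log, prefixing each with a timestamp."""
    count = 0
    for raw in stream:
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"[{clock():.3f}] {line}\n")
        log.flush()                                # keep the log current if capture is interrupted
        count += 1
    return count


# Hypothetical usage on the second PC (stop with Ctrl-C once the host has hung):
#   with open("xen-boot.log", "w") as log:
#       capture(open_serial(), log)
```

The timestamps make it easier to see where the output stops relative to the last driver message, which is the part that is useful to attach to the bug.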
https://bugzilla.novell.com/show_bug.cgi?id=413842
Charles Arnold
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c1
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c2
--- Comment #2 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c3
--- Comment #3 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c4
--- Comment #4 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c5
--- Comment #5 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c6
--- Comment #6 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c7
--- Comment #7 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c8
--- Comment #8 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c9
Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c10
--- Comment #10 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c11
--- Comment #11 from Jan Beulich
(And as expected, I cannot boot my native kernel anymore, since I need the driver to access my hard disks on the RAID.)
Why that? Did you also re-generate the native initrd?
How can I exclude megaraid_sas from being integrated into the initrd?
I'm afraid you can do so only by renaming/moving away the respective module file(s). But as said before, I'd recommend trying to eliminate all post-initrd-loaded modules first, as that's easier to set up and may already provide the necessary information.
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c12
--- Comment #12 from Stefan Schueffler
Why that? Did you also re-generate the native initrd?
Yes, I also regenerated the native initrd, just to test whether I did the rebuild correctly.

After debugging further, I am now able to boot the Xen system when I remove the mpt* drivers from /lib/modules/2.6.25.11-0.1-xen/kernel/drivers/message/fusion/*. Everything works fine except hard-disk access on the external RAID system. Should I upload a dmesg log of a successful Xen boot (without the mpt driver)?

To clarify the system: there are a few internal SATA disks (handled by the Linux SCSI driver) containing the OS. Accessing these disks is fine. Additionally, I use an LSI-20320-IE SCSI-320 PCI Express controller driven by the mpt* driver, with an external RAID system attached to this card, which as expected I can no longer access.

I also tried several combinations of disabling the Broadcom NetXtreme II network cards, the DRAC management card, etc., but it seems that the only way to get the system booting is to disable the mpt driver.

Now, what can I do to further debug the problem with the mpt driver? When booting a native kernel, I can sometimes see this message flash by on the boot screen (the system nevertheless boots fine):

  mptbase: ioc0: LogInfo(0x11010103): F/W: bug! MID not found

Regards
Stefan Schueffler
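The "move the module files away" approach described in comments #11 and #12 can be made repeatable and reversible with a small helper. The following is a minimal sketch, not the procedure anyone in this thread actually used; the function name and the backup location are hypothetical, and it assumes the backup directory lies outside the module tree.

```python
import fnmatch
import os
import shutil


def disable_modules(modules_root, pattern, backup_dir):
    """Move module files matching pattern out of modules_root into backup_dir.

    The relative directory layout is preserved so files can be moved back later.
    Returns a list of (original_path, backup_path) tuples.
    """
    root = os.path.abspath(modules_root)
    backup = os.path.abspath(backup_dir)
    if os.path.commonpath([root, backup]) == root:
        raise ValueError("backup_dir must be outside modules_root")

    moved = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if fnmatch.fnmatch(name, pattern):
                src = os.path.join(dirpath, name)
                dst = os.path.join(backup, os.path.relpath(src, root))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                moved.append((src, dst))
    return moved


# Hypothetical usage (as root), with the kernel version quoted in comment #12:
#   disable_modules("/lib/modules/2.6.25.11-0.1-xen", "mpt*.ko",
#                   "/root/disabled-modules/2.6.25.11-0.1-xen")
```

After moving files, the module dependency data and the initrd would presumably need to be regenerated (for example with `depmod` and `mkinitrd`) before the change takes effect at boot; moving the files back and regenerating again undoes it.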
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c13
--- Comment #13 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c14
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c15
--- Comment #15 from Jan Beulich
I looked up the IERR codes and they mean:
  E1420: BIOS has reported a processor bus PERR (parity error).
  E1422: BIOS has reported a machine check error.
While I realize that this leaves open why the same doesn't happen on native (but of course it's obvious that Xen does things differently from Linux in various places, which hardware is supposed to tolerate), there's nothing Xen can do to prevent these machine checks from happening - they indicate hardware problems.
https://bugzilla.novell.com/show_bug.cgi?id=413842
User mcowley@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c16
Mark Cowley
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c17
--- Comment #17 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c18
--- Comment #18 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c19
--- Comment #19 from Sandeep K. Shandilya
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c20
--- Comment #20 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c21
Sandeep K. Shandilya
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c22
--- Comment #22 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c23
Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c24
--- Comment #24 from Jan Beulich
Further debugging the xen-version of ioremap, the more precise failing code-position is in /usr/src/linux/arch/x86/mm/ioremap-xen.c -> in function ioremap_change_attr() -> failing line: err = set_memory_uc(vaddr, nrpages);
Yes, that was the primary suspect, and provides the hint that it may be worth trying the 11.0 hypervisor on a 10.3 installation (I would expect it to cause the same kind of failure).

Doing the inverse experiment (10.3 hypervisor below 11.0 kernel) would require the kernel config to be changed slightly: CONFIG_XEN_COMPAT_030002_AND_LATER would need to be turned on, CONFIG_XEN_COMPAT_030004_AND_LATER should instead be turned off, and that should result in CONFIG_XEN_COMPAT=0x030002 (this would need to be verified manually - past experience shows that this doesn't always get adjusted as intended, but I didn't get around to checking why).

The question of course is whether you can afford this amount of experimentation.
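Since the resulting CONFIG_XEN_COMPAT value has to be verified by hand, the check can be scripted against the generated `.config`. This is a minimal sketch that only looks at the three options named above; the function names are made up for illustration, and the path to the `.config` file is left to the reader.

```python
def parse_kconfig(text):
    """Parse kernel .config text into a dict; '# CONFIG_X is not set' maps to None."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def check_xen_compat(options, wanted=0x030002):
    """Return a list of problems with the Xen compatibility settings (empty if fine)."""
    problems = []
    if options.get("CONFIG_XEN_COMPAT_030002_AND_LATER") != "y":
        problems.append("CONFIG_XEN_COMPAT_030002_AND_LATER is not enabled")
    if options.get("CONFIG_XEN_COMPAT_030004_AND_LATER") is not None:
        problems.append("CONFIG_XEN_COMPAT_030004_AND_LATER is still enabled")
    raw = options.get("CONFIG_XEN_COMPAT")
    try:
        value = int(raw, 16) if raw is not None else None
    except ValueError:
        value = None
    if value != wanted:
        problems.append(f"CONFIG_XEN_COMPAT is {raw!r}, expected {wanted:#08x}")
    return problems


# Hypothetical usage from the kernel build directory:
#   with open(".config") as f:
#       for problem in check_xen_compat(parse_kconfig(f.read())):
#           print(problem)
```

An empty result means the three options agree with the values described in the comment; anything else points at the option that did not get adjusted as intended.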
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c25
--- Comment #25 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c26
--- Comment #26 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c27
--- Comment #27 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c28
--- Comment #28 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c29
Jan Beulich
Do I own defective Xeon CPUs / mainboard / BIOS, or are Dell's servers not capable of running modern Linux Xen distros? Or is there any chance that it really is a server BIOS bug or a software bug in either the Xen or the kernel implementation of ioremap (and its descendant functions)?
Unfortunately it looks pretty likely at this point that this is a CPU issue, as from all we know so far the problematic mapping is the first ever established for that (bus) address range, and as it's uncachable there ought not to be anything in the caches for it, and hence (as indicated before) clflush should be a no-op.

I'd really like to hear Dell's (Sandeep?) and ideally also Intel's (Marc?) opinion here.

From Dell, I'd like to primarily find out how to get the machine to print out the machine check related MSR values, since the hypervisor doesn't appear to get control through an MCE here.

From Intel, I'd like to find out if there is any similar issue known (I don't seem to be able to spot anything related in the specification updates), or what they would recommend for further analysis.
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c30
--- Comment #30 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c31
--- Comment #31 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c32
--- Comment #32 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c33
--- Comment #33 from Sandeep K. Shandilya
If you have any idea how I can evaluate that, or how I can provide further info to eliminate this buggy behaviour, just give me a hint on what to look for. Do you think I should nevertheless try the new Xen on the old 10.3 and vice versa?
You could also take a look at the SEL logs to see what happened just before the IERR:

  # chkconfig ipmi on
  # ipmitool sel list

This will most likely reveal any hardware errors. You could also display the same by hitting Ctrl-E at boot up -> "system event log menu" and navigating through the entries.
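On a machine with a long event log, the entries around the IERR can be picked out with a short filter. The sketch below assumes that `ipmitool sel list` prints one entry per line with fields separated by `|`, which should be checked against the actual output on the server; the keyword list and function names are illustrative guesses, not something Dell specified in this thread.

```python
import subprocess


def parse_sel(text):
    """Split SEL listing text into entries, each a list of stripped '|'-separated fields."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4:          # skip blank or unexpectedly short lines
            entries.append(fields)
    return entries


def find_entries(entries, keywords=("IERR", "Machine Check", "Parity")):
    """Return entries whose text mentions any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        entry for entry in entries
        if any(keyword in " ".join(entry).lower() for keyword in lowered)
    ]


def read_sel():
    """Run `ipmitool sel list` and return the parsed entries (needs root and a loaded IPMI driver)."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        check=True, capture_output=True, text=True,
    )
    return parse_sel(result.stdout)


# Hypothetical usage:
#   for entry in find_entries(read_sel()):
#       print(" | ".join(entry))
```

Entries logged shortly before the matching ones are usually the interesting part, so it is worth printing a few neighbours of each hit rather than the hit alone.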
https://bugzilla.novell.com/show_bug.cgi?id=413842
User mcowley@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c34
Mark Cowley
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c35
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c36
--- Comment #36 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jdouglas@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c37
--- Comment #37 from Jason Douglas
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c38
--- Comment #38 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c39
--- Comment #39 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c40
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c41
--- Comment #41 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c42
--- Comment #42 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c43
--- Comment #43 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c44
--- Comment #44 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c45
--- Comment #45 from Sandeep K. Shandilya
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c46
--- Comment #46 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User schueffler@softgarden.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c47
--- Comment #47 from Stefan Schueffler
https://bugzilla.novell.com/show_bug.cgi?id=413842
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c48
--- Comment #48 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c49
--- Comment #49 from Sandeep K. Shandilya
https://bugzilla.novell.com/show_bug.cgi?id=413842
User sandeep_k_shandilya@dell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=413842#c50
--- Comment #50 from Sandeep K. Shandilya