[Bug 599147] New: dom0 crashes with hvm domUs
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c0

Summary: dom0 crashes with hvm domUs
Classification: openSUSE
Product: openSUSE 11.1
Version: Final
Platform: x86-64
OS/Version: openSUSE 11.1
Status: NEW
Severity: Major
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@novell.com
ReportedBy: koenig@linux.de
QAContact: qa@suse.de
Found By: ---
Blocker: ---

for the first time I try to use XEN with some HVM domUs (so far all 20-30 machines have been PVM which works quite fine). now I sometimes run HVM guests with different distros (debian4/centos4/... 32 and 64 bit) and so far this caused 5-6 dom0 crashes :-(

is this a known problem ? any fixes/workarounds I can try ?

how can I get some more information about the dom0 crashes ? serial console for the dom0 server is not possible...

# rpm -q kernel-xen xen xen-libs
kernel-xen-2.6.31.5-0.1.1.x86_64
xen-3.4.1_19718_04-2.1.x86_64
xen-libs-3.4.1_19718_04-2.1.x86_64

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c1
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c2
Harald Koenig
Please clarify whether you use 11.1 or 11.2, update your Dom0 kernel,
oops, my fault! that kernel/xen info was from the wrong server! this is the correct version info for the 11.1 xen server, all updates installed:

# rpm -q kernel-xen xen xen-libs
kernel-xen-2.6.27.45-0.1.1
xen-3.3.1_18546_20-0.1.1
xen-libs-3.3.1_18546_20-0.1.1

[I'm testing 11.2 as xen server too -- but those are different problems, see bug #599789 -- I'm trying to test 11.2 and 11.3-factory exactly because of those 11.1 HVM problems...]
and (if that didn't help) provide some sort of technical information (if serial is impossible and you can't reproduce this on another machine where you have serial, screen shots of the crash time output from hypervisor or kernel will be necessary). Otherwise we have no data to work with.
agreed... I'll try to setup some serial console support using IPMI with SOL support.
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c3
Jan Beulich
I'll try to setup some serial console support using IPMI with SOL support.
Restoring needinfo.
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c4
Harald Koenig
I'll try to setup some serial console support using IPMI with SOL support.
Restoring needinfo.
ok, the good news: the serial console for xen/dom0 works via ipmi/sol!

the bad news: I did not manage to crash the machine again using one (or more) hvm client(s) running the same compile benchmark as before :-(

but anyway I got some xen messages which *might* be helpful to you to give you a clue what's going on^H^Hwrong anyway.

the client to be tested is called "os-centos4u4" which is running 32bit centos 4u4 (os2-* clients are 64 bit, os-* are 32 bit...)

1st test (before full reboot of dom0 for a 2nd try), domU disks are on/from a remote iscsi server:

os-centos4u4 runs as dom31 and uses all 4 CPUs (2*dual-core xeon), 1 GB ram (phys 16 GB, 4GB left for dom0 right now). running a large compile job with "make -j6 -l8" we see many of those msgs:

(XEN) mm.c:767:d31 Error getting mfn 7f2e (pfn 3eb28c) from L1 entry 0000000007f2e063 for dom31
(XEN) printk: 382 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 18ad6 (pfn 3bea2)
(XEN) printk: 400 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 7f7b (pfn 3eb2d9)
(XEN) printk: 283 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 6800000000000001 != exp e000000000000000) for mfn 18aa2 (pfn 3bed6)
(XEN) printk: 276 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 31cde (pfn 2f24e)
(XEN) printk: 304 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 196f7 (pfn 3b281)
(XEN) printk: 366 messages suppressed.
(XEN) mm.c:767:d31 Error getting mfn 7f22 (pfn 3eb280) from L1 entry 0000000007f22063 for dom31
(XEN) printk: 385 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 18ad8 (pfn 3bea0)
(XEN) printk: 241 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 31cde (pfn 2f24e)
(XEN) printk: 329 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 7f7b (pfn 3eb2d9)
(XEN) printk: 291 messages suppressed.
(XEN) mm.c:767:d31 Error getting mfn 7f1f (pfn 3eb27d) from L1 entry 0000000007f1f063 for dom31
(XEN) printk: 145 messages suppressed.
(XEN) mm.c:767:d31 Error getting mfn 7f22 (pfn 3eb280) from L1 entry 0000000007f22063 for dom31
(XEN) printk: 154 messages suppressed.

I found this thread for those msgs -- but at least for me it was no real help ;-)

http://lists.xensource.com/archives/html/xen-devel/2010-04/msg00777.html

so I rebooted the whole server (dom0), all domUs shut down/restarted. this time the os-centos4u4 disks are local image files on the dom0 file system (as it has been for at least 2 crashes while benchmarking sw-builds with different xen setups).

now while running make/gcc/g++ on os-centos4u4 there are no xen msgs anymore -- and no dom0 crash so far since this morning :-(

BUT: at domU boot time I got the following xen msg (full "xm dmesg" attached...)
(XEN) mm.c:767:d6 Error getting mfn 90c47 (pfn 3623a5) from L1 entry 0000000090c47061 for dom6
(XEN) traps.c:466:d6 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 6 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-3.3.1_18546_20-0.1.1 x86_64 debug=n Not tainted ]----

where dom6 is the "1st startup" of os-suse111 -- this is "xm list" after reboot (os-suse111 got up as dom7 -- big surprise for me;)

# xm lis
Name             ID   Mem VCPUs  State   Time(s)
Domain-0          0  3823     4  r-----   6139.4
os-centos3u6      1  1024     4  -b----     16.3
os-centos4u4      2  1024     4  r-----  25609.6
os-centos5        3  1024     4  -b----     33.9
os-debian40       4  1024     4  -b----     17.9
os-sles11         5  1024     4  -b----     21.7
os-suse111        7  1024     4  -b----     22.3
os2-centos3u7     8  1024     4  -b----     18.0
os2-centos4u4     9  1024     1  -b----    527.0
os2-centos5      10  1024     4  -b----     37.6
os2-debian40     11  1024     4  -b----     18.1
os2-sles11       12  1024     4  -b----     25.4
os2-suse111      13  1024     4  -b----     24.7

from xend.log -- it shows that the immediate "restart" of os-suse111 as dom 7 (after dom 6 had crashed) finally worked:

[2010-04-28 11:25:41 4877] INFO (XendDomain:1175) Domain os-suse111 (6) unpaused.
[2010-04-28 11:25:41 4877] WARNING (XendDomainInfo:1645) Domain has crashed: name=os-suse111 id=6.
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:2446) XendDomainInfo.destroy: domid=6
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1971) Destroying device model
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1978) Releasing devices
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1991) Removing vif/0
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:921) XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1991) Removing console/0
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:921) XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1991) Removing vbd/768
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:921) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/768
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1991) Removing vbd/832
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:921) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/832
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1976) No device model
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:1978) Releasing devices
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:113) XendDomainInfo.create_from_dict({'vcpus_params': {'cap': 0, 'weight': 256}, 'PV_args': 'root=/dev/hda2', 'other_config': {}, 'features': '', 'cpus': [[], [], [], []], 'paused': 0, 'domid': 6, 'vcpu_avail': 15, 'VCPUs_live': 1, 'PV_bootloader': '/usr/lib/xen/boot/domUloader.py', 'actions_after_crash': 'restart', 'vbd_refs': ['0b5b3f14-4392-f143-66de-32f7892e2987', '2bb547ac-9e2d-9421-f4f9-a23a239eca23'], 'PV_ramdisk': '', 'is_control_domain': False, '_temp_ramdisk': '/var/lib/xen/tmp/ramdisk.PwU6cM', 'name_label': 'os-suse111', 'VCPUs_at_startup': 1, 'HVM_boot_params': {}, 'platform': {}, 'PV_kernel': '', 'console_refs': ['989fba9e-6019-0fb2-fb3a-d76928b5e02a'], 'online_vcpus': 1, 'vif_refs': ['ac0b86d0-b951-5aae-e3d3-6ac2cd5b7edc'], 'blocked': 0, 'on_xend_stop': 'ignore', 
'shutdown': 0, 'HVM_boot_policy': '', 'shutdown_reason': 3, 'VCPUs_max': 4, 'start_time': 1272446739.491657, 'memory_static_max': 2147483648, 'actions_after_shutdown': 'destroy', 'on_xend_start': 'ignore', 'crashed': 0, 'memory_dynamic_max': 1073741824, 'actions_after_suspend': '', 'is_a_template': False, 'memory_dynamic_min': 1073741824, '_temp_args': 'root=/dev/hda2', 'cpu_time': 0.000237376, 'shadow_memory': 0, 'memory_static_min': 0, 'dying': 0, 'PV_bootloader_args': '--entry=hda2:/boot/vmlinuz-xen,/boot/initrd-xen', 'notes': {'HV_START_LOW': 4118806528, 'FEATURES': 'writable_page_tables|writable_descriptor_tables|auto_translated_physmap|pae_pgdir_above_4gb|supervisor_mode_kernel', 'VIRT_BASE': 3221225472, 'GUEST_VERSION': '2.6', 'PADDR_OFFSET': 0, 'GUEST_OS': 'linux', 'HYPERCALL_PAGE': 3222278144, 'LOADER': 'generic', 'SUSPEND_CANCEL': 1, 'PAE_MODE': 'yes', 'ENTRY': 3222274048, 'XEN_VERSION': 'xen-3.0'}, '_temp_kernel': '/var/lib/xen/tmp/kernel.IsD6mr', 'uuid': '21309d59-c939-48e6-e4b1-9c8b8dd9d0e2', 'actions_after_reboot': 'restart', '_temp_using_bootloader': '1', 'target': 0, 'running': 0, 'vtpm_refs': [], 'devices': {'ac0b86d0-b951-5aae-e3d3-6ac2cd5b7edc': ('vif', {'bridge': 'br0', 'mac': '00:0c:29:a6:33:18', 'devid': 0, 'model': 'rtl8139', 'uuid': 'ac0b86d0-b951-5aae-e3d3-6ac2cd5b7edc'}), '2bb547ac-9e2d-9421-f4f9-a23a239eca23': ('vbd', {'uuid': '2bb547ac-9e2d-9421-f4f9-a23a239eca23', 'bootable': 0, 'devid': 832, 'driver': 'paravirtualised', 'dev': 'hdb', 'uname': 'file:/etc/xen/images/os-suse111_builddisk-flat.vmdk', 'mode': 'w'}), '0b5b3f14-4392-f143-66de-32f7892e2987': ('vbd', {'uuid': '0b5b3f14-4392-f143-66de-32f7892e2987', 'bootable': 1, 'devid': 768, 'driver': 'paravirtualised', 'dev': 'hda', 'uname': 'file:/etc/xen/images/os-suse111-flat.vmdk', 'mode': 'w'}), '989fba9e-6019-0fb2-fb3a-d76928b5e02a': ('console', {'other_config': {}, 'protocol': 'vt100', 'uuid': '989fba9e-6019-0fb2-fb3a-d76928b5e02a', 'location': '2'})}}) [2010-04-28 11:25:41 4877] 
DEBUG (XendDomainInfo:2068) XendDomainInfo.constructDomain
[2010-04-28 11:25:41 4877] DEBUG (balloon:151) Balloon: 493372 KiB free; need 2048; done.
[2010-04-28 11:25:41 4877] DEBUG (XendDomain:450) Adding Domain: 7
[2010-04-28 11:25:41 4877] DEBUG (XendDomainInfo:2232) XendDomainInfo.initDomain: 7 256
[2010-04-28 11:25:41 8291] DEBUG (XendBootloader:117) Launching bootloader as ['/usr/lib/xen/boot/domUloader.py', '--args=root=/dev/hda2', '--output=/var/run/xend/boot/xenbl.24257', '--entry=hda2:/boot/vmlinuz-xen,/boot/initrd-xen', '/etc/xen/images/os-suse111-flat.vmdk'].
[2010-04-28 11:25:42 4877] DEBUG (XendDomainInfo:2262) _initDomain:shadow_memory=0x0, memory_static_max=0x80000000, memory_static_min=0x0.
[2010-04-28 11:25:42 4877] DEBUG (balloon:151) Balloon: 1061832 KiB free; need 1057280; done.
[2010-04-28 11:25:42 4877] INFO (image:166) buildDomain os=linux dom=7 vcpus=4
[2010-04-28 11:25:42 4877] DEBUG (image:642) domid = 7
[2010-04-28 11:25:42 4877] DEBUG (image:643) memsize = 1024
[2010-04-28 11:25:42 4877] DEBUG (image:644) image = /var/lib/xen/tmp/kernel.iQDwuu
[2010-04-28 11:25:42 4877] DEBUG (image:645) store_evtchn = 1
[2010-04-28 11:25:42 4877] DEBUG (image:646) console_evtchn = 2
[2010-04-28 11:25:42 4877] DEBUG (image:647) cmdline = root=/dev/hda2

maybe this 1st crash for the 32bit 11.1 pvm client can be a hint for my problem in bug #599789 starting exactly this pvm domU with 11.2/11.3 ?!? I'll see once I can test 11.2/11.3 server again now with this console log (now knowing that before I should have looked at least into "xm dmesg" ;-)

unfortunately I'm now off for one week for a conference. likely I'll be online sometimes and can give more data, but I won't be able (by policy -- not for technical reasons anymore thanks to IPMI;-)) to run any tests or reboot while being "remote"...
feel free to set to "NEEDINFO" again -- I'll report any information about crashes as soon as they are available (but for the next week everything will run as PVM so very likely it's all rock stable...)
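
Editorial sketch: because the hypervisor rate-limits its console output, the "(XEN) printk: N messages suppressed." lines carry most of the volume information in the log above. The following is a minimal Python sketch, not part of the original report, that tallies the printed messages per source location and domain and sums the suppressed counts. The function name `tally` and the regular expressions are assumptions based only on the message shapes quoted in this comment; the sample is a short excerpt of that log.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted above (an assumption, not a
# documented Xen log grammar).
MSG_RE = re.compile(r"\(XEN\) (\w+\.c):(\d+):d(\d+) ")
SUPPRESSED_RE = re.compile(r"\(XEN\) printk: (\d+) messages suppressed\.")


def tally(log_text):
    """Count printed messages per (file:line, domain id) and sum suppressed ones."""
    printed = Counter()
    suppressed = 0
    for line in log_text.splitlines():
        msg = MSG_RE.search(line)
        if msg:
            source, lineno, domid = msg.groups()
            printed[(f"{source}:{lineno}", int(domid))] += 1
            continue
        sup = SUPPRESSED_RE.search(line)
        if sup:
            suppressed += int(sup.group(1))
    return printed, suppressed


# Excerpt of the messages reported in this comment.
SAMPLE = """\
(XEN) mm.c:767:d31 Error getting mfn 7f2e (pfn 3eb28c) from L1 entry 0000000007f2e063 for dom31
(XEN) printk: 382 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 18ad6 (pfn 3bea2)
(XEN) printk: 400 messages suppressed.
(XEN) mm.c:2270:d31 Bad type (saw 2800000000000001 != exp e000000000000000) for mfn 7f7b (pfn 3eb2d9)
(XEN) printk: 283 messages suppressed.
"""

printed, suppressed = tally(SAMPLE)
for (location, domid), count in sorted(printed.items()):
    print(f"{location} dom{domid}: {count} printed")
print(f"suppressed in total: {suppressed}")
```

Run against a full serial capture, a summary like this could accompany the attached log; it is no substitute for the log itself.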
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c5
--- Comment #5 from Harald Koenig
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c6
--- Comment #6 from Harald Koenig
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c7
--- Comment #7 from Harald Koenig
BUT: at domU boot time I got the follow xen msg (full "xm dmesg" attached...)
gaaa -- the attachments went to the "wrong" window for bug #599789, so I'll attach them again (for the future: is it possible to associate one attachment with two bug ids ?)
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c8
--- Comment #8 from Jan Beulich
but anyway I got some xen messages which *might* be helpful to you to give you a clue what's going on^H^Hwrong anyway.
No - as long as only foreign (non-SuSE) guests cause these messages, we'll have to direct you to the provider of those kernels. These messages indicate something wrong in their kernel (seems like pages used as page tables don't get cleaned up properly).
(XEN) mm.c:767:d6 Error getting mfn 90c47 (pfn 3623a5) from L1 entry 0000000090c47061 for dom6
(XEN) traps.c:466:d6 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 6 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-3.3.1_18546_20-0.1.1 x86_64 debug=n Not tainted ]----
Still - not enough info for analysis (you cut off the register/stack dump, which is the really important part if we want to understand what causes those crashes - with the above I can only guess that the guest kernel hit a BUG() somewhere). I'd suggest just providing the full log collected over serial. With you stating that you can't reproduce the original behavior anymore, I also wonder how reproducible your problems are in general (both with regard to successive sessions on the same machine and between different machines). And please attach logs or more-than-a-few-lines fragments of them rather than putting them inline.
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c9
Jan Beulich
Created an attachment (id=357397) --> (http://bugzilla.novell.com/attachment.cgi?id=357397) [details] xm dmesg output
The faulting address (e019:c01154cd) doesn't correspond to any instruction boundary in 2.6.27.45-0.1. Are you sure your 11.1 guest is fully updated?
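
Editorial sketch: a first step in checking a faulting address such as c01154cd against a kernel build is to find the symbol that contains it. The Python sketch below, which is not part of the original comment, does this with a binary search over System.map-style lines ("address type name"). The symbol names and addresses in `SAMPLE_MAP` are invented for illustration and do not come from any real kernel; `parse_system_map` and `lookup` are hypothetical helper names. Confirming an actual instruction boundary would additionally require disassembling the matching kernel image.

```python
import bisect


def parse_system_map(text):
    """Return sorted (address, name) pairs from System.map-style text."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            addr, _sym_type, name = parts
            symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols, address):
    """Return (name, offset) of the last symbol at or below address, else None."""
    addrs = [addr for addr, _name in symbols]
    index = bisect.bisect_right(addrs, address) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, address - base


# Hypothetical excerpt: these symbols are made up for the example.
SAMPLE_MAP = """\
c0115400 T example_func_a
c0115500 T example_func_b
c0115680 t example_func_c
"""

symbols = parse_system_map(SAMPLE_MAP)
hit = lookup(symbols, 0xC01154CD)
if hit is not None:
    name, offset = hit
    print(f"c01154cd lies in {name}+{offset:#x}")
```

The same address resolves to different symbols and offsets in different kernel builds, which is why the guest's exact kernel version matters for the analysis.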
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c10
Harald Koenig
The faulting address (e019:c01154cd) doesn't correspond to any instruction boundary in 2.6.27.45-0.1. Are you sure your 11.1 guest is fully updated?
I'm sure it's not! it's the raw installation of the release DVD, *no* updates installed:

# rpm -qa kernel-xen\*
kernel-xen-base-2.6.27.7-9.1
kernel-xen-extra-2.6.27.7-9.1
kernel-xen-2.6.27.7-9.1

typically for sw build machines we try not to install any updates to be upwards compatible with everyone -- customers may not have any/all updates and from time to time we got bitten by incompatible (not downward-compatible) updates so that our sw did not run anymore on machines with fewer updates...

talking about the kernel-xen, an update shouldn't be a problem here if this can solve a "known problem" ... I'll update next week and report again...
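
Editorial sketch: the guest kernel reported here (2.6.27.7-9.1) is older than the dom0 kernel reported earlier in the thread (2.6.27.45-0.1.1), although a plain string comparison suggests the opposite. The short Python sketch below, not part of the original comment, compares the two numerically. `version_key` is a hypothetical helper that only handles purely numeric version-release strings like these two; it is deliberately simpler than RPM's own version comparison.

```python
import re


def version_key(version):
    """Split a numeric version-release string into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


guest = "2.6.27.7-9.1"      # kernel-xen in the 11.1 guest (this comment)
dom0 = "2.6.27.45-0.1.1"    # kernel-xen on the 11.1 dom0 (earlier comment)

print(version_key(guest) < version_key(dom0))  # numeric comparison
print(guest < dom0)                            # naive string comparison
```

The numeric comparison treats 7 as smaller than 45, while the string comparison compares the characters "7" and "4" and gets the order wrong.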
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c11
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c12
--- Comment #12 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=599147
http://bugzilla.novell.com/show_bug.cgi?id=599147#c13
--- Comment #13 from Harald Koenig
Ping?
I'm "offline" right now on holiday (and next week I'm at LinuxTag in Berlin).

my last status: there was one more dom0 crash when trying to start WinXP (hvm). unfortunately at that time there was no log running for serial console:-(

now I log the serial all the time -- and there was no more crash so far:(

I'll keep you informed (and will leave the NEEDINFO for now;)...
https://bugzilla.novell.com/show_bug.cgi?id=599147
https://bugzilla.novell.com/show_bug.cgi?id=599147#c14
Jan Beulich