[Bug 687368] New: Loop device gets stuck when using vm-install to create Xen domU
https://bugzilla.novell.com/show_bug.cgi?id=687368

https://bugzilla.novell.com/show_bug.cgi?id=687368#c0

Summary: Loop device gets stuck when using vm-install to create Xen domU
Classification: openSUSE
Product: openSUSE 11.4
Version: Factory
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: jfunk@funktronics.ca
QAContact: qa@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=424801)
 --> (http://bugzilla.novell.com/attachment.cgi?id=424801)
/var/log/messages

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.29 SUSE/12.0.731.0 (KHTML, like Gecko) Chrome/12.0.731.0 Safari/534.29

The loop device of a domU seems to get stuck when using vm-install to install a domU, specifically CentOS 5. I tried this on two different machines, and I am getting the same result.

Reproducible: Always

Steps to Reproduce:
1. Open vm-install
2. Create a new virtual machine, select RHEL5 as the type, paravirt mode, default disk image file, and this installation URL: http://centos.mirror.facebook.net/5/os/x86_64/
3. Complete the installation

Actual Results:
Once the OS is installed, the domU reboots, and never comes back. The loop device appears to be stuck at this point. If I do a "losetup -a" the command goes into D state and never returns.

Expected Results:
The domU should have rebooted.

After the losetup hang, I sent a few commands to /proc/sysrq-trigger. The results are near the bottom. The loop0, umount, and losetup call traces are particularly interesting.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
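For anyone trying to confirm the "D state" symptom described in the report, the following is a minimal sketch, not something taken from the report itself. It assumes a Linux /proc layout in which each /proc/<pid>/stat record starts with "pid (comm) state"; the helper names parse_stat and uninterruptible_tasks are made up for illustration.

```python
import os


def parse_stat(text):
    """Split a /proc/<pid>/stat record into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the last ')' is used as the delimiter.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for tasks currently in state D."""
    if not os.path.isdir(proc_root):
        return []
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the record was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return sorted(tasks)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

On a machine showing the hang, a stuck losetup or umount would be expected to appear in this list on repeated runs; a task that shows up only once is more likely just waiting on ordinary I/O.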
https://bugzilla.novell.com/show_bug.cgi?id=687368#c1
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=687368#c2
--- Comment #2 from Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=687368#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=687368#c3
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=687368#c4
--- Comment #4 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c5
Lele Forzani
https://bugzilla.novell.com/show_bug.cgi?id=687368#c
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=687368#c6
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c7
Roger Cruz
Not NULL, but pointing to memory that is (no longer?) present. Without /var/log/boot.msg (or whatever its equivalent is on CentOS) we won't be able to tell whether the page was there initially.
Quite possibly a dangling pointer, particularly because the stack trace seems to indicate attempts to use xsave, support for which is off by default in the hypervisor (and the kernel shouldn't even detect its availability) mostly because it's incompatible with live migration up to and including Xen 4.1.0.
Hi Jan,

We've been seeing the same type of exception and same stack trace when running with 2.6.38 PV-OPS kernel on top of Xen 4.0. Unfortunately, all I have is a picture of the stack trace so it won't help you with the information you asked for here.

My question for you is regarding this statement "attempts to use xsave, support for which is off by default in the hypervisor"... How can I confirm that xsave support is turned off? This code only gets executed when the xsaveopt feature is present in the CPU and when in dom0 I do a 'cat /proc/cpuinfo', I can see the xsaveopt capability is turned on so the dom0 kernel is using it.

flags : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes avx hypervisor lahf_lm ida arat epb xsaveopt pln pts dts
https://bugzilla.novell.com/show_bug.cgi?id=687368#c8
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c9
--- Comment #9 from Roger R. Cruz
https://bugzilla.novell.com/show_bug.cgi?id=687368#c10
--- Comment #10 from Jan Beulich
> Thanks for looking into this. I am still wondering, though, why it doesn't work even when the flag is ON
Which "the flag"? You said you run on 4.0, and turning on xsave support in that hypervisor is, hmm, adventurous (depending on how many post-4.0 patches you have backported).
> and recognized by the PV OPS kernel. Looking at
I generally can't comment on the pv-ops kernel (and in particular, as none of our products contains one, here - this would be better discussed on e.g. xen-devel).
> the kernel code, it seems that there is code to handle the XSAVEOPT capability. Why is it that we get a page for the fpu.state whose PTE is 0? And why wouldn't this problem happen all the time? It is very sporadic for us as to when it happens...
As said originally - perhaps just a dangling pointer that once in a while refers to a page ballooned out of Dom0.
https://bugzilla.novell.com/show_bug.cgi?id=687368#c11
--- Comment #11 from Roger R. Cruz
"While the xsave feature flag is clear, the xsaveopt one is set, allowing the code to do things it shouldn't."
I was referring to the xsaveopt flag which shows up in my /proc/cpuinfo. We are running 4.0.1 and we have back-ported a few bug fixes from mainline.
"As said originally - perhaps just a dangling pointer that once in a while refers to a page ballooned out of Dom0"
Yes, that is what it feels like, based on the fact that some VMs couldn't start because there wasn't enough memory.

Thanks again.
Roger R. Cruz
https://bugzilla.novell.com/show_bug.cgi?id=687368#c12
--- Comment #12 from Jan Beulich
> I was referring to the xsaveopt flag which shows up in my /proc/cpuinfo. We are running 4.0.1 and we have back-ported a few bug fixes from mainline.
As per what you indicated earlier? There was no "xsave" feature shown, and that tells me that you didn't enable xsave in the hypervisor. Hence you can't expect Dom0 to be able to use anything xsave related.

(Yes, the hypervisor shouldn't expose xsaveopt without xsave, and I submitted a patch to that effect earlier today. And yes, it seems bogus that the kernel uses the xsaveopt feature indication without qualifying with the xsave one, but that's the way it got coded, so we've got to work around this in the Xen bits.)
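The inconsistency described above (xsaveopt advertised while xsave is not) can be checked mechanically. The following is a minimal sketch, not part of the original exchange: it assumes the usual "flags : ..." line format of /proc/cpuinfo, the function names are made up for illustration, and the sample string is the flags line quoted in comment #7.

```python
def parse_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def xsaveopt_without_xsave(flags):
    """True when xsaveopt is advertised but xsave itself is not."""
    return "xsaveopt" in flags and "xsave" not in flags


# Flags line as quoted in comment #7 of this bug.
REPORTED = (
    "flags : fpu de tsc msr pae cx8 apic sep cmov pat clflush acpi mmx "
    "fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl "
    "nonstop_tsc aperfmperf pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 "
    "x2apic popcnt aes avx hypervisor lahf_lm ida arat epb xsaveopt pln "
    "pts dts"
)

if __name__ == "__main__":
    print("reported flags inconsistent:",
          xsaveopt_without_xsave(parse_flags(REPORTED)))
    try:
        with open("/proc/cpuinfo") as f:
            local = parse_flags(f.read())
        print("local flags inconsistent:", xsaveopt_without_xsave(local))
    except OSError:
        pass  # no /proc/cpuinfo on this system
```

For the flags quoted in comment #7 the check reports the inconsistent combination, since "xsaveopt" is listed and "xsave" is not. It says nothing about whether the hypervisor was built or booted with xsave support; it only reflects what the kernel was shown.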
https://bugzilla.novell.com/show_bug.cgi?id=687368#c13
--- Comment #13 from Roger R. Cruz
https://bugzilla.novell.com/show_bug.cgi?id=687368#c14
--- Comment #14 from Roger R. Cruz
https://bugzilla.novell.com/show_bug.cgi?id=687368#c15
--- Comment #15 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c16
--- Comment #16 from Roger R. Cruz
https://bugzilla.novell.com/show_bug.cgi?id=687368#c17
--- Comment #17 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c18
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=687368#c19
Swamp Workflow Management