[Bug 553690] New: Xenified kernel crashes during F12 PV DomU's install packages deployment phase
http://bugzilla.novell.com/show_bug.cgi?id=553690

Summary: Xenified kernel crashes during F12 PV DomU's install packages deployment phase
Classification: openSUSE
Product: openSUSE 11.2
Version: RC 2
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@novell.com
ReportedBy: bderzhavets@yahoo.com
QAContact: qa@suse.de
Found By: ---

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4) Gecko/20091027 Fedora/3.5.4-1.fc12 Firefox/3.5.4

The screen (and the system as a whole) freezes during an F12 PV DomU install via this profile:

[root@fedora12sda vms]# cat f12.install
name="F12PV"
memory=1024
disk = ['phy:/dev/sda8,xvda,w' ]
vif = [ 'bridge=br0' ]
vfb = [ 'type=vnc,vncunused=1']
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

where the installation vmlinuz and initrd.img were obtained via wget from an F12 HTTP mirror (local or remote, it makes no difference).

# vncviewer localhost:0

starts the installer fine and lets it run up to the package deployment phase. Three attempts have been made with the same result: the system freezes when the DomU install reaches the package deployment phase, with the three LEDs of the numeric pad blinking.

Message from syslogd@dhcppc5
kernel: [960.891531] Call trace
Code: 00 00 85 c0 75 ad 4d . . . . . . . . . . . .
kernel: [960.891593] CR2: ffffffffffffcb0

I've also tried a text mode install, to avoid VNC console output, with this profile:

# cat f12.instext
name="F12PV"
memory=1024
disk = ['phy:/dev/sda8,xvda,w' ]
vif = [ 'bridge=br0' ]
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

# xm create -c f12.instext

The system crashes in the package deployment phase again. Setting up a serial console to capture the kernel trace might take several days.

Reproducible: Always

Steps to Reproduce: Attempt to create F12 PV DomU.

Actual Results: Xenified kernel crashes.
Expected Results: PV DomU gets installed

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c1
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c2
--- Comment #2 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c3
--- Comment #3 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c4
--- Comment #4 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c5
--- Comment #5 from Jan Beulich
> End of output of kernel console in case of crash :-
> do_page_fault
> page_fault
> memcpy_c
> swiotlb_balance
> swiotlb_bounce ???
> unmap_single
> swiotlb_umap_sg_attrs
> _ata_sg_clean
> __ata_qc_comlete
> _ata_qc_comlete
> _ata_qc_comlete_multiple
> ahci_interrupt
> handle_IRQ_event
> handle_level_irq
> evtchn_do_upcall
> do_hypercall_callback
> 0xfffff...f802063ac
> xen_safe_halt
> xen_idle
> cpu_idle
> rest_init
> start_kernel
> x86_64_start_reservations
> x86_64_start_kernel
Just the function names don't tell much, unfortunately. However, it seems inconsistent that you have unmap_single() and memcpy_c() on the stack: swiotlb_bounce() calls memcpy() only for DMA_TO_DEVICE, but do_unmap_single() passes DMA_FROM_DEVICE. This may indicate there's earlier corruption, and hence we'll get nowhere without seeing the full hypervisor and kernel logs, i.e. we need to wait for you to set up serial. Btw., does this also occur for file:/ backed guest disks?
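To make the file:/ question concrete, a file-backed variant of the reporter's text-mode profile might look like the sketch below. xm domain configs are evaluated as Python, so it is written in that syntax. Only the disk line differs from f12.instext; the image path is hypothetical and would have to point at an image file created beforehand.

```python
# Hypothetical file-backed variant of the f12.instext profile.
# Everything except the disk line is copied from the report; the image
# path is an assumption and must refer to an existing image file.
name = "F12PV"
memory = 1024
disk = ['file:/var/lib/xen/images/f12pv.img,xvda,w']  # was: phy:/dev/sda8,xvda,w
vif = ['bridge=br0']
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus = 1
on_reboot = 'restart'
on_crash = 'restart'
```

If the crash goes away with this profile, that would point toward the phy: path through the Dom0 block layer rather than the guest itself.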
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c6
--- Comment #6 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c7
--- Comment #7 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c8
--- Comment #8 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c9
--- Comment #9 from Jan Beulich
(XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9
(XEN) mm.c:4206:d0 Global bit is set to kernel page 4736f480a0
are the indicators of the beginning problems (in the log this is followed by severe problems in the SATA driver, likely because of interrupts no longer arriving). The frame numbers, however, are completely bogus, and I'm unaware of any code path in our kernel that could lead to the global bit being set on a kernel page. Unless you can assist with debugging this, I don't think we can do much here without reproducing this internally.
Linux version 2.6.31.5-0.1-desktop (geeko@buildhost) (gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) ) #3 SMP Sat Nov 7 13:41:03 EST 2009
But - what kernel was this created with? Our Xen kernels should call themselves -xen, not -desktop. Did you build this yourself? We need you to use the provided kernel in order for the logs to be useful for analysis. And if rebuilding the kernel is indeed unavoidable, it'd be nice for the tag to identify it clearly as such (we do have a -desktop kernel flavor). Finally, for eventual future logs, I'd like to ask you to avoid (if possible) making the logs as redundant as this one (most messages are there several times, which likely is a result of your use of the various command line options).
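As a rough illustration of why the frame numbers in the two "Global bit" messages look bogus, the sketch below pulls them out of a log and compares them with the number of 4 KiB frames in 8 GiB of RAM. The 8 GiB figure is an assumption taken from the 4x2GB configuration the reporter mentions elsewhere in this bug, the function name is made up for this example, and the check is only a heuristic, since real physical address maps contain holes.

```python
import re

PAGE_SHIFT = 12  # 4 KiB pages on x86-64

# Message format copied from the two hypervisor log lines quoted in this comment.
GLOBAL_BIT_RE = re.compile(
    r"\(XEN\) mm\.c:\d+:d(\d+) Global bit is set to kernel page ([0-9a-f]+)"
)

def suspicious_frames(log_text, ram_bytes):
    """Return (domain, frame) pairs whose frame number lies beyond ram_bytes."""
    max_frame = ram_bytes >> PAGE_SHIFT
    hits = []
    for dom, frame_hex in GLOBAL_BIT_RE.findall(log_text):
        frame = int(frame_hex, 16)
        if frame >= max_frame:
            hits.append((int(dom), frame))
    return hits

log = (
    "(XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9\n"
    "(XEN) mm.c:4206:d0 Global bit is set to kernel page 4736f480a0\n"
)
# Assumption: 8 GiB of installed RAM.
for dom, frame in suspicious_frames(log, ram_bytes=8 << 30):
    print("d%d: frame %#x is far outside 8 GiB of RAM" % (dom, frame))
```

Both values exceed the roughly two million frames such a machine has by many orders of magnitude, which fits the suspicion of earlier corruption.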
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c10
--- Comment #10 from Jan Beulich
[ 1380.923876] Thread overran stack, or stack corrupted
Without knowing what kernel this is there's nothing we can do here.
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c11
--- Comment #11 from Boris Derzhavets
Okay, the second log indeed comes closer to your previous description, and we see
[ 1380.923876] Thread overran stack, or stack corrupted
Without knowing what kernel this is there's nothing we can do here.
View bug:- https://bugzilla.novell.com/show_bug.cgi?id=552492 ------- Comment #20 From Jan Beulich 2009-11-06 02:23:13 MST (-) ------- (In reply to comment #18)
Sorry, my experience with Suse is limited. I would be glad to test patch with step by step instruction. I need xen-kernel source installed on the machine , but don't know where to get kernel-source-???.x86_64.rpm ( i suspect kernel-xen-source ...)
Patch for X-server suggested by you
*************************************
If you don't need the exact RC2 kernel, you could try
ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/
*****************
Go to this link
*****************
Index of ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/

Name                                                          Size      Last Modified
kernel-source-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm           68957 KB  11/09/2009 04:02:00 PM
kernel-source-vanilla-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm   67280 KB  11/09/2009 04:02:00 PM
kernel-source-vanilla.rpm                                               11/09/2009 11:04:00 PM
kernel-source.rpm                                                       11/09/2009 11:04:00 PM

I downloaded and installed kernel-source.rpm (suggested by you):

total 12
lrwxrwxrwx  1 root root   32 Nov 7 04:59 linux -> linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x 25 root root 4096 Nov 8 08:10 linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x  8 root root 4096 Oct 27 20:11 packages
-rw-r--r--  1 root root 2140 Nov 7 05:01 v32.patch1

dhcppc2:~ # cd /usr/src/linux

Applied your patch and built a xenified kernel:

dhcppc2:/usr/src/linux # make menuconfig    (tuned as xenified, as usual for rebased ones)
# make -j4
# make modules_install install

It appears to be named 2.6.31.5-0.1-desktop, works under Xen, and brings up the X server with no memory limit. I don't think it's important what name it has. The Xen patches and v32.patch1 obviously come from you. From my side there was only "make menuconfig" to activate the Xen Dom0 kernel feature.
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c12
--- Comment #12 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c13
--- Comment #13 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c14
--- Comment #14 from Boris Derzhavets
> But I never said to use this self built kernel for reporting other problems, even more if you didn't even use our .config

How could I use your .config, which didn't exist under /usr/src/linux?
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c15
--- Comment #15 from Jan Beulich
> How could I use your .config, which didn't exist under /usr/src/linux?
By either installing the full set of kernel-* packages (it moved a number of times, so I'm not sure it's kernel-devel, but I'd guess it is), or more trivially by reading /proc/config.gz while our Xen kernel is running.
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c16
--- Comment #16 from Boris Derzhavets
> trivially by reading /proc/config.gz while our Xen kernel is running.

Thanks. If I understand you right,

    gunzip /proc/config.gz /usr/src/linux/.config

should give the .config to build the kernel as you did.
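A note on the command as quoted: gunzip given two file names treats both as files to decompress rather than as source and destination, so it would not write a .config. The usual shell form is `zcat /proc/config.gz > /usr/src/linux/.config`. The same step as a minimal Python sketch follows; the helper name is made up for this example, and it assumes the running kernel exposes /proc/config.gz at all.

```python
import gzip
import shutil

def extract_kernel_config(src="/proc/config.gz", dst=".config"):
    """Decompress the running kernel's configuration from src into dst."""
    # /proc/config.gz is read-only, so stream it into a separate output file
    # instead of trying to decompress it in place.
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dst
```

Called as `extract_kernel_config(dst="/usr/src/linux/.config")` while the distribution Xen kernel is running, and followed by `make oldconfig` in the source tree, this should give a build configuration matching the shipped kernel.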
http://bugzilla.novell.com/show_bug.cgi?id=553690
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c17
--- Comment #17 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c18
--- Comment #18 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
User bderzhavets@yahoo.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=553690#c19
--- Comment #19 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
--- Comment #20 from Jan Beulich
> Kernel linux-2.6.31.5-0.1.99.21.446052c has been rebuilt with the .config obtained via gunzip of /proc/config.gz on the SUSE 11.2 final Xen instance running on the same box, dual-booting with the first one (mem=4G). It generates the same serial log for the crash.

Perhaps a similar one... We'll need the full log of this anyway, together with a pointer to where the kernel binaries used (in particular, vmlinux and any modules involved in the backtrace) live, in order to be able to analyze it. Stack overrun/corruption unfortunately isn't the easiest thing to debug...
http://bugzilla.novell.com/show_bug.cgi?id=553690
--- Comment #21 from Boris Derzhavets
> together with a pointer where the kernel binaries (in particular, vmlinux and
> any modules involved in the backtrace) used live,

3. Does it mean that you want to run
# gdb vmlinux
# disassemble particular_module_from_trace
Search for certain [
http://bugzilla.novell.com/show_bug.cgi?id=553690
--- Comment #22 from Jan Beulich
> 1. Please confirm that you want me to submit the serial log of the crash with kernel linux-2.6.31.5-0.1.99.21.446052c, built with your .config for the xenified kernel.

Of course I'd prefer you to use a pre-built kernel (in which case I could just retrieve the binaries I need for analysis myself), but short of that I'm indeed asking for some other consistent pair of (log,kernel).

> 2. If you point me to any other kernel-source-xen.rpm, I can build another kernel (probably more recent) with your .config for the xenified kernel and the previously applied patch for the X server, and obtain a serial log for that kernel.

Other than the KOTD I pointed you at above there's nothing I'm aware of until the first maintenance update kernel eventually gets released.

> 3. Does it mean that you want to run
> # gdb vmlinux
> # disassemble particular_module_from_trace
> Search for certain [
> ] mentioned in the stack trace of the serial log

Something along those lines, yes, but also things beyond that (like associating source level variables with registers or stack locations). Btw., to be maximally useful here (given that we're at least suspecting stack overrun/corruption), it would be a good idea for you to include "kstack=1024" on the Dom0 (kernel) command line.
http://bugzilla.novell.com/show_bug.cgi?id=553690#c23
--- Comment #23 from Boris Derzhavets
(In reply to comment #21)
> 1. Please confirm that you want me to submit the serial log of the crash with kernel linux-2.6.31.5-0.1.99.21.446052c, built with your .config for the xenified kernel.

> Of course I'd prefer you to use a pre-built kernel (in which case I could just retrieve the binaries I need for analysis myself), but short of that I'm indeed asking for some other consistent pair of (log,kernel).
> Btw., to be maximally useful here (given that we're at least suspecting stack overrun/corruption), it would be a good idea for you to include "kstack=1024" on the Dom0 (kernel) command line.

Serial log of the F12 DomU crash submitted as requested; kstack=1024 is included in the xen kernel command line. See the attachment.
http://bugzilla.novell.com/show_bug.cgi?id=553690#c24
--- Comment #24 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690#c25
--- Comment #25 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690#c26
--- Comment #26 from Boris Derzhavets
> So where can I pick up the corresponding binary?

Sorry, I am just an independent consultant. As regards the current issue, I can only try to upload vmlinux and vmlinux.o via ftp to a location of yours. I don't have a personal site registered in DNS.
http://bugzilla.novell.com/show_bug.cgi?id=553690#c27
--- Comment #27 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690#c28
--- Comment #28 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690#c29
--- Comment #29 from Boris Derzhavets
> Then try it via mail attachment (just vmlinux, perhaps compressed).

Done.
http://free.mailbigfile.com/0982c32dc12fc361ae43d945fc43bdab/listFiles.php
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c30
--- Comment #30 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c31
--- Comment #31 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c32
--- Comment #32 from Boris Derzhavets
> This log together with one of the earlier provided ones makes it clear that swiotlb code is being instructed to write over a page table, due to running off the end of a valid buffer. It is not clear however whether the buffer was originally specified improperly, or whether stored data got corrupted e.g. during I/O. Since I can't reproduce the issue myself, I'm hoping that you would be able to rebuild your kernel with the debugging patch just attached, and then try and see whether it captures the problem any earlier (and of course doesn't have any adverse side effects).

Will try.

> Do you, btw, also run into this issue when using mem=4G on the Xen command line?

The installer hangs while downloading the installation image. I just cannot get that far.

> Also I assume you're not having the machine do any other things while starting the guest?

Sure.

> And from the last log I'm having the impression that only the third guest that got started actually crashed the machine - were the first two of different type, or does the problem not always occur?

The problem always occurs. The first F12 guest, once installed, crashed Dom0 either via the pygrub profile (guest's /boot of ext3fs type) or via the regular xm profile (guest's /boot of ext4fs type) when attempting to load the DomU from the already built image. I reproduced it twice with F12 (final release guest). This time I got past the package deployment phase via the installation profile, shut down the DomU, then attempted to load it, and it crashed right away in both cases. I submitted only one serial log.

> It would also be nice if you attached "lspci -nn" output for the machine, unless you know the problem is present on two sufficiently different ones.

No problem. It's a C2D E8400, ASUS P5Q3, 4x2GB Kingston 1333, SATA 250 GB Seagate Barracuda.
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c33
--- Comment #33 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c34
--- Comment #34 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c35
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c36
--- Comment #36 from Boris Derzhavets
> Created an attachment (id=328617) --> (http://bugzilla.novell.com/attachment.cgi?id=328617) [details] debugging patch (kernel, v2)
> Sorry, oversight on my part. Should be better now.

Revert V1 and apply V2?
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c37
--- Comment #37 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c38
--- Comment #38 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c39
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c40
--- Comment #40 from Boris Derzhavets
> Created an attachment (id=328910) --> (http://bugzilla.novell.com/attachment.cgi?id=328910) [details] debugging patch (kernel, v3)
> As I understand it, v2 still brought the machine down too early. I hope that v3 finally gets us forward. I'm sorry for not having spotted this earlier.

I will be able to proceed with v3 on 11/23 or 11/24.
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c41
--- Comment #41 from Boris Derzhavets
> As I understand it, v2 still brought the machine down too early. I hope that v3 finally gets us forward. I'm sorry for not having spotted this earlier.

The kernel patched with V3 crashes at the same point as with V2: the attempt to format partitions on the image device. To get a serial log I need to move the box again.
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c42
--- Comment #42 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c43
--- Comment #43 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c44
--- Comment #44 from Boris Derzhavets
> Following the observation in https://bugzilla.novell.com/show_bug.cgi?id=551695#c10, did you ever try running a VM on file:/ rather than phy:/ (see also my similar question in #5)?

No, I didn't.
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c45
--- Comment #45 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c46
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c47
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c48
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c49
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c50
--- Comment #50 from Boris Derzhavets
http://bugzilla.novell.com/show_bug.cgi?id=553690
http://bugzilla.novell.com/show_bug.cgi?id=553690#c51
Jan Beulich