[Bug 559047] New: XEN system hangs with high I/O load
http://bugzilla.novell.com/show_bug.cgi?id=559047
http://bugzilla.novell.com/show_bug.cgi?id=559047#c0

Summary: XEN system hangs with high I/O load
Classification: openSUSE
Product: openSUSE 11.2
Version: Final
Platform: x86-64
OS/Version: openSUSE 11.2
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@novell.com
ReportedBy: g.w.kant@hunenet.nl
QAContact: qa@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.0.15) Gecko/2009102100 SUSE/3.0.15-0.1.2 Firefox/3.0.15

On two different systems I experienced system hangs when high I/O load occurs with at least one domU involved.

Here I describe the situation on a HP Proliant with 16GB RAM. openSUSE 11.2 x86_64 dom0 with one (PV) openSUSE 11.2 x86_64 domU. Network configuration uses a bridge br0 with the dom0 and the domU network interfaces added to the bridge.

When for example, copying a large file on the dom0 with scp from the domU to the dom0, the system dies. A log captured via a serial port is attached.

I will repeat the experiment on the other machine (Super Micro based PC) and provide that log file as well.

Reproducible: Always

Steps to Reproduce:
1. Boot system and login as root
2. xm start domu
3. scp domu:bigfile .

Actual Results:
System is not responding anymore

Expected Results:
A stable system

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c1
--- Comment #1 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c2
--- Comment #2 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c3
--- Comment #3 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c
Charles Arnold
http://bugzilla.novell.com/show_bug.cgi?id=559047#c4
--- Comment #4 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c5
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c6
--- Comment #6 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c7
Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c8
--- Comment #8 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c9
--- Comment #9 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c10
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c11
--- Comment #11 from Dion Kant
> How are your VMs' disks being backed?

Disks are defined via LVM:

disk=[ 'phy:/dev/sys/staging,xvda,w', ]
http://bugzilla.novell.com/show_bug.cgi?id=559047#c12
Dion Kant
> Also, we'll need either the vmlinux binary or at least the disassembly of unmap_single().

Can you give me a pointer on how to obtain the disassembly of unmap_single()?

vmlinux can be downloaded from: http://ftp.concero.nl/pub/kernel/vmlinux
http://bugzilla.novell.com/show_bug.cgi?id=559047#c13
--- Comment #13 from Dion Kant
(In reply to comment #10)
Also, we'll need either the vmlinux binary or at least the disassembly of unmap_single().
Can you give me a pointer on how to obtain the disassembly of unmap_single()?
found it:
Reading symbols from /usr/src/linux-2.6.31.5-0.1/vmlinux...done.
(gdb) disassemble unmap_single
Dump of assembler code for function unmap_single:
0xffffffff802446b0
vmlinux can be downloaded from: http://ftp.concero.nl/pub/kernel/vmlinux
http://bugzilla.novell.com/show_bug.cgi?id=559047#c14
--- Comment #14 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c15
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c16
Dion Kant
> scp (main direction of network traffic) from Dom0 to DomU

vmlinux can be downloaded from: http://ftp.concero.nl/pub/kernel/vmlinux-pv2

I was not able to trigger the bug while running the following on the domU over last night:

staging:~/usr # while true ; do tar xf ../usr.tgz ; done

Furthermore, an scp to Dom0 on the domU lasting more than two hours did not trigger it either. But I will test this a bit more.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c17
song james
http://bugzilla.novell.com/show_bug.cgi?id=559047#c18
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c19
--- Comment #19 from Dion Kant
> Dion, could you test this issue without LVM, using a file-backed disk instead, to see whether the bug is reproduced? For example:
> disk=[ 'file:/disk0,xvda,w', ]
> -James

Hello James and Jan,

With a file-backed disk, the bug has not occurred so far (running for several hours now). So it looks like LVM is a requirement to trigger the bug. Furthermore, I have not seen error messages from the cciss driver anymore while running with the file-backed domU.

Now I will use the fixed patch from Jan and go back to an LVM-backed domU.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c20
Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c21
--- Comment #21 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c22
--- Comment #22 from Dion Kant
> Don't build yet - the change just made rendered the condition useless. Will attach another one in a minute.

OK, I stopped the build.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c23
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c24
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c25
Dion Kant
> Created an attachment (id=332248) --> (http://bugzilla.novell.com/attachment.cgi?id=332248) [details]
> debugging patch (kernel, v4)
>
> Here's another version of the patch - it seems like the BUG_ON() in question still may have false positives. Hence I converted it to a warning now.

Are you able to reproduce the bug?

1. I think this is really a nasty one. The WARN_ON in do_unmap_single creates loads of logging. I ran it for a long time and the system kept on running. I think the bug no longer hits because of all the latency introduced by the logging (and the probing). (I did not save all this logging.)

2. I moved the WARN_ON out of the for loop:

    static void
    do_unmap_single(struct device *hwdev, char *dma_addr, size_t size, int dir)
    {
        unsigned long flags;
        int i, count, nslots = ALIGN(size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
        int index = (dma_addr - io_tlb_start) >> IO_TLB_SHIFT;
        phys_addr_t phys = io_tlb_orig_addr[index];

        for(i = 1, count = 0; i < nslots; ++i) {//temp
            if((phys + (i << IO_TLB_SHIFT)) != io_tlb_orig_addr[index + i]) {
                printk("dus[%d]: %Lx %Lx\n", index,
                       (unsigned long long)io_tlb_orig_addr[index + i],
                       (unsigned long long)phys + (i << IO_TLB_SHIFT));
                ++count;
            }
        }
        WARN_ON(count);

My hope was that this would reduce the logging significantly; however, this does not happen. Most of the times it iterates over the slots, it hits the printk once and logs the warning. See log in pv4-1.txt

3. I disabled the WARN_ON(count) in question, this time reducing the log by the Call Trace dumps. See log in pv4-2.txt

I think the introduced probing latency is still too large for the bug to occur.

(Please tell me if you are unhappy with the large attachments and if you want me to remove redundant parts.)
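For readers following the check above outside the kernel, here is a minimal user-space sketch of the same arithmetic. It is an illustration under stated assumptions, not the patch itself: the names slots_for_size(), count_discontiguous_slots() and ALIGN_UP() are hypothetical, the plain array stands in for io_tlb_orig_addr, and IO_TLB_SHIFT is assumed to be 11 (2 KiB slots).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 KiB bounce-buffer slots (IO_TLB_SHIFT == 11). */
#define IO_TLB_SHIFT 11

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

typedef uint64_t phys_addr_t;

/* Number of slots a mapping of `size` bytes occupies (hypothetical helper). */
static size_t slots_for_size(size_t size)
{
    return ALIGN_UP(size, (size_t)1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
}

/*
 * Count the slots after the first whose recorded original address is not
 * exactly i slots beyond the first slot's original address. This mirrors
 * the loop in the snippet above, with the warning issued once afterwards
 * by the caller instead of once per mismatch.
 */
static int count_discontiguous_slots(const phys_addr_t *orig_addr,
                                     size_t index, size_t size)
{
    assert(orig_addr != NULL);

    size_t nslots = slots_for_size(size);
    phys_addr_t phys = orig_addr[index];
    int count = 0;

    for (size_t i = 1; i < nslots; ++i) {
        if (phys + ((phys_addr_t)i << IO_TLB_SHIFT) != orig_addr[index + i])
            ++count;
    }
    return count;
}
```

With these assumptions, a 2049-byte mapping occupies two slots, and a mapping only passes the check when every following slot records an address exactly 2048 bytes past the previous one.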
http://bugzilla.novell.com/show_bug.cgi?id=559047#c26
--- Comment #26 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c27
--- Comment #27 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c28
--- Comment #28 from Dion Kant
http://bugzilla.novell.com/show_bug.cgi?id=559047#c29
--- Comment #29 from Jan Beulich
> Are you able to reproduce the bug?

I've personally not seen the bug, and I haven't determined whether it has been reproduced in our lab.

> 2. I moved the WARN_ON out of the for loop:

Of course it was meant to be that way.

> It happens that I have to convince myself that I am still able to trigger the bug. This bug first manifested itself on a (new) Supermicro machine with 4GB RAM when I was untarring a large tar file on a domU.

Without any network load? This would be the opposite of the result you posted in #16.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c30
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c31
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=559047#c32
--- Comment #32 from Dion Kant
>> Are you able to reproduce the bug?
> I've personally not seen the bug, and I haven't determined whether it has been reproduced in our lab.
>> 2. I moved the WARN_ON out of the for loop:
> Of course it was meant to be that way.
>> It happens that I have to convince myself that I am still able to trigger the bug. This bug first manifested itself on a (new) Supermicro machine with 4GB RAM when I was untarring a large tar file on a domU.
> Without any network load? This would be the opposite of the result you posted in #16.

I will try to narrow it down further. I think some network traffic is required to trigger it while doing domU disk access. However, I was not able to trigger it with a transfer of data from dom0 to domU:/dev/null. I think that the combination of disk access and network traffic is required.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c33
--- Comment #33 from Dion Kant
> Now having positive feedback from another tester, I'd definitely want to hear back from you.

No problem, I will start building and testing. I was ill for a couple of days. Too sick to touch my keyboard.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c34
Dion Kant
> Now having positive feedback from another tester, I'd definitely want to hear back from you.

Jan,

I set up the two machines (with 8 GB and 16 GB RAM) with patch v5 and tested them under severe I/O load for more than 17 hours. Except for the aep&p logging during startup, nothing additional has been logged. So far no instability has been observed.
http://bugzilla.novell.com/show_bug.cgi?id=559047#c35
Jan Beulich