[Bug 623303] New: Xen hypervisor silent death on heavy i/o load
http://bugzilla.novell.com/show_bug.cgi?id=623303

http://bugzilla.novell.com/show_bug.cgi?id=623303#c0

           Summary: Xen hypervisor silent death on heavy i/o load
    Classification: openSUSE
           Product: openSUSE 11.2
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 10.2
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Xen
        AssignedTo: jdouglas@novell.com
        ReportedBy: samuel.kvasnica@ims.co.at
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=376633)
 --> (http://bugzilla.novell.com/attachment.cgi?id=376633)
dmesg

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.6) Gecko/20100629 Mandriva Linux/1.9.2.6-0.1mdv2010.0 (2010.0) Firefox/3.6.6

I'm getting silent, nasty deadlocks on an openSUSE 11.2 Xen system using
kernel 2.6.31.13-22-xen and xen-devel-4.0.0_21091_05-70.1.

The hardware setup:
* Supermicro X8DTi-F
* Xeon E5530 @ 2.40GHz
* 1x Intel rs2bl080 MegaRaid (system partitions only)
* Intel 82576 Gigabit (normal LAN connection)
* Intel 82598EB 10-Gigabit AT CX4 (backup point-to-point connection used
  in this test)
* 2x LSI Fusion-MPT Dual Ultra320 SCSI
  + external EonStor RAID (used in this test)
  + external server to generate the data stream
  + good air conditioning in the server room

The test setup:
* on sender:
  socat -t 10 -d /dev/zero TCP4:172.16.3.1:3331,sndbuf=655350
* on receiver (tested xen system):
  socat -t 10 TCP4-LISTEN:3331,rcvbuf=655350 ./socat.bin
* xfs filesystem
* mtu=9000
* ixgbe 10g-eth is bridged

The experimental results so far:
- both heavy disk I/O AND heavy NIC I/O load seem to be necessary to
  crash; a simple 'dd' or 'iperf' alone is not sufficient. The crash
  usually happens after 20-50GB of data transferred.
- it happens when using 2-channel dm-multipath-rr (about 350MB/s) spread
  over 2 LSI cards, and even with a single channel w/o multipath (about
  250MB/s)
- the crash happens no matter whether testing under domU (over vscsi) or
  directly in dom0
- a test using the more recent kernel-xen-2.6.34.1-6.1 gives the same
  results; older xen 3.x was not tested so far (does it make sense ?)
- using kernel drivers or the latest vendor drivers for the Intel NICs or
  LSI controllers makes no difference
- disabling all the gro/lro/tso in ixgbe/ethtool makes no difference
- the system seems stable if the data stream is generated locally
- lower mtu increases stability
- streaming data using UDP seems to take somewhat longer (~200GB) to
  crash than using TCP (this might be statistical noise...)
- both vga and serial console give no oops output, no ping; simply a
  frozen, dead system without any warning in log/console
- the system runs stable under normal load for many weeks, no problems
  when limited to the speed of a single gigabit NIC
- when booting the non-xen default kernel, I'm able to stream many TB of
  data using the same setup without a crash (should be repeated for
  several days to be sure, but it does not feel like a hardware issue).
  The setup is really the same, even including the bridge setup. Even the
  performance (until crash) is the same, up to 350MB/s when using
  multipath

I'm attaching both dmesg and xen dmesg of the system.

Any expert hints how to debug this further ?

Reproducible: Always

Steps to Reproduce:
1. use netcat or socat to TCP-stream /dev/zero on one server into a file
   on another server at several 100MB/s
2. wait a bit (typically less than 50GB)
3.

-- 
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
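For readers without socat at hand, the reproduction pattern from the report (zeros streamed over TCP into a file on the receiver) can be sketched in Python. This is only an illustration under stated assumptions, not the reporter's tooling: the function names, the chunk size and the loopback demo are hypothetical, and only the 655350-byte buffer sizes are taken from the socat commands. The kernel may silently cap the requested buffer sizes. On real hardware the sender and receiver would run on separate hosts and transfer far more data.

```python
import os
import socket
import tempfile
import threading

CHUNK = 64 * 1024   # bytes per send/recv call (arbitrary choice for this sketch)
BUF_SIZE = 655350   # sndbuf/rcvbuf value used in the socat commands of the report


def receive_to_file(server_sock, path):
    """Accept one connection and write everything received to path."""
    conn, _ = server_sock.accept()
    with conn, open(path, "wb") as out:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            out.write(data)


def send_zeros(host, port, total_bytes):
    """Send total_bytes zero bytes, similar to streaming /dev/zero."""
    zeros = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(zeros[:n])
            sent += n
    return sent


def run_loopback_demo(total_bytes=1024 * 1024):
    """Stream total_bytes over 127.0.0.1 into a temp file; return the file size."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    receiver = threading.Thread(target=receive_to_file, args=(server, path))
    receiver.start()
    try:
        send_zeros("127.0.0.1", port, total_bytes)
        receiver.join()  # receiver returns once the sender has closed
        return os.path.getsize(path)
    finally:
        server.close()
        os.remove(path)


if __name__ == "__main__":
    print(run_loopback_demo())
```

The loopback demo only checks that the plumbing works. According to the report, a locally generated stream did not trigger the hang, so it is not expected to reproduce the problem by itself.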
http://bugzilla.novell.com/show_bug.cgi?id=623303#c1
--- Comment #1 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=623303#c3
--- Comment #3 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=623303#c4
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=623303#c5
--- Comment #5 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=623303#c6
Jan Beulich
> 1. why is xen so much more unstable with shared interrupts compared to
> the native kernel ? Subquestion: Are the observed MCEs just a xen-fake /
> virtual ?

In the presence of MCEs (as with any other hardware problem), instability is expected, and how bad it gets depends on too many factors to give proper reasons for the specific behavior. MCEs are never virtual (i.e. fake); Xen only forwards what the hardware reports to the affected VM (if a single one can be identified as the likely source of the problem).

> 2. what could be done for the openSUSE distribution so that all drivers
> for PCIe cards use MSI interrupts by default ? This would avoid quite
> many problems. Unfortunately, MSI is mostly either a non-default kernel
> module option or not implemented at all (I had to patch the megasas by
> hand)

That's not an openSUSE question, but one that needs to be brought to the kernel developer community.
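As a practical aside to the shared-interrupt and MSI questions above, a rough way to see which devices share a legacy IRQ line and which already use MSI is to parse /proc/interrupts. The Python sketch below is an illustration only: the sample text, its device names and the helper names are hypothetical, and the column layout assumed here is the one used by native kernels of that era. Xen kernels may label interrupt types differently, so the "MSI" substring check is an assumption.

```python
# Hypothetical sample in the layout of /proc/interrupts; not taken from the bug.
SAMPLE = """\
           CPU0       CPU1
 16:       1200        340   IO-APIC-fasteoi   ioc0, eth2
 17:         15          0   IO-APIC-fasteoi   ioc1
 64:      90211      88102   PCI-MSI-edge      eth0-TxRx-0
NMI:          0          0   Non-maskable interrupts
"""


def parse_interrupts(text):
    """Return a list of (irq, chip, devices) for the numbered IRQ lines."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row has one column per CPU
    entries = []
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].rstrip(":").isdigit():
            continue  # skip NMI, LOC and other non-numeric rows
        rest = fields[1 + ncpus:]  # drop the IRQ number and per-CPU counters
        if not rest:
            continue
        chip = rest[0]
        devices = [d for d in " ".join(rest[1:]).split(", ") if d]
        entries.append((int(fields[0].rstrip(":")), chip, devices))
    return entries


def shared_irqs(entries):
    """Map IRQ number to device list for lines with more than one device."""
    return {irq: devs for irq, _, devs in entries if len(devs) > 1}


def msi_devices(entries):
    """List devices whose interrupt type mentions MSI."""
    return [d for _, chip, devs in entries if "MSI" in chip for d in devs]


if __name__ == "__main__":
    parsed = parse_interrupts(SAMPLE)
    print("shared:", shared_irqs(parsed))
    print("msi:", msi_devices(parsed))
```

On a live system the text would come from reading /proc/interrupts instead of the embedded sample.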