New subject: [Bug 623303] Xen hypervisor silent death on heavy i/o load

18 Jul 2010

      http://bugzilla.novell.com/show_bug.cgi?id=623303

http://bugzilla.novell.com/show_bug.cgi?id=623303#c0

           Summary: Xen hypervisor silent death on heavy i/o load
    Classification: openSUSE
           Product: openSUSE 11.2
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 10.2
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Xen
        AssignedTo: jdouglas@novell.com
        ReportedBy: samuel.kvasnica@ims.co.at
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=376633)
 --> (http://bugzilla.novell.com/attachment.cgi?id=376633)
dmesg

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.6)
Gecko/20100629 Mandriva Linux/1.9.2.6-0.1mdv2010.0 (2010.0) Firefox/3.6.6

I'm getting nasty silent deadlocks on an openSUSE 11.2 Xen system using kernel
2.6.31.13-22-xen and xen-devel-4.0.0_21091_05-70.1.

The hardware setup:
* Supermicro X8DTi-F
* Xeon E5530  @ 2.40GHz
* 1x Intel rs2bl080 MegaRaid (system partitions only)
* Intel 82576 Gigabit (normal LAN connection)
* Intel 82598EB 10-Gigabit AT CX4 (backup point-to-point connection used in
this test)
* 2x LSI Fusion-MPT Dual Ultra320 SCSI + external EonStor RAID (used in this
test)
+ external server to generate the data stream
+ good air conditioning in the server room

The test setup:

* on sender:
socat -t 10 -d /dev/zero TCP4:172.16.3.1:3331,sndbuf=655350

* on receiver (tested xen system):
socat -t 10 TCP4-LISTEN:3331,rcvbuf=655350 ./socat.bin
xfs filesystem
mtu=9000
ixgbe 10g-eth is bridged 

The experimental results so far:

- both heavy disk I/O AND NIC I/O load seem to be necessary for a crash;
a simple 'dd' or 'iperf' is not sufficient. The crash usually happens after
20-50GB of data has been transferred.

- it happens when using 2 channel dm-multipath-rr (about 350MB/s) spread over 2
LSI cards or even a single channel w/o multipath (about 250MB/s)

- crash happens no matter if testing under domU (over vscsi) or directly in
dom0

- a test using more recent kernel-xen-2.6.34.1-6.1 gives the same results,
older xen 3.x was not tested so far (does it make sense ?)

- using kernel drivers or the latest vendor drivers for intel NICs or LSI
controllers makes no difference

- disabling all the gro/lro/tso in ixgbe/ethtool makes no difference

- the system seems to be stable when the data stream is generated locally

- lower mtu increases stability

- streaming data using UDP seems to take somewhat longer (~200GB) to crash than
using TCP (this might be statistical noise...)

- neither the VGA nor the serial console gives any oops output; there is no
ping response, simply a frozen, dead system without any warning in log/console

- system runs stable under normal load for many weeks, no problems when limited
to speed of single gigabit NIC

- when booting the non-xen default kernel, I'm able to stream many TB of data
using the same setup without a crash (this should be repeated for several days
to be sure, but it does not feel like a hardware issue). The setup is really
the same, even including the bridge setup. Even the performance (until the
crash) is the same, up to 350MB/s when using multipath.
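
As a rough aid for planning test runs, the figures above can be turned into an
approximate time-to-crash window. This is only a back-of-the-envelope sketch:
it assumes the transfer runs at the quoted rate for the whole run and that
1 GB = 1000 MB, neither of which is stated in the report. The helper name
seconds_to_transfer is made up for this example.

```python
def seconds_to_transfer(gigabytes, mb_per_s):
    """Time to move `gigabytes` at `mb_per_s`, assuming 1 GB = 1000 MB."""
    return gigabytes * 1000 / mb_per_s


# Earliest case: 20 GB at the multipath rate (about 350 MB/s).
fastest = seconds_to_transfer(20, 350)
# Latest case: 50 GB at the single-channel rate (about 250 MB/s).
slowest = seconds_to_transfer(50, 250)

print(f"{fastest:.0f} s to {slowest:.0f} s")  # prints "57 s to 200 s"
```

Under those assumptions a TCP run would be expected to freeze within roughly
one to three and a half minutes of sustained streaming.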

I'm attaching both dmesg and xen dmesg of the system.

Any expert hints on how to debug this further?

Reproducible: Always

Steps to Reproduce:
1. use netcat or socat to TCP-stream /dev/zero on one server into a file on
another server at several hundred MB/s
2. wait a bit (typically less than 50GB transferred)
3.
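
Where neither netcat nor socat is at hand, the streaming step could be
approximated with a small Python stand-in like the one below. This is a minimal
sketch, not the tool used in the tests above: the function names
(make_listener, receive_to_file, stream_zeros) and the 1 MiB chunk size are
assumptions made for illustration, and only the buffer size 655350 is taken
from the socat commands in the test setup. It has not been checked whether
this variant triggers the freeze.

```python
import socket

CHUNK = 1 << 20  # 1 MiB per send/recv; arbitrary choice for this sketch


def make_listener(host, port, rcvbuf=655350):
    """Create a listening TCP socket with the receive buffer size requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Set before listen() so accepted connections inherit the setting.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    sock.bind((host, port))
    sock.listen(1)
    return sock


def receive_to_file(listen_sock, path):
    """Accept one connection and write everything received to `path`."""
    conn, _ = listen_sock.accept()
    total = 0
    with conn, open(path, "wb") as out:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            out.write(data)
            total += len(data)
    return total


def stream_zeros(host, port, nbytes, sndbuf=655350):
    """Send `nbytes` zero bytes to host:port (stand-in for /dev/zero)."""
    zeros = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            sock.sendall(zeros[:n])
            sent += n
    return sent
```

On the receiver one would call receive_to_file(make_listener("0.0.0.0", 3331),
"./socat.bin"), and on the sender stream_zeros("172.16.3.1", 3331, nbytes) with
nbytes well above 50GB, mirroring the address, port and file name from the test
setup.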

-- 
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
