[Bug 826374] New: Xen domU crashes with high network load
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c0

Summary: Xen domU crashes with high network load
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 12.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: g.w.kant@hunenet.nl
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=545304) --> (http://bugzilla.novell.com/attachment.cgi?id=545304)
Kernel oops captured in syslog of crashed domU

User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:21.0) Gecko/20100101 Firefox/21.0

I observed domU crashes under high network load on two different machines. With one dom0 running two domUs, logging in on domU_A and running "scp large_file domU_B:" crashes domU_B within a short time. Transferring data from a remote host over 100 MbE does not trigger this bug in reasonable time. However, transferring data from a remote host over 1 GbE does trigger it within roughly 30 seconds.

Reproducible: Always

Steps to Reproduce:
1. Start the domU.
2. scp a large file to the domU, e.g. from another domU on the same dom0 or from a remote host with at least 1 GbE connectivity.
3. Wait until the domU crashes or hangs.

Actual Results:
I have observed two error states:
1. The domU seems to be unresponsive.
2. The domU is rebooted (I have on_crash="restart" in the config file).

In both cases a kernel oops similar to the attachment is logged on the dom0 console. The trace starts with lines similar to:

BUG: unable to handle kernel paging request at 000005a8000000bb
IP: [<ffffffff8028e6c6>] memcpy+0x6/0x110
PGD 0
Oops: 0002 [#1] SMP

Expected Results:
A stable system

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c1
--- Comment #1 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c2
--- Comment #2 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c3
--- Comment #3 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c4
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c5
--- Comment #5 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c6
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c7
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c8
Jan Beulich
I don't think the probes are triggered. Please let me know if more info is required from the crash dump.
"probes"? In any event, a crash log on its own for a kernel that you built yourself is useless for analysis - I'd need xennet.ko from that build or at least the resulting source file (so I can tell what is at line 1302, which is at least very close to the patched location), ideally both.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c9
--- Comment #9 from Dion Kant
(In reply to comment #7)
I don't think the probes are triggered. Please let me know if more info is required from the crash dump.
"probes"?
I am referring to the added BUG_ON defs. I'll add the object + source.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c10
--- Comment #10 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c11
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c12
--- Comment #12 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c13
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c14
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c15
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c16
--- Comment #16 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c17
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c18
--- Comment #18 from Dion Kant
"After that scp dies" certainly needs a few more explaining words: In what way does it die? Were there any kernel messages subsequent to the two WARN_ON()s triggering?
"After that scp dies" is indeed incorrectly phrased. At some point, the scp client stalls. I think the scp client then receives a TCP reset from the server, and as a result the file is not transferred completely. So it looks like the socket on the server is terminated. The log on the server doesn't show much about this. I'll obtain more precise information from a follow-up experiment and I'll check whether increased logging from the ssh server helps here.

Furthermore, I found that the kernel can still panic. In a panicking trial, this happened after the WARN in skb_try_coalesce. But let's follow your suggestion and focus first on the values from bogus SKBs. So far we have

delta = 0x1000
len = 0xfe88
->truesize = 0x1500
->data_len = 0xfe88

So, from len == from->data_len it follows that the else branch has been taken:

skb_headlen(from) = 0
SKB_TRUESIZE(skb_end_offset(from)) = from->truesize - delta = 0x500

I have tested only a few trials, but these values seem to reproduce.

Argh, in the meantime I crashed the dom0 because of another nasty bug in the raid1(0) layer which is still there in the latest 12.3 kernel! You don't want to know..., but I report one bug at a time. It may be interesting to know, however, that on machines which did hit this bug (https://bugzilla.novell.com/show_bug.cgi?id=813889) the guest network issue appears as well. I must now wait until the machine has been reset.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c19
--- Comment #19 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c20
--- Comment #20 from Dion Kant
Hmm, so is there any chance you could use an older kernel (ideally pre-3.3, with second best being the 12.2 one [i.e. 3.4-based]) in your Dom0?
If you think it helps I can set this up, next week. Do you have a URL or so to the kernel you prefer?
That's because the issues described e.g. in bug 814211 require netback changes that continue to be stuck in a queue waiting to be tested.
Is it an idea that I start testing with those netback changes?
As to the above analysis of yours: from->len == from->data_len also implies that the SKB didn't directly come from netfront, because netfront always pulls RX_COPY_THRESHOLD bytes (unless the whole packet is smaller). The above debugging patch is intended to help prove this.
Hope I can find a time slot today to do a trial with your patch.
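The arithmetic behind the analysis quoted above can be replayed in a few lines. This is only a sketch in Python, not driver code: the four hex values are the ones reported in comment #18, while `RX_COPY_THRESHOLD = 256` and the helper `headlen_after_pull` are assumptions made for illustration (the thread names the constant but does not state its value).

```python
# Values reported in comment #18 (hex, as logged by the debugging patch).
delta = 0x1000
skb_len = 0xFE88      # from->len
data_len = 0xFE88     # from->data_len
truesize = 0x1500     # from->truesize

# skb_headlen() is len - data_len: the bytes held in the SKB's linear area.
headlen = skb_len - data_len

# Comment #18 derives SKB_TRUESIZE(skb_end_offset(from)) as truesize - delta.
linear_truesize = truesize - delta

# Assumed threshold, for illustration only; the thread does not give its value.
RX_COPY_THRESHOLD = 256


def headlen_after_pull(packet_len, threshold=RX_COPY_THRESHOLD):
    """Linear bytes left after pulling min(packet_len, threshold) bytes,
    which is the behaviour described for netfront in the comment above."""
    return min(packet_len, threshold)


expected_headlen = headlen_after_pull(skb_len)

print("observed headlen:", hex(headlen))
print("linear truesize:", hex(linear_truesize))
print("headlen if pulled by netfront:", expected_headlen)
print("truesize covers len:", truesize >= skb_len)
```

Under these assumptions the observed SKB has a head length of 0, whereas a packet pulled as described would have a non-zero head length. That is the point of the remark that the SKB cannot have come directly from netfront. The last line also shows truesize being smaller than len for the reported values.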
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c21
--- Comment #21 from Jan Beulich
If you think it helps I can set this up, next week. Do you have a URL or so to the kernel you prefer?
If you ask, I'd prefer you to use a recent SLE kernel (SP2 or SP3) - see ftp://ftp.suse.com/pub/projects/kernel/kotd/.
Is it an idea that I start testing with those netback changes?
As said, they're not in any public repo yet, and the changes are not in the form of isolated patches. I'll see if I can have the untested build uploaded somewhere, if you're adventurous.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c22
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c23
--- Comment #23 from Jan Beulich
I'll do a second trial on a "virgin" kernel, by which I mean the original kernel without traces of previous patches.
No need to do that, the above already rings a bell (albeit at a first glance this is a separate bug).
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c24
--- Comment #24 from Jan Beulich
So, in netif_poll we already have the problem with truesize being smaller than len
->data_len=0xfe88 ->len=0xfebc ->truesize=0x1500 ->nr_frags=1
This is supposedly impossible: xennet_get_responses() checks rx->status to be non-negative, and the sum of it and rx->offset to be at most PAGE_SIZE. Consequently xennet_fill_frags() can bump skb->len and skb->data_len by at most PAGE_SIZE at a time. With there just being a single frag, neither ->len nor ->data_len can end up being this huge. Hence something else must be coming into the picture here, and it's now becoming imperative that we see the behavior with a known good backend.

One addition to the above debugging patch might be to check in xennet_fill_frags() that after updating nr_frags, ->data_len is no bigger than nr_frags * PAGE_SIZE; if it is, dump ->len, ->data_len, nr_frags, and rx->status.

But as said, this ought to be a different bug, possibly even outside of netfront. Hence it is perhaps worth opening a new bugzilla entry for this.
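The two conditions described in this comment can be modelled outside the kernel to see why the reported values look impossible. The following is a hedged Python sketch, not the actual debugging patch: `PAGE_SIZE = 4096` assumes 4 KiB pages, and the function names `response_ok`, `frags_consistent` and `min_frags` are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on x86-64


def response_ok(status, offset):
    """Model of the check described for xennet_get_responses():
    status is non-negative and offset + status fits in one page."""
    return status >= 0 and offset + status <= PAGE_SIZE


def frags_consistent(data_len, nr_frags):
    """Model of the suggested extra check for xennet_fill_frags():
    each frag adds at most PAGE_SIZE bytes to data_len."""
    return data_len <= nr_frags * PAGE_SIZE


def min_frags(data_len):
    """Smallest number of page-sized frags that could hold data_len bytes."""
    return -(-data_len // PAGE_SIZE)  # ceiling division


# Values reported from netif_poll in the comment quoted above.
data_len = 0xFE88
nr_frags = 1

print("consistent:", frags_consistent(data_len, nr_frags))
print("frags needed at minimum:", min_frags(data_len))
```

With one frag the check fails, and under the 4 KiB assumption at least 16 frags would be needed to account for a data_len of 0xfe88. This matches the statement that a single frag cannot produce a length this large.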
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c25
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c26
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c27
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c28
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c29
Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c30
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c31
--- Comment #31 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c32
--- Comment #32 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c33
--- Comment #33 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c34
--- Comment #34 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c35
--- Comment #35 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c36
--- Comment #36 from Dion Kant
Just tested your patch: none of the added printks triggered, and I got a kernel panic when trying to upload a file.
Are you then 100% sure that your own compiled xennet.ko is loaded?
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c37
--- Comment #37 from Dion Kant
Is there any binary kernel build to work around this issue? I don't want to downgrade my 12.3 setup.
You don't need to downgrade 12.3. I installed a kernel from 12.2 on the 12.3 guests.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c38
--- Comment #38 from Dion Kant
I've rebuilt xennet.ko from the 3.7.10-1.16 sources with the tentative fix (v4) applied.
The kernel oops seems gone, but there is still an endless flood in /var/log/messages:
2013-07-20T13:56:25.539746+09:00 testguest kernel: [ 248.328448] tty_release: ptm0: read/write wait queue active!
2013-07-20T13:56:25.539747+09:00 testguest kernel: [ 248.328452] tty_release: ptm0: read/write wait queue active!
2013-07-20T13:56:25.539748+09:00 testguest kernel: [ 248.328454] tty_release: ptm0: read/write wait queue active!
This happens when I try to upload a file to a PHP/Apache site.
This comes from drivers/tty/tty_io.c. Maybe there is something with your tty setup.
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c39
--- Comment #39 from Vitaliy Tomin
Are you then 100% sure that your own compiled xennet.ko is loaded?
Absolutely, checked with modinfo xennet.
This comes from drivers/tty/tty_io.c. Maybe there is something with your tty setup.
I have no special setup for tty. Xen config for this host:

name="suse-web"
description="None"
uuid="94d68132-7528-6e9f-f4c7-de5443e3576a"
memory=512
maxmem=512
vcpus=1
on_poweroff="destroy"
on_reboot="restart"
on_crash="destroy"
localtime=0
builder="linux"
bootloader="/usr/bin/pygrub"
bootargs=""
disk=[ 'phy:/dev/md8,xvda,w', 'file:/root/download/openSUSE-12.3-NET-x86_64.iso,xvdb:cdrom,r', ]
vif=[ 'mac=00:16:3e:7e:00:34,bridge=br52', 'mac=00:16:03:01:5d:35,bridge=br53', ]
nographic=1
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c40
--- Comment #40 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c41
--- Comment #41 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c42
--- Comment #42 from Dion Kant
Created an attachment (id=548746) --> (http://bugzilla.novell.com/attachment.cgi?id=548746) netfront analysis continued v2 log
This was my bad: I hadn't updated xennet.ko in the initrd. After rebuilding the initrd I get all these printks. No panics for a while; I will try more network load.
Now it may be that all this extra logging will mask the problem. But first go back to the original patch from Jan and make sure that the resulting xennet from that one gets loaded as well. I'll add an attachment that adds IPRINTK("Initialising patched virtual ethernet driver.\n"); to Jan's v4 fix so people can easily check whether the patched one ended up in their running kernel.
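One way to act on the marker message mentioned above is to search the guest's kernel log for it. The snippet below is a minimal Python sketch under stated assumptions: the marker string is the one from the IPRINTK() call in the comment, the helper name `patched_driver_loaded` is hypothetical, and the sample log line (timestamp and prefix included) is invented for the example.

```python
# Marker text taken from the IPRINTK() call quoted in the comment above.
MARKER = "Initialising patched virtual ethernet driver."


def patched_driver_loaded(kernel_log):
    """Return True if any line of the given kernel log text contains the marker."""
    return any(MARKER in line for line in kernel_log.splitlines())


# Hypothetical log excerpt; the timestamp and prefix are invented for the example.
sample_log = (
    "[    1.000000] example: some earlier message\n"
    "[    1.234567] netfront: Initialising patched virtual ethernet driver.\n"
)

print(patched_driver_loaded(sample_log))
```

In practice the text passed in would be the output of dmesg or the relevant part of the syslog captured after booting the guest. If the marker is missing, an old xennet.ko in the initrd is one possible cause, as happened earlier in this thread.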
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c43
--- Comment #43 from Dion Kant
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c44
--- Comment #44 from Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c45
--- Comment #45 from Bill Merriam
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c46
--- Comment #46 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c47
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c48
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c
Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=826374
https://bugzilla.novell.com/show_bug.cgi?id=826374#c49
--- Comment #49 from Swamp Workflow Management
http://bugzilla.novell.com/show_bug.cgi?id=826374
Swamp Workflow Management