[Bug 919154] New: kernel 3.18.6-1.gec2a744-xen > 3.19.0-1.g8a7d5f9-xen (Kernel:Stable) upgrade renders DomU guest unbootable; trace provided
http://bugzilla.suse.com/show_bug.cgi?id=919154

Bug ID: 919154
Summary: kernel 3.18.6-1.gec2a744-xen > 3.19.0-1.g8a7d5f9-xen (Kernel:Stable) upgrade renders DomU guest unbootable; trace provided
Classification: openSUSE
Product: openSUSE Distribution
Version: 13.2
Hardware: x86-64
OS: openSUSE 13.2
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: grantksupport@operamail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

REF: http://lists.opensuse.org/opensuse-kernel/2015-02/msg00017.html

Upgrading a running Xen DOMU

lsb_release -rd
Description: openSUSE 13.2 (Harlequin) (x86_64)
Release: 13.2

uname -a
Linux guest03 3.18.6-1.gec2a744-xen #1 SMP Fri Feb 6 21:35:46 UTC 2015 (ec2a744) x86_64 x86_64 x86_64 GNU/Linux

rpm -qa | grep -i ^xen
xen-4.5.0_02-346.1.x86_64
xen-libs-4.5.0_02-346.1.x86_64
xen-tools-4.5.0_02-346.1.x86_64
xen-devel-4.5.0_02-346.1.x86_64

from kernel-xen 3.18.6-1.gec2a744-xen > 3.19.0-1.g8a7d5f9-xen using Kernel:Stable, after mkinitrd reboot the guest is unbootable.

Booting to a 'rescue' kernel

uname -a
Linux guest03 3.18.5-x86_64-linode52 #1 SMP Thu Feb 5 12:18:36 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

checking last FAILED boot, displays a trace

journalctl -b -1
-- Logs begin at Sun 2014-11-16 14:45:10 PST, end at Thu 2015-02-12 14:25:01 PST. --
Feb 12 14:03:44 guest03 systemd-journal[110]: Runtime journal is using 8.0M (max allowed 205.1M, trying to leave 307.7M free of 1.9G available → current limit 205.1M).
Feb 12 14:03:44 guest03 systemd-journal[110]: Runtime journal is using 8.0M (max allowed 205.1M, trying to leave 307.7M free of 1.9G available → current limit 205.1M).
Feb 12 14:03:44 guest03 kernel: BRK [0x00c34000, 0x00c34fff] PUD Feb 12 14:03:44 guest03 kernel: BRK [0x00c35000, 0x00c35fff] PMD Feb 12 14:03:44 guest03 kernel: BRK [0x00c36000, 0x00c45fff] PTE Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys cpuset Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys cpu Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys cpuacct Feb 12 14:03:44 guest03 kernel: Linux version 3.19.0-1.g8a7d5f9-xen (geeko@buildhost) (gcc version 4.8.3 20141208 [gcc-4_8-branch revision 218481] (SUSE Linux) ) #1 SMP Wed Feb 11 08:59:56 UTC 2015 (8a7d5f9) Feb 12 14:03:44 guest03 kernel: Command line: root=/dev/xvdc noresume xencons=hvc0 kbdtype=us text nofb selinux=0 apparmor=0 edd=off noshell showopts splash=verbose systemd.log_level=error systemd.log_target=syslog-or-kmsg net.ifnames=0 Feb 12 14:03:44 guest03 kernel: e820: Xen-provided physical RAM map: Feb 12 14:03:44 guest03 kernel: Xen: [mem 0x0000000000000000-0x00000001007fffff] usable Feb 12 14:03:44 guest03 kernel: NX (Execute Disable) protection: active Feb 12 14:03:44 guest03 kernel: e820: last_pfn = 0x100800 max_arch_pfn = 0x80000000 Feb 12 14:03:44 guest03 kernel: e820: last_pfn = 0x100000 max_arch_pfn = 0x80000000 Feb 12 14:03:44 guest03 kernel: init_memory_mapping: [mem 0xffe00000-0xffffffff] Feb 12 14:03:44 guest03 kernel: [mem 0xffe00000-0xffffffff] page 4k Feb 12 14:03:44 guest03 kernel: BRK [0x00c46000, 0x00c46fff] PGTABLE Feb 12 14:03:44 guest03 kernel: BRK [0x00c47000, 0x00c47fff] PGTABLE Feb 12 14:03:44 guest03 kernel: init_memory_mapping: [mem 0xe0000000-0xffdfffff] Feb 12 14:03:44 guest03 kernel: [mem 0xe0000000-0xffdfffff] page 4k Feb 12 14:03:44 guest03 kernel: BRK [0x00c48000, 0x00c48fff] PGTABLE Feb 12 14:03:44 guest03 kernel: BRK [0x00c49000, 0x00c49fff] PGTABLE Feb 12 14:03:44 guest03 kernel: BRK [0x00c4a000, 0x00c4afff] PGTABLE Feb 12 14:03:44 guest03 kernel: BRK [0x00c4b000, 0x00c4bfff] PGTABLE Feb 12 14:03:44 guest03 kernel: init_memory_mapping: 
[mem 0x00000000-0xdfffffff] Feb 12 14:03:44 guest03 kernel: [mem 0x00000000-0xdfffffff] page 4k Feb 12 14:03:44 guest03 kernel: init_memory_mapping: [mem 0x100000000-0x1007fffff] Feb 12 14:03:44 guest03 kernel: [mem 0x100000000-0x1007fffff] page 4k Feb 12 14:03:44 guest03 kernel: RAMDISK: [mem 0x0105d000-0x01687fff] Feb 12 14:03:44 guest03 kernel: ACPI in unprivileged domain disabled Feb 12 14:03:44 guest03 kernel: Zone ranges: Feb 12 14:03:44 guest03 kernel: DMA [mem 0x00000000-0x00ffffff] Feb 12 14:03:44 guest03 kernel: DMA32 [mem 0x01000000-0xffffffff] Feb 12 14:03:44 guest03 kernel: Normal [mem 0x100000000-0x1007fffff] Feb 12 14:03:44 guest03 kernel: Movable zone start for each node Feb 12 14:03:44 guest03 kernel: Early memory node ranges Feb 12 14:03:44 guest03 kernel: node 0: [mem 0x00000000-0x1007fffff] Feb 12 14:03:44 guest03 kernel: Initmem setup node 0 [mem 0x00000000-0x1007fffff] Feb 12 14:03:44 guest03 kernel: On node 0 totalpages: 1050624 Feb 12 14:03:44 guest03 kernel: free_area_init_node: node 0, pgdat ffffffff80a9b440, node_mem_map ffff8800fb7ee000 Feb 12 14:03:44 guest03 kernel: DMA zone: 64 pages used for memmap Feb 12 14:03:44 guest03 kernel: DMA zone: 0 pages reserved Feb 12 14:03:44 guest03 kernel: DMA zone: 4096 pages, LIFO batch:0 Feb 12 14:03:44 guest03 kernel: DMA32 zone: 16320 pages used for memmap Feb 12 14:03:44 guest03 kernel: DMA32 zone: 1044480 pages, LIFO batch:31 Feb 12 14:03:44 guest03 kernel: Normal zone: 32 pages used for memmap Feb 12 14:03:44 guest03 kernel: Normal zone: 2048 pages, LIFO batch:0 Feb 12 14:03:44 guest03 kernel: setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:1 Feb 12 14:03:44 guest03 kernel: PERCPU: Embedded 22 pages/cpu @ffff8800fac00000 s51328 r8192 d30592 u524288 Feb 12 14:03:44 guest03 kernel: pcpu-alloc: s51328 r8192 d30592 u524288 alloc=1*2097152 Feb 12 14:03:44 guest03 kernel: pcpu-alloc: [0] 0 1 2 3 Feb 12 14:03:44 guest03 kernel: Swapping MFNs for PFN ac6 and fac08 (MFN 204bf93 
and 27b7688) Feb 12 14:03:44 guest03 kernel: Built 1 zonelists in Zone order, mobility grouping on. Total pages: 1034208 Feb 12 14:03:44 guest03 kernel: Kernel command line: root=/dev/xvdc noresume xencons=hvc0 kbdtype=us text nofb selinux=0 apparmor=0 edd=off noshell showopts splash=verbose systemd.log_level=error systemd.log_target=syslog-or-kmsg net.ifnam Feb 12 14:03:44 guest03 kernel: PID hash table entries: 4096 (order: 3, 32768 bytes) Feb 12 14:03:44 guest03 kernel: Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Feb 12 14:03:44 guest03 kernel: Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes) Feb 12 14:03:44 guest03 kernel: Software IO TLB disabled Feb 12 14:03:44 guest03 kernel: Memory: 4086700K/4202496K available (5839K kernel code, 776K rwdata, 4300K rodata, 716K init, 748K bss, 115796K reserved, 0K cma-reserved) Feb 12 14:03:44 guest03 kernel: Hierarchical RCU implementation. Feb 12 14:03:44 guest03 kernel: RCU dyntick-idle grace-period acceleration is enabled. Feb 12 14:03:44 guest03 kernel: RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=4. Feb 12 14:03:44 guest03 kernel: RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 Feb 12 14:03:44 guest03 kernel: nr_pirqs: 256 Feb 12 14:03:44 guest03 kernel: NR_IRQS:67328 nr_irqs:2624 16 Feb 12 14:03:44 guest03 kernel: Offload RCU callbacks from all CPUs Feb 12 14:03:44 guest03 kernel: Offload RCU callbacks from CPUs: 0-3. Feb 12 14:03:44 guest03 kernel: Xen reported: 2800.056 MHz processor. Feb 12 14:03:44 guest03 kernel: Console: colour dummy device 80x25 Feb 12 14:03:44 guest03 kernel: console [tty0] enabled Feb 12 14:03:44 guest03 kernel: console [hvc0] enabled Feb 12 14:03:44 guest03 kernel: Calibrating delay using timer specific routine.. 
5690.40 BogoMIPS (lpj=11380811) Feb 12 14:03:44 guest03 kernel: pid_max: default: 32768 minimum: 301 Feb 12 14:03:44 guest03 kernel: Security Framework initialized Feb 12 14:03:44 guest03 kernel: AppArmor: AppArmor disabled by boot time parameter Feb 12 14:03:44 guest03 kernel: Mount-cache hash table entries: 8192 (order: 4, 65536 bytes) Feb 12 14:03:44 guest03 kernel: Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes) Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys memory Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys devices Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys freezer Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys net_cls Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys blkio Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys perf_event Feb 12 14:03:44 guest03 kernel: Initializing cgroup subsys net_prio Feb 12 14:03:44 guest03 kernel: Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8 Last level dTLB entries: 4KB 512, 2MB 0, 4MB 0, 1GB 4 Feb 12 14:03:44 guest03 kernel: ftrace: allocating 21933 entries in 86 pages Feb 12 14:03:44 guest03 kernel: ------------[ cut here ]------------ Feb 12 14:03:44 guest03 kernel: WARNING: CPU: 0 PID: 0 at ../kernel/trace/ftrace.c:1939 ftrace_bug+0x2a6/0x350() Feb 12 14:03:44 guest03 kernel: Modules linked in: Feb 12 14:03:44 guest03 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0-1.g8a7d5f9-xen #1 Feb 12 14:03:44 guest03 kernel: Hardware name: Xen 4.1.6.1 PV guest Feb 12 14:03:44 guest03 kernel: 0000000000000000 ffffffff80019715 ffffffff805b9175 0000000000000000 Feb 12 14:03:44 guest03 kernel: ffffffff800488e1 0000000000000000 ffffffff80012000 ffff8800fa540000 Feb 12 14:03:44 guest03 kernel: 0000000000000000 ffff8800fa540000 ffffffff800eb6a6 ffffffff80b2ad20 Feb 12 14:03:44 guest03 kernel: Call Trace: Feb 12 14:03:44 guest03 kernel: [<ffffffff800171ea>] dump_trace+0x7a/0x1f0 Feb 12 14:03:44 guest03 kernel: [<ffffffff800173f3>] 
show_stack_log_lvl+0x93/0x170 Feb 12 14:03:44 guest03 kernel: [<ffffffff80019731>] show_stack+0x21/0x50 Feb 12 14:03:44 guest03 kernel: [<ffffffff805b9175>] dump_stack+0x40/0x50 Feb 12 14:03:44 guest03 kernel: [<ffffffff800488e1>] warn_slowpath_common+0x81/0xb0 Feb 12 14:03:44 guest03 kernel: [<ffffffff800eb6a6>] ftrace_bug+0x2a6/0x350 Feb 12 14:03:44 guest03 kernel: [<ffffffff800ebbc4>] ftrace_process_locs+0x3b4/0x690 Feb 12 14:03:44 guest03 kernel: [<ffffffff80adc352>] ftrace_init+0xb3/0x156 Feb 12 14:03:44 guest03 kernel: [<ffffffff80acbca1>] start_kernel+0x491/0x4a1 Feb 12 14:03:44 guest03 kernel: ---[ end trace 0edfd0b6a21185dd ]--- Feb 12 14:03:44 guest03 kernel: ftrace faulted on writing [<ffffffff80012000>] run_init_process+0x0/0x30 Feb 12 14:03:44 guest03 kernel: ftrace record flags: 0 Feb 12 14:03:44 guest03 kernel: (0) expected tramp: ffffffff805c08b0 Feb 12 14:03:44 guest03 kernel: Brought up 1 CPUs Feb 12 14:03:44 guest03 kernel: devtmpfs: initialized Feb 12 14:03:44 guest03 kernel: pinctrl core: initialized pinctrl subsystem Feb 12 14:03:44 guest03 kernel: RTC time: 165:165:165, date: 165/165/65 Feb 12 14:03:44 guest03 kernel: NET: Registered protocol family 16 Feb 12 14:03:44 guest03 kernel: SMP alternatives: switching to SMP code Feb 12 14:03:44 guest03 kernel: Brought up 4 CPUs Feb 12 14:03:44 guest03 kernel: PCI: Fatal: No config space access function found Feb 12 14:03:44 guest03 kernel: PCI: setting up Xen PCI frontend stub Feb 12 14:03:44 guest03 kernel: ACPI: Interpreter disabled. Feb 12 14:03:44 guest03 kernel: vgaarb: loaded Feb 12 14:03:44 guest03 kernel: suspend: event channel 15 Feb 12 14:03:44 guest03 kernel: xen_mem: Initialising balloon driver. Feb 12 14:03:44 guest03 kernel: Unable to read sysrq code in control/sysrq ... dropping back to uname -a Linux guest03 3.18.5-x86_64-linode52 #1 SMP Thu Feb 5 12:18:36 EST 2015 x86_64 x86_64 x86_64 GNU/Linux fixes the problem as well. 
-- You are receiving this mail because: You are on the CC list for the bug.
grant k
grant k
Charles Arnold
Jan Beulich
grant k
Please attach (don't inline) the full log such that we have an indication as to what is going wrong, or at the very least where guest booting actually stops.
as requested
--- Comment #3 from Jan Beulich
--- Comment #4 from grant k
So the only failures I seem to find are NFS related. Nothing Xen specific. Did you check that the guest's networking in general is okay?
Booting to kernel 3.18, the guest boots without error. Networking is fine -- both inbound and outbound. NFS is fully functional. This has been a working DomU for a long time. Booting to kernel 3.19, the guest is no longer accessible via network. I can't get to it even via the xen console; I can't test any function. My only option is to reboot to a different working kernel.
grant k
Jan Beulich
--- Comment #7 from grant k
Any diagnostic information on the state of networking in the guest would help, including frontend and backend state (in xenstore).
Hm. Not clear how to get that when I don't own the Dom0 (@Linode). I can reproduce this on a local Xen server+guest, and see what I can grab.
Without that I'm afraid a general kernel/NFS person may need to look at this first, as so far there's no indication of any Xen specific issue here.
FYI: NFS on kernel-xen >= 3.19 in the guest does not work, as you've apparently surmised from the submitted log. But NFS on the non-Xen kernel >= 3.19 works just fine, and so does NFS on kernel-xen >= 3.19 in Dom0. It's just a problem in the guest.
--- Comment #8 from grant k
grant k
a general kernel/NFS person may need to look at this first
getting the NFS folks in the loop
Neil Brown
Booting to kernel 3.19, the guest is no longer accessible via network.
seems to confirm that. Has anything changed in the "Virtual ethernet driver" between 3.18 and 3.19??
Jan Beulich
--- Comment #12 from Jan Beulich
Booting to kernel 3.19, the guest is no longer accessible via network.
seems to confirm that. Has anything changed in the "Virtual ethernet driver" between 3.18 and 3.19??
Nothing at all (also not in the backend, for the avoidance of doubt). There was a change (only indirectly related to 3.19) in how stats are being collected (or more precisely the locking involved there), but any problem introduced there should manifest quite obviously, e.g. in soft lockup messages. And that change was also already carried through to our SLE12 kernel and hasn't caused any known problems there so far.
--- Comment #13 from grant k
Re #9: This is still only an (incomplete) DomU log. Please provide _all_ relevant logs, i.e. hypervisor, Dom0 kernel, and DomU kernel. Complete and (as requested before) as attachments rather than inlined.
those are complete dmesg and xenconsole logs. which specific logfiles, and how collected, do you need?
grant k
Jan Beulich
grant k
It's not clear to me why you set this to needinfo to me: The absence of the requested Dom0 kernel and hypervisor logs is quite obvious. And while it's simple here ("dmesg" and "xl dmesg" in Dom0 are all you should need), I also don't think bugzilla is the right forum for explaining how to obtain requested data - please use available information/resources on the internet to learn what it takes.
You asked a question, it wasn't clear, so I asked. When you failed to answer, I flagged it. Nobody's twisting your arm. If clarifying your own requests is too much of a hassle, consider different employment. Good luck figuring out your broken software on your own. When it boils up in your commercial product, I'm sure your attitude will help.
James Fehlig
James Fehlig
From xenstore perspective, vif is connected and online (state 4).
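For readers unfamiliar with the number: the "state" node that xenstore keeps for each frontend and backend device holds an integer from the XenbusState enumeration in Xen's public io/xenbus.h header, where 4 means Connected. A minimal Python sketch of that mapping follows; the helper name describe_state is made up for illustration and is not part of any Xen tool.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    """Device states as enumerated in Xen's public io/xenbus.h header."""
    UNKNOWN = 0
    INITIALISING = 1
    INIT_WAIT = 2
    INITIALISED = 3
    CONNECTED = 4
    CLOSING = 5
    CLOSED = 6
    RECONFIGURING = 7
    RECONFIGURED = 8


def describe_state(raw):
    """Turn the string read from a xenstore 'state' node into a readable name.

    Hypothetical helper: 'raw' is whatever text was read from the node,
    for example "4".
    """
    try:
        return XenbusState(int(raw.strip())).name
    except ValueError:
        # Covers both non-numeric text and numbers outside the enumeration.
        return "unrecognised ({!r})".format(raw)


print(describe_state("4"))
```

A vif can therefore report state 4 on both ends while traffic is still stalled, which is why the xenstore view alone did not narrow the problem down.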
--- Comment #18 from James Fehlig
James Fehlig
After rebooting the domU several times, the system is in a state where the domU will no longer boot. It stops at "A start job is running for /sysroot".
Perhaps just a red herring. The filesystem may have had some corruption. After running virt-rescue on the image, the domU boots again but still encounters the network problems.
Jan Beulich
Created attachment 626650 [details] hypervisor, dom0, and domU dmesg
I was able to reproduce this bug on an openSUSE 13.2 host with an openSUSE 13.2 PV domU. Attached are the hypervisor (xl dmesg), dom0, and domU dmesg logs.
And sadly there's absolutely nothing relevant visible in any of the three logs (the IRQ 18 disabling is very certainly unrelated, albeit that may also need looking at).
Thomas Blume
--- Comment #21 from lynda t
--- Comment #22 from lynda t
lynda t
--- Comment #24 from Jan Beulich
--- Comment #25 from Thomas Blume
Created attachment 635405 [details] tentative fix
I identified upstream commit 6a6dc08ff6 as lacking a counterpart in xen3-patch-3.19. The (completely untested) patch here is the adjustment I expect to be missing from our netfront incarnation. Could anyone give this a try?
Tried your patch with kernel version:
4.0.4-0.g4f5e0d5-xen
from factory. I still see the domU network getting disconnected. Just before it happens, I can see this on the domU console:
[ 88.637324] vif vif-0 eth0: Too many frags
This looks similar to the behaviour that is reported here:
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00404.html
but I don't see a message that the device gets disabled. As reported before, the behaviour started with kernel version 3.19.
Jan Beulich
Tried your patch with kernel version:
4.0.4-0.g4f5e0d5-xen
from factory. I still see the domU network getting disconnected. Just before it happens, I can see this on the domU console:
[ 88.637324] vif vif-0 eth0: Too many frags
looks similar to the behaviour that is reported here:
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00404.html
but I don't see a message that the device gets disabled. As reported before, the behaviour started with kernel version 3.19.
Interesting - I don't think any of the logs we got to see for this issue had "Too many frags" in them. Are you sure you saw this also prior to applying that patch? I'm asking because it could well be that we have two distinct issues here.
--- Comment #27 from Thomas Blume
Interesting - I don't think any of the logs we got to see for this issue had "Too many frags" in them. Are you sure you saw this also prior to applying that patch? I'm asking because it could well be that we have two distinct issues here.
Yes, I saw it before the patch. But it isn't always visible when the network hangs. I also have some boot debug options active: "systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M enforcing=0 debug"; these may be what makes the message visible at all.
Thomas Blume
--- Comment #28 from Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=919154#c29
boo35 boo35 <9b3e05a5@opayq.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |9b3e05a5@opayq.com
--- Comment #29 from boo35 boo35 <9b3e05a5@opayq.com> ---
I'm starting to hit this now - can't install up-to-date openSUSE guest kernels in online/hosted Xen VPSes. Looks like kernel 4.1 is close to upstream release; Tumbleweed and 13.2 will probably offer it soon. Will 4.1, as Xen host and/or guest, fix this issue?
http://bugzilla.suse.com/show_bug.cgi?id=919154#c30
--- Comment #30 from Jan Beulich
http://bugzilla.suse.com/show_bug.cgi?id=919154#c31
--- Comment #31 from Jan Beulich
http://bugzilla.suse.com/show_bug.cgi?id=919154#c32
--- Comment #32 from Jan Beulich
http://bugzilla.suse.com/show_bug.cgi?id=919154#c33
--- Comment #33 from Thomas Blume
Now that I finally got around to trying to repro this, there are two observations:
- I cannot reproduce any network hangs. Can anybody confirm these hangs still occur with kernel 4.2.x? If so, I would need as detailed a description as possible of the kind of load under which they occur (ideally also enumerating cases where they don't occur). Without being able to repro the issue I don't see a way to reasonably debug this myself.
I couldn't reproduce the network hang anymore with: kernel-xen-4.2.0-2.1.gc4d41ea.x86_64
- I can trigger the "Too many frags" messages, but having looked through old logs I can chase them back to at least summer 2013, i.e. I'm not convinced that issue is really related to the one here. Still I am looking into how to eliminate those.
Yes, I can see them too:
[ 51.113204] vif vif-0 eth0: Too many frags
[ 51.155701] vif vif-0 eth0: Too many frags
[ 51.678351] vif vif-0 eth0: Too many frags
[ 51.774121] vif vif-0 eth0: Too many frags
[ 52.388251] vif vif-0 eth0: Too many frags
[ 53.849132] vif vif-0 eth0: Too many frags
[ 53.850286] vif vif-0 eth0: Too many frags
but apparently they don't have an impact on the network hang.
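To compare runs (for example before and after a patch), the "Too many frags" warnings can be counted per interface from a saved dmesg or console capture. The following is only a sketch: the function name summarise_frag_warnings is made up here, and the regular expression assumes the "[ timestamp] vif vif-N ethN: Too many frags" layout seen in this thread.

```python
import re
from collections import Counter

# Matches dmesg-style lines such as:
#   [ 51.113204] vif vif-0 eth0: Too many frags
FRAGS_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+vif\s+\S+\s+(?P<dev>\S+):\s+Too many frags$"
)


def summarise_frag_warnings(log_text):
    """Return (warning count per interface, first timestamp per interface)."""
    counts = Counter()
    first_seen = {}
    for line in log_text.splitlines():
        match = FRAGS_RE.match(line.strip())
        if match is None:
            continue  # unrelated line, e.g. a net_ratelimit notice
        dev = match.group("dev")
        counts[dev] += 1
        first_seen.setdefault(dev, float(match.group("ts")))
    return counts, first_seen


SAMPLE = """\
[ 51.113204] vif vif-0 eth0: Too many frags
[ 51.155701] vif vif-0 eth0: Too many frags
[ 175.834838] net_ratelimit: 2 callbacks suppressed
"""

counts, first_seen = summarise_frag_warnings(SAMPLE)
print(dict(counts), first_seen)
```

Because the kernel rate-limits these messages ("net_ratelimit: N callbacks suppressed"), any count obtained this way is a lower bound rather than an exact figure.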
http://bugzilla.suse.com/show_bug.cgi?id=919154#c34
--- Comment #34 from Thomas Blume
Created attachment 646274 [details] debugging patch
This adds some debugging code as well as a simple (but not intended to be the final) solution to the backend, which for me fully eliminates all "Too many frags" messages. IOW this might help people here determine whether what they see is related to the "Too many frags" issue.
A permanent solution will require quite a bit of code re-work, so will take some more time.
Are you still interested in the debugging output, considering the comment above?
http://bugzilla.suse.com/show_bug.cgi?id=919154#c35
--- Comment #35 from Jan Beulich
Are you still interested in the debugging output, considering the comment above?
No, that would be of interest only if hangs are still visible. Once I've got a patch ready to deal with the "Too many frags" thing I'd of course appreciate it if you or others gave it a try before I merge it into any git branch.
http://bugzilla.suse.com/show_bug.cgi?id=919154#c36
Jan Beulich
http://bugzilla.suse.com/show_bug.cgi?id=919154#c37
--- Comment #37 from Thomas Blume
Created attachment 646780 [details] tentative fix
Please anyone interested give this a try.
I've tried the patch:
xen107:~ # rpm -q --changelog kernel-xen | head
* Fr Sep 11 2015 thomas.blume@suse.com
- patches.xen/xen-bsc919153-tentative.patch: tentative fix for bsc#919154
and there is no more network hang. Still, on heavy network load, the serial console shows:
-->--
xen107 login: [ 22.397057] SFW2-INext-ACC-TCP IN=eth0 OUT= MAC=00:16:3e:54:89:bd:fe:ff:ff:ff:ff:ff:08:00 SRC=192.168.58.1 DST=192.168.58.107 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=60644 DF PROTO=TCP SPT=46117 DPT=22 WINDOW=29200 RES=0x00 SYN URGP=0 OPT (020405B40402080A00652F5A0000000001030307)
[ 32.843109] vif vif-0 eth0: Too many frags
[ 33.170557] vif vif-0 eth0: Too many frags
[ 33.204960] vif vif-0 eth0: Too many frags
[ 95.551688] vif vif-0 eth0: Too many frags
[ 97.864372] vif vif-0 eth0: Too many frags
[ 100.584173] vif vif-0 eth0: Too many frags
[ 100.651322] vif vif-0 eth0: Too many frags
[ 105.076181] vif vif-0 eth0: Too many frags
[ 105.085163] vif vif-0 eth0: Too many frags
[ 105.153476] vif vif-0 eth0: Too many frags
[ 105.326148] vif vif-0 eth0: Too many frags
[ 105.341968] vif vif-0 eth0: Too many frags
[ 105.371905] vif vif-0 eth0: Too many frags
[ 108.714202] vif vif-0 eth0: Too many frags
[ 108.802537] vif vif-0 eth0: Too many frags
[ 108.805242] vif vif-0 eth0: Too many frags
[ 118.147660] vif vif-0 eth0: Too many frags
[ 144.409872] vif vif-0 eth0: Too many frags
[ 144.461613] vif vif-0 eth0: Too many frags
[ 154.009564] vif vif-0 eth0: Too many frags
[ 154.034452] vif vif-0 eth0: Too many frags
[ 154.163568] vif vif-0 eth0: Too many frags
[ 154.341685] vif vif-0 eth0: Too many frags
[ 162.973648] vif vif-0 eth0: Too many frags
[ 167.532139] vif vif-0 eth0: Too many frags
[ 167.660511] vif vif-0 eth0: Too many frags
[ 167.763044] vif vif-0 eth0: Too many frags
[ 168.316072] vif vif-0 eth0: Too many frags
[ 170.331552] vif vif-0 eth0: Too many frags
[ 170.334015] vif vif-0 eth0: Too many frags
[ 170.815484] vif vif-0 eth0: Too many frags
[ 170.897829] vif vif-0 eth0: Too many frags
[ 170.941495] vif vif-0 eth0: Too many frags
[ 175.834838] net_ratelimit: 2 callbacks suppressed
[ 175.834849] vif vif-0 eth0: Too many frags
[ 175.945815] vif vif-0 eth0: Too many frags
[ 182.603568] vif vif-0 eth0: Too many frags
[ 202.746753] vif vif-0 eth0: Too many frags
[ 202.790950] vif vif-0 eth0: Too many frags
[ 202.828798] vif vif-0 eth0: Too many frags
[ 203.567032] vif vif-0 eth0: Too many frags
[ 203.753478] vif vif-0 eth0: Too many frags
[ 203.763813] vif vif-0 eth0: Too many frags
[ 203.938512] vif vif-0 eth0: Too many frags
[ 203.948050] vif vif-0 eth0: Too many frags
[ 205.098703] vif vif-0 eth0: Too many frags
[ 208.339193] net_ratelimit: 4 callbacks suppressed
[ 208.339210] vif vif-0 eth0: Too many frags
[ 208.545938] vif vif-0 eth0: Too many frags
[ 208.666983] vif vif-0 eth0: Too many frags
[ 208.765865] vif vif-0 eth0: Too many frags
[ 209.182452] vif vif-0 eth0: Too many frags
[ 209.236549] vif vif-0 eth0: Too many frags
[ 209.317220] vif vif-0 eth0: Too many frags
[ 209.727166] vif vif-0 eth0: Too many frags
--<--
http://bugzilla.suse.com/show_bug.cgi?id=919154#c38
--- Comment #38 from Jan Beulich
http://bugzilla.suse.com/show_bug.cgi?id=919154#c39
--- Comment #39 from Thomas Blume
I would guess from what you provided that you updated the guest kernel only, while - the fix being to netback - it is the host kernel that needs updating.
Sorry, my test setup was trashed. But even after installing the kernel on both the guest and the host:
-->--
linux-1yyf:~ # modinfo netbk
filename: /lib/modules/3.16.7-27-xen/kernel/drivers/xen/netback/netbk.ko
alias: xen-backend:vif
license: Dual BSD/GPL
srcversion: 236C86330FED3938FF28BB0
depends: xenbus_be
intree: Y
vermagic: 3.16.7-27-xen SMP mod_unload modversions Xen
signer: home:tsaupe OBS Project
sig_key: CA:A3:D5:88:38:85:EA:DF:01:2B:58:EA:D7:23:02:98:2F:41:1D:C8
sig_hashalgo: sha256
parm: queue_length:ulong
parm: max_tx_slots:Maximum number of slots accepted in netfront TX requests (uint)
parm: copy_skb:Copy data received from netfront without netloop (bool)
parm: permute_returns:Randomly permute the order in which TX responses are sent to the frontend (bool)
parm: groups:Specify the number of tasklet pairs/threads to use (uint)
parm: tasklets:Use tasklets instead of kernel threads (invbool)
parm: bind:Bind kernel threads to (v)CPUs (bool)
--<--
I can see in the journal log of the guest:
-->--
Sep 15 15:48:00 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:01 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:01 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:02 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:04 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:06 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:08 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:08 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:09 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:11 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:14 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:14 xen106 kernel: vif vif-0 eth0: Too many frags
Sep 15 15:48:15 xen106 kernel: vif vif-0 eth0: Too many frags
--<--
when retrieving a big file via http. What information would you need from guest and/or host?
http://bugzilla.suse.com/show_bug.cgi?id=919154#c40
--- Comment #40 from Jan Beulich
But even after installing the kernel on both the guest and the host:
You quite clearly did not install a patched kernel in the host:
linux-1yyf:~ # modinfo netbk
filename: /lib/modules/3.16.7-27-xen/kernel/drivers/xen/netback/netbk.ko
alias: xen-backend:vif
license: Dual BSD/GPL
srcversion: 236C86330FED3938FF28BB0
depends: xenbus_be
intree: Y
vermagic: 3.16.7-27-xen SMP mod_unload modversions Xen
signer: home:tsaupe OBS Project
sig_key: CA:A3:D5:88:38:85:EA:DF:01:2B:58:EA:D7:23:02:98:2F:41:1D:C8
sig_hashalgo: sha256
parm: queue_length:ulong
parm: max_tx_slots:Maximum number of slots accepted in netfront TX requests (uint)
parm: copy_skb:Copy data received from netfront without netloop (bool)
parm: permute_returns:Randomly permute the order in which TX responses are sent to the frontend (bool)
parm: groups:Specify the number of tasklet pairs/threads to use (uint)
parm: tasklets:Use tasklets instead of kernel threads (invbool)
parm: bind:Bind kernel threads to (v)CPUs (bool)
Good that you provided this - the patch adds a new module parameter "always_coalesce", which isn't being shown here.
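The same check (is the patched netbk module really the one installed?) can be scripted by looking for the new parameter among the "parm:" lines of saved modinfo output. This is a sketch only: module_params is a hypothetical helper name, and the parsing assumes the "parm: name:description" layout shown in this thread.

```python
def module_params(modinfo_text):
    """Map parameter name -> description from the text printed by modinfo."""
    params = {}
    for line in modinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or key.strip() != "parm":
            continue  # not a parameter line (filename:, alias:, ...)
        name, _, description = value.strip().partition(":")
        params[name.strip()] = description.strip()
    return params


# Two shortened excerpts modelled on the outputs quoted in this thread.
UNPATCHED = """\
vermagic: 3.16.7-27-xen SMP mod_unload modversions Xen
parm: queue_length:ulong
parm: copy_skb:Copy data received from netfront without netloop (bool)
"""
PATCHED = """\
vermagic: 4.2.1-0.g9c7cacf-xen SMP mod_unload modversions Xen
parm: copy_skb:Copy data received from netfront without netloop (bool)
parm: always_coalesce:Always fully coalesce RX data (bint)
"""

for label, text in (("unpatched", UNPATCHED), ("patched", PATCHED)):
    print(label, "always_coalesce" in module_params(text))
```

The check only says which module file is installed on disk; whether that module is the one currently loaded in the host is a separate question.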
http://bugzilla.suse.com/show_bug.cgi?id=919154#c41
--- Comment #41 from Thomas Blume
(In reply to Thomas Blume from comment #39)
But even after installing the kernel on both the guest and the host:
You quite clearly did not install a patched kernel in the host:
Sorry, I made a mistake building the patch. I've corrected it now:
-->--
linux-1yyf:~ # modinfo netbk
filename: /lib/modules/4.2.1-0.g9c7cacf-xen/kernel/drivers/xen/netback/netbk.ko
alias: xen-backend:vif
license: Dual BSD/GPL
srcversion: 6F56DC12C8AC1A02648D9D0
depends: xenbus_be
intree: Y
vermagic: 4.2.1-0.g9c7cacf-xen SMP mod_unload modversions Xen
parm: queue_length:ulong
parm: max_tx_slots:Maximum number of slots accepted in netfront TX requests (uint)
parm: copy_skb:Copy data received from netfront without netloop (bool)
parm: always_coalesce:Always fully coalesce RX data (bint)
parm: permute_returns:Randomly permute the order in which TX responses are sent to the frontend (bool)
parm: groups:Specify the number of tasklet pairs/threads to use (uint)
parm: tasklets:Use tasklets instead of kernel threads (invbool)
parm: bind:Bind kernel threads to (v)CPUs (bool)
--<--
and I can confirm that there are no more "Too many frags" visible.
http://bugzilla.suse.com/show_bug.cgi?id=919154#c44
Jan Beulich