Dear OpenSuse Team:

Earlier today I sent a request to the list about a 42.3 DomU crashing. Olaf replied, and I've installed the new kernel, and I'll watch and see. I'm very grateful for the help.

I'm sorry to post a second question, but I'm having a similar-but-different problem on a different host and guest, and have reached an impasse.

A few weeks ago, I took a copy of our crashy 42.3 DomU guest and copied it to a new guest, just making a copy of the disk, changing the name and IP address, and booting it on a different physical host. I then did zypper dup from 42.3->15.0->15.1. This was intended as a "test run", if you like, to predict how client software would react to the upgrade.

So now I have an upgraded *copy* of my machine, running 15.1. All patches applied. And it's running on a different host, which was a fresh load of 15.1, also with all patches applied.

    Linux host 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64 x86_64 x86_64 GNU/Linux

This guest has a problem as well, in that, under sustained high network/disk loads, the guest freezes up completely. This happened twice today - I can pretty much *make* it happen just by starting a local rsync (i.e. on a crossover cable) of its main big data partition (3TB).... about every other attempt to copy the entire partition via rsync over ssh will freeze the guest.

I get the same annoyingly terse message on the physical host:

    [92630.531549] vif vif-6-0 vif6.0: Guest Rx stalled
    [92630.531613] br0: port 2(vif6.0) entered disabled state

but, unlike my 42.3 guest, this one gives *no* log outputs or data at all on the guest. No BUG, no CPU lockup, no kernel traceback, nothing. I left a high priority shell on the hvc0 console, which, when the 42.3 guest had its problem, was still sort of responsive, and I left "top -n 1; sleep 15" running in a while true loop on it... but it was completely frozen. I could see the final top before the hang, and there was nothing to suggest a problem.

The guest just... hangs. Unlike the frozen 42.3 guest, which showed pretty much continuous "run" state, the 15.1 guest seems to do the more-or-less "normal" behavior in xentop - switching between "b" and "r" modes, and showing normal utilization patterns. But the guest itself is stuck tight.

I have seen mentions about the grant frames issue, and I did apply the higher value to the host and guests:

    # xen-diag gnttab_query_size 0    # Domain-0
    domid=0: nr_frames=1, max_nr_frames=64
    # xen-diag gnttab_query_size 1    # Xenstore
    domid=1: nr_frames=4, max_nr_frames=4
    # xen-diag gnttab_query_size 6    # My guest
    domid=6: nr_frames=17, max_nr_frames=256

but this is still happening.

Now here's the crazy part: I sat around trying to poke at the frozen guest and try different things before destroying it, and, skimming down my "xl" choices, I found "xl trigger". I had already tried pausing and unpausing the guest - that did nothing. But when I tried xl trigger (at random I tried the first option, so: xl trigger 6 nmi), the guest CAME BACK ONLINE! It said this:

    Uhhuh. NMI received for unknown reason 00 on CPU 0.
    Do you have a strange power saving mode enabled?
    Dazed and confused, but trying to continue

on the console. I also saw it in /var/log/messages, followed by:

    clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
    clocksource: 'xen' wd_now: 554c072567f2 wd_last: 54137c19cb3c mask: ffffffffffffffff
    clocksource: 'tsc' cs_now: 2d696bb78816d4 cs_last: 2d6640097d695e mask: ffffffffffffffff
    tsc: Marking TSC unstable due to clocksource watchdog

On the host in /var/log/messages, I saw:

    [93760.637546] vif vif-6-0 vif6.0: Guest Rx ready
    [93760.637595] br0: port 2(vif6.0) entered blocking state
    [93760.637598] br0: port 2(vif6.0) entered forwarding state

And, apart from the rsync/sshd processes (which I suspect the remote side had given up on), everything else came right back online.

MySQL, for example, was still running on the guest without issue; in fact, apart from the log entries I cite above, there was no indication that the machine had even been broken. The 5- and 10-minute load averages were way up in the 30s... but everything else was fine. Prior to the freeze, the guest was continuously showing a load average of about 3.0 - with the rsync and sshd processes in run mode, and that's it - just as I'd expect.

The guest is provisioned thusly:

    name="gggv"
    description="gggv"
    uuid="13289776-1c74-9ade-4242-8f7453249832"
    memory=90112
    maxmem=90112
    vcpus=26
    cpus="4-31"
    on_poweroff="destroy"
    on_reboot="restart"
    on_crash="restart"
    on_watchdog="restart"
    localtime=0
    keymap="en-us"
    type="pv"
    kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
    extra="elevator=noop"
    disk=[
        '/b/xen/gggv/gggv.root,raw,xvda1,w',
        '/b/xen/gggv/gggv.swap,raw,xvda2,w',
        '/b/xen/gggv/gggv.xa,raw,xvdb1,w',
        ]
    vif=[
        'mac=00:16:3f:04:05:41,bridge=br0',
        'mac=00:16:3f:04:05:42,bridge=br1',
        ]
    vfb=['type=vnc,vncunused=1']

and is also the only guest running on its host. The host has:

    GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256"

and is in every other respect an essentially fresh 15.1 load.

I'm thinking that this is a different problem than my 42.3 guest problem, but I don't know what to do with it.

My next move was to make sure my hardware (and data, and OS!) were okay. So I moved the root filesystem of my upgraded guest aside, and did a fresh load of 15.1 onto a new root filesystem. When I use *that* to boot my guest, it seems to be stable. High network activity does not appear to stop it - I've done 5 or 6 copies of my huge filesystem in that mode without issue. Of course I'd like to do more cycles to be sure, but it seems stable compared to when the upgraded root is in place, when I can make the machine freeze up on almost every (or every other) copy attempt.

The only thing I can think of that is different here, then, would be that, maybe, since the guest has been zypper dup'ed over time all the way back from 13.2 (the last time it was built fresh), that maybe it's inherited some old garbage that could be causing this. It seems to me that a zypper dup'ed guest "should" work properly, especially when it is the same version and kernel as the physical host; but, again (sorry) I have these freezes.

So just for laughs, I ran an lsmod in both modes, and sorted and diffed them:

The "clean" guest (which appears to be stable), has these four kernel modules not present on the upgraded guest:

    iptable_raw nf_conntrack_ftp nf_nat_ftp xt_CT

The "dup'ped" guest (which seems to be crashable on a large local rsync) has these modules not present on a clean install:

    auth_rpcgss br_netfilter bridge grace intel_rapl ipt_MASQUERADE
    llc lockd nf_conntrack_netlink nf_nat_masquerade_ipv4 nfnetlink
    nfs_acl nfsd overlay sb_edac stp sunrpc veth xfrm_algo xfrm_user
    xt_addrtype xt_nat

Both guests share these additional sysctl.conf settings:

    kernel.panic = 5
    vm.panic_on_oom = 2
    vm.swappiness = 0
    net.ipv6.conf.all.autoconf = 0
    net.ipv6.conf.default.autoconf = 0
    net.ipv6.conf.eth0.autoconf = 0
    net.ipv4.tcp_fin_timeout = 10
    net.ipv4.tcp_tw_reuse = 0

The dup'ped guest has these additional sysctl.conf settings:

    net.ipv4.tcp_tw_recycle = 0
    net.core.netdev_max_backlog=300000
    net.core.somaxconn = 2048
    net.core.rmem_max=67108864
    net.core.wmem_max=67108864
    net.ipv4.ip_local_port_range=15000 65000
    net.ipv4.tcp_sack=0
    net.ipv4.tcp_rmem=4096 87380 67108864
    net.ipv4.tcp_wmem=4096 65536 67108864

all of which have, more or less, worked well in the past (when everything was on 42.3) and may or may not be relevant here.

I'm sorry, I feel like I'm missing something obvious here, but I can't see it. I would be grateful for any guidance or insights into this.

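In case anyone wants to repeat the module comparison described above on their own guests, here is a rough sketch of the sort-and-diff step in Python. It assumes the usual lsmod layout (a "Module Size Used by" header, then one module per line with the name in the first column). The function names and the sample input are made up for illustration; the sizes and the "example_common" module are placeholders, not data from my machines.

```python
def module_names(lsmod_text):
    """Return the set of module names (first column) from lsmod-style output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def compare_modules(clean_text, dupped_text):
    """Return (only_on_clean, only_on_dupped) as sorted lists of module names."""
    clean = module_names(clean_text)
    dupped = module_names(dupped_text)
    return sorted(clean - dupped), sorted(dupped - clean)


# Illustrative input only: sizes and use counts are placeholders.
CLEAN_SAMPLE = """Module                  Size  Used by
xt_CT                      0  0
iptable_raw                0  0
example_common             0  0
"""
DUPPED_SAMPLE = """Module                  Size  Used by
nfsd                       0  0
veth                       0  0
example_common             0  0
"""

only_clean, only_dupped = compare_modules(CLEAN_SAMPLE, DUPPED_SAMPLE)
print("only on clean guest:  ", only_clean)
print("only on dup'ped guest:", only_dupped)
```

Saving the real lsmod output from each boot to a file and passing the file contents to compare_modules() should give the same two lists as a sort plus diff.
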
Yes, in addition to trying to upgrade my client in place to 15.1, I could just build a new guest by hand, but that would be even more time-consuming and seems like it should not be necessary.

If I might quote from the kernel, "Dazed and confused, but trying to continue" is exactly how I'm feeling here. Why could this guest be hanging? Why does an NMI bring it back? What should I do next? Anything anyone would be willing to point me to or suggest would be gratefully appreciated.

Glen