[opensuse-virtual] 15.1 Xen DomU freezing under high network/disk load, recoverable with NMI trigger

21 Dec 2019

Dear openSUSE Team:

Earlier today I sent a request to the list about a 42.3 DomU crashing.
Olaf replied, and I've installed the new kernel, and I'll watch and
see.  I'm very grateful for the help.  I'm sorry to post a second
question, but I'm having a similar-but-different problem on a
different host and guest, and have reached an impasse.

A few weeks ago, I took a copy of our crashy 42.3 DomU guest, and
copied it to a new guest, just making a copy of the disk, and changing
the name and IP address and booting it on a different physical host.
I then did zypper dup from 42.3->15.0->15.1.   This was intended as a
"test run", if you like, to predict how client software would react to
the upgrade.  So now I have an upgraded *copy* of my machine, running
15.1.  All patches applied.  And it's running on a different host,
which was a fresh load of 15.1, also with all patches applied.

Linux host 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64 x86_64 x86_64 GNU/Linux

This guest has a problem as well, in that, under sustained high
network/disk loads, the guest freezes up completely.  This happened
twice today - I can pretty much *make* it happen just by starting a
local rsync (i.e. on a crossover cable) of its main big data
partition (3TB).... about every other attempt to copy the entire
partition via rsync over ssh will freeze the guest.   I get the same
annoyingly terse message on the physical host:

[92630.531549] vif vif-6-0 vif6.0: Guest Rx stalled
[92630.531613] br0: port 2(vif6.0) entered disabled state

but, unlike my 42.3 guest, this one gives *no* log outputs or data at
all on the guest.  No BUG, no CPU lockup, no kernel traceback,
nothing.  I left a high priority shell on the hvc0 console, which,
when the 42.3 guest had its problem, was still sort of responsive, and
I left "top -n 1; sleep 15" running in a while true loop on it... but
it was completely frozen.  I could see the final top before the hang,
and there was nothing to suggest a problem.  The guest just... hangs.

Unlike the frozen 42.3 guest, which showed pretty much continuous
"run" state, the 15.1 guest seems to do the more-or-less "normal"
behavior in xentop - switching between "b" and "r" modes, and showing
normal utilization patterns.  But the guest itself is stuck tight.

I have seen mentions about the grant frames issue, and I did apply the
higher value to the host and guests:
# xen-diag gnttab_query_size 0 # Domain-0
domid=0: nr_frames=1, max_nr_frames=64
# xen-diag gnttab_query_size 1 # Xenstore
domid=1: nr_frames=4, max_nr_frames=4
# xen-diag gnttab_query_size 6 # My guest
domid=6: nr_frames=17, max_nr_frames=256
but this is still happening.
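
For anyone who wants a quick utilisation figure out of those lines,
here is a minimal sketch.  gnttab_usage is a hypothetical helper name
(not part of Xen), and it assumes the output format stays exactly as
shown above ("domid=N: nr_frames=A, max_nr_frames=B").

```shell
#!/bin/sh
# gnttab_usage (hypothetical helper): read `xen-diag gnttab_query_size`
# output lines on stdin and print grant-frame usage per domain.
# Assumes the "domid=N: nr_frames=A, max_nr_frames=B" format shown above.
gnttab_usage() {
    awk -F'[=:, ]+' '
        $1 == "domid" && $3 == "nr_frames" && $6 > 0 {
            # $2 = domain id, $4 = frames in use, $6 = configured ceiling
            printf "domid=%s used=%d max=%d pct=%d\n", $2, $4, $6, int($4 * 100 / $6)
        }'
}

# Example (on the host):
#   xen-diag gnttab_query_size 6 | gnttab_usage
```

Fed the domid=6 line above, this prints "domid=6 used=17 max=256
pct=6", i.e. the guest was using only a small fraction of its grant
frames when queried.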

Now here's the crazy part:

I sat around trying to poke at the frozen guest and try different
things before destroying it, and, skimming down my "xl" choices, I
found "xl trigger".  I had already tried pausing and unpausing the
guest - that did nothing.  But when I tried xl trigger (at random I
tried the first option, so:  xl trigger 6 nmi), the guest CAME BACK
ONLINE!  It said this:

Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

on the console.  I also saw it in /var/log/messages, followed by:

clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'xen' wd_now: 554c072567f2 wd_last: 54137c19cb3c mask: ffffffffffffffff
clocksource: 'tsc' cs_now: 2d696bb78816d4 cs_last: 2d6640097d695e mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
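
The hex counters in those watchdog lines can be subtracted to see how
long the watchdog went without running.  This is only a rough sketch:
it assumes the 'xen' clocksource counts nanoseconds, and the variable
names are made up for illustration.

```shell
#!/bin/sh
# Values copied verbatim from the clocksource watchdog lines above.
wd_now=0x554c072567f2
wd_last=0x54137c19cb3c
cs_now=0x2d696bb78816d4
cs_last=0x2d6640097d695e

# Assumption: the 'xen' clocksource counts nanoseconds, so this
# difference is the time between two consecutive watchdog runs.
xen_delta_ns=$(( $wd_now - $wd_last ))
xen_delta_s=$(( xen_delta_ns / 1000000000 ))

# TSC cycles elapsed over the same gap, and the implied TSC rate
# in MHz (cycles per microsecond).
tsc_delta=$(( $cs_now - $cs_last ))
tsc_mhz=$(( tsc_delta * 1000 / xen_delta_ns ))

echo "xen clocksource gap: ${xen_delta_s} s"
echo "implied TSC rate:    ${tsc_mhz} MHz"
```

This works out to a gap of about 1342 seconds (roughly 22 minutes)
and an implied TSC rate of about 2597 MHz.  If the host CPU's TSC
really does run near 2.6 GHz (an assumption; the CPU model is not
stated here), the two clocks agree fairly well over the gap, which
might mean the "skew" complaint is a side effect of the watchdog not
running during the freeze rather than a TSC that is actually drifting.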

On the host in /var/log/messages, I saw:

[93760.637546] vif vif-6-0 vif6.0: Guest Rx ready
[93760.637595] br0: port 2(vif6.0) entered blocking state
[93760.637598] br0: port 2(vif6.0) entered forwarding state

And, apart from the rsync/sshd processes (which I suspect the remote
side had given up on), everything else came right back online.  MySQL,
for example, was still running on the guest without issue; in fact,
apart from the log entries I cite above, there was no indication that
the machine had even been broken.  The 5- and 10-minute load averages
were way up in the 30s... but everything else was fine.
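
The host-side timestamps on the "Guest Rx stalled" and "Guest Rx
ready" lines give the length of the outage as the host saw it.  A
minimal sketch for pulling that out of a kernel log follows;
rx_stall_seconds is a hypothetical helper name, and it assumes each
line carries a bracketed uptime timestamp as in the excerpts above.

```shell
#!/bin/sh
# rx_stall_seconds (hypothetical helper): read host kernel log lines on
# stdin and print, per vif, the seconds between "Guest Rx stalled" and
# the following "Guest Rx ready".
rx_stall_seconds() {
    awk '
        /Guest Rx (stalled|ready)/ {
            # Timestamp: the number between the first "[" and "]".
            s = index($0, "["); e = index($0, "]")
            if (s == 0 || e < s) next
            t = substr($0, s + 1, e - s - 1) + 0
            # Interface name: the field just before "Guest", minus the colon.
            for (i = 2; i <= NF; i++) if ($i == "Guest") iface = $(i - 1)
            sub(/:$/, "", iface)
            if ($0 ~ /Guest Rx stalled/) {
                stalled[iface] = t
            } else if (iface in stalled) {
                printf "%s %.3f\n", iface, t - stalled[iface]
                delete stalled[iface]
            }
        }'
}

# Example (on the host):
#   dmesg | rx_stall_seconds
```

For the two vif6.0 lines quoted above this prints "vif6.0 1130.106",
so the host considered the interface stalled for just under 19
minutes before the NMI brought the guest back.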

Prior to the freeze, the guest was continuously showing a load average
of about 3.0 - with the rsync and sshd processes in run mode, and
that's it - just as I'd expect.  The guest is provisioned thusly:

name="gggv"
description="gggv"
uuid="13289776-1c74-9ade-4242-8f7453249832"
memory=90112
maxmem=90112
vcpus=26
cpus="4-31"
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
extra="elevator=noop"
disk=[
        '/b/xen/gggv/gggv.root,raw,xvda1,w',
        '/b/xen/gggv/gggv.swap,raw,xvda2,w',
        '/b/xen/gggv/gggv.xa,raw,xvdb1,w',
        ]
vif=[
        'mac=00:16:3f:04:05:41,bridge=br0',
        'mac=00:16:3f:04:05:42,bridge=br1',
        ]
vfb=['type=vnc,vncunused=1']

and is also the only guest running on its host.  The host has:

GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256"

and is in every other respect an essentially fresh 15.1 load.

I'm thinking that this is a different problem than my 42.3 guest
problem, but I don't know what to do with it.

My next move was to make sure my hardware (and data, and OS!) were
okay.  So I moved the root filesystem of my upgraded guest aside, and
did a fresh load of 15.1 onto a new root filesystem.  When I use
*that* to boot my guest, it seems to be stable.  High network activity
does not appear to stop it - I've done 5 or 6 copies of my huge
filesystem in that mode without issue.  Of course I'd like to do more
cycles to be sure, but it seems stable compared to when the upgraded
root is in place, when I can make the machine freeze up on almost
every (or every other) copy attempt.

The only difference I can think of, then, is that the guest has been
zypper dup'ed over time all the way back from 13.2 (the last time it
was built fresh), so maybe it has inherited some old garbage that
could be causing this.  It seems to me that a zypper dup'ed guest
"should" work properly, especially when it is the same version and
kernel as the physical host; but, again (sorry), I have these freezes.

So just for laughs, I ran an lsmod in both modes, and sorted and diffed them:

The "clean" guest (which appears to be stable), has these four kernel
modules not present on the upgraded guest:

iptable_raw
nf_conntrack_ftp
nf_nat_ftp
xt_CT

The "dup'ed" guest (which seems to be crashable on a large local
rsync) has these modules not present on a clean install:

auth_rpcgss
br_netfilter
bridge
grace
intel_rapl
ipt_MASQUERADE
llc
lockd
nf_conntrack_netlink
nf_nat_masquerade_ipv4
nfnetlink
nfs_acl
nfsd
overlay
sb_edac
stp
sunrpc
veth
xfrm_algo
xfrm_user
xt_addrtype
xt_nat

Both guests share these additional sysctl.conf settings:

kernel.panic = 5
vm.panic_on_oom = 2
vm.swappiness = 0
net.ipv6.conf.all.autoconf = 0
net.ipv6.conf.default.autoconf = 0
net.ipv6.conf.eth0.autoconf = 0
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 0

The dup'ed guest has these additional sysctl.conf settings:

net.ipv4.tcp_tw_recycle = 0
net.core.netdev_max_backlog=300000
net.core.somaxconn = 2048
net.core.rmem_max=67108864
net.core.wmem_max=67108864
net.ipv4.ip_local_port_range=15000 65000
net.ipv4.tcp_sack=0
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 65536 67108864

all of which have, more or less, worked well in the past (when
everything was on 42.3) and may or may not be relevant here.
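
Because the settings above mix "key = value" and "key=value"
spellings, a plain diff of the two sysctl.conf files can be noisy.
One way to compare them is sketched below; sysctl_normalise and
sysctl_only_in are hypothetical helper names, and the file paths in
the example are illustrative.

```shell
#!/bin/sh
# sysctl_normalise / sysctl_only_in are hypothetical helper names.

# Print the settings from a sysctl.conf-style file as sorted "key=value"
# lines: comment and blank lines dropped, spaces around "=" removed.
sysctl_normalise() {
    sed -e '/^[[:space:]]*[#;]/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' "$1" | LC_ALL=C sort -u
}

# Print settings that appear in file $1 but not in file $2.
sysctl_only_in() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    sysctl_normalise "$1" > "$a"
    sysctl_normalise "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example, with a copy of each guest's file (paths are illustrative):
#   sysctl_only_in dup/sysctl.conf clean/sysctl.conf
```

This compares only what is written in the files; it says nothing
about values set at runtime by other means.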

I'm sorry, I feel like I'm missing something obvious here, but I can't
see it.  I would be grateful for any guidance or insights into this.
Yes, in addition to trying to upgrade my client in place to 15.1, I
could just build a new guest by hand, but that would be even more
time-consuming and seems like it should not be necessary.  If I might
quote from the kernel, "Dazed and confused, but trying to continue" is
exactly how I'm feeling here.  Why could this guest be hanging?  Why
does an NMI bring it back?  What should I do next?  Anything anyone
would be willing to point me to or suggest would be greatly
appreciated.

Glen