[opensuse-virtual] Xen 4.12 DomU hang / freeze / stall under high network/disk load (single thread now, and updates)
Dear OpenSuse Team: This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this. What follows below is an update/new report on this problem, which is now no longer limited to just me, and which I can now duplicate even on fresh loads. Here we go: Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity. Symptoms on the DomU Guest: 1. Guest machine performs normally until the moment of failure. No abnormal log/console entries exist. 2. At the moment of failure, the guest's network goes offline. No abnormal log/console entries are written at that moment. 3. Processes which were trying to connect to the network start to consume increasing amounts of CPU. 4. Load average of the guest starts to increase, continuing upward without apparent bound. 5. If a high-priority bash shell is left logged in on the guest hvc0 console, some commands might still be runnable; most are not. 6. If the guest console is not logged in, the console is frozen and doesn't even echo characters. 7. Some guests will output messages on the console like this: kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! 8. On some others, I will also see output like: BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s! 9. Sometimes there is no output at all on the console. Symptoms on the Dom0 Host: The host is unaffected. The only indication anything is happening on the host are two log entries in /var/log/messages: vif vif-6-0 vif6.0: Guest Rx stalled br0: port 2(vif6.0) entered disabled state Circumstances when the problem first occurred: 1. All hosts and guests were previously on OpenSuse 42.3 (Linux 4.4.180, Xen 4.9.4) 2. I upgraded one physical host to OpenSuse 15.1 (Linux 4.12.14, Xen 4.12.1). 3. The guest(s) started malfunctioning at that point. Immediate steps taken while the guest was stalled, which did not help: 1. Tried to use high-priority shell on guest console to kill high-CPU processes; they were unkillable. 2. Tried to use guest console to stop and restart network; commands were unresponsive. 3. Tried to use guest console to shutdown/init 0. This caused console to be terminated, but guest would not otherwise shutdown. 4. Tried to use host xl interface to unplug/replug network bridges. This appeared to work from host side, but guest was unaffected. One thing which I accidentally discovered that *did* help: 1. Tried ending xl trigger nmi from the host to the guest. When I trigger the stalled guest with an NMI, I get its attention. The guest will print the following on the console: Uhhuh. NMI received for unknown reason 00 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue In some cases (pattern not yet known), the guest will then immediately come back online: The network will come back online, and all processes will slowly stop consuming CPU, and things will return to normal. Existing network connections were obviously terminated, but new connections are accepted. In that case, it's like the guest just magically comes back to life. 
When this works, the host log shows:

   vif vif-6-0 vif6.0: Guest Rx ready
   br0: port 2(vif6.0) entered blocking state
   br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled. However, this is not reliable. In some cases, the guest will print those messages, but the processes will NOT recover, and the network will come back impaired, or not at all. When that happens, repeated NMIs do not help: if the guest doesn't recover the first time, it doesn't recover at all. The *only* reliable way to fix this is to destroy the guest completely and recreate it. The guest will then run fine... until the next stall. But of course a hard destroy can't be a healthy thing for a guest machine, and that's really not a solution.

Long-term mitigation steps which were tried, and which did not help:

1. Thought this was an SSH bug (since sshd processes were consuming high CPU); installed the latest OpenSSH.
2. Thought maybe this was a PV problem; tried running under HVM instead of PV.
3. Noted a problem with grant frames and applied the recommended fix for that; my config now looks like:

   # xen-diag gnttab_query_size 0      # Domain-0
   domid=0: nr_frames=1, max_nr_frames=64
   # xen-diag gnttab_query_size 1      # Xenstore
   domid=1: nr_frames=4, max_nr_frames=4
   # xen-diag gnttab_query_size 6      # My guest
   domid=6: nr_frames=17, max_nr_frames=256

4. Thought maybe a kernel module might be at issue; reviewed the module list with the OpenSuse team.
5. Thought this might be a kernel mismatch; was referred to a newer kernel by the OpenSuse team (4.12.13 for OpenSuse 42.3). That changed some of the console output behavior and logging, but did not solve the problem.
6. Thought this might be a general OS mismatch; tried upgrading the guest to OpenSuse 15.1/Linux 4.12.14/Xen 4.12.1. In this configuration, no console or log output is generated on the guest at all; it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just "wrong" with my guest; tried a fresh load of the host, and a fresh guest. I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug. Steps to reproduce (a condensed sketch follows at the end of this message):

1. Get a server. I'm using a Dell PowerEdge R720, but this has happened on several different Dell models. My current server has two 16-core CPUs and 128GB of RAM.
2. Load OpenSuse 15.1 (which includes Xen 4.12.1) on the server. Boot it up in Xen Dom0/host mode.
3. Create a new guest machine, also with 15.1/4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end into some other Linux machine.
7. From that other machine, start pounding the guest. An rsync of the entire data partition is a great way to trigger this. If I run several outbound rsyncs together, I can crash my guest in under 48 hours. If I run 4 or 5, I can often crash the guest in just 2 hours. If you don't want to damage the SSDs on your other machine, here's my current command (my host is 192.168.1.10 and my guest is 192.168.1.11, so I plug in some other machine, make it, say, 192.168.1.12, and then run):

   nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

   where /a is my directory full of user data. 4-6 of these running simultaneously will bring the guest to its knees in short order.

On my most recent test, I did the NMI trigger thing, and found this in the guest's /var/log/messages after sending the trigger (I've removed tagging and timestamps for clarity):

   Uhhuh. NMI received for unknown reason 00 on CPU 0.
   Do you have a strange power saving mode enabled?
   Dazed and confused, but trying to continue
   clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
   clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask: ffffffffffffffff
   clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff
   tsc: Marking TSC unstable due to clocksource watchdog
   BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
   Showing busy workqueues and worker pools:
   workqueue events: flags=0x0
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
       pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
   workqueue mm_percpu_wq: flags=0x8
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
       pending: vmstat_update
   workqueue writeback: flags=0x4e
     pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
       in-flight: 28593:wb_workfn
   workqueue kblockd: flags=0x18
     pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
       pending: blk_mq_run_work_fn, blk_mq_timeout_work
   pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this: https://wiki.debian.org/Xen/Clocksource , which describes a guest hanging with the message "clocksource/0: Time went backwards". Although I did not see this message, and this is not directly on point for OpenSuse (since our /proc structure doesn't include some of the switches mentioned), I did notice the clocksource references in the logs (see above), and that led me back to https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/ch..., and specifically the tsc_mode setting. I have no idea if it's relevant, but since I'm out of ideas and have nothing better to try, I have now booted my guest with tsc_mode=1 and am stress testing it to see if it fares any better this way.

I had originally thought that I was the only person with this problem, and that's why I thought a fresh guest would fix it - the problem followed me around different servers, so that made sense. Over the past weeks I've set up a fresh guest on my fresh host, and, just on a whim, did the above stress testing on it... it lasted for 36 hours. That led me to start searching the net again, and I found that, just in the past few weeks, another person has reported what seems to be the same problem, only he reported it to xen-users (so I assume he's on a different distro - see https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html for the overall message). Nobody has responded to him yet... but I'm about to... I'm going to send this report there too. He states that the problem is limited to Xen 4.12 and 4.13, and that rolling back to Xen 4.11 solves the problem.

Meanwhile, I'm hoping that these updated details and history spark something new for some of you here. Do any of you have any ideas on this? Any thoughts, guidance, musings, etc. - anything at all would be appreciated. As I said, I'm sending this to Xen now too. But since it never occurred to me that I could upgrade 42.3 to a 4.12 kernel - that was a surprise to me! - I guess there is another question I'd like to ask this group: Is it possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that?

Again, thank you all for your patience and help, I am very grateful!

Glen
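A condensed sketch of the stress test and recovery trigger described above, for anyone trying to reproduce this. The IP addresses, the /a data path, and domid 6 are from my setup and are only placeholders; xl list and xl trigger are standard xl subcommands:

   # On the helper machine (192.168.1.12): launch 5 parallel full-tree pulls
   # of the guest's data partition, discarding the bytes.
   for i in 1 2 3 4 5; do
       nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
   done

   # On the Dom0 host, once the guest has stalled:
   xl list                      # confirm the domid of the stalled guest (6 here)
   xl trigger 6 nmi             # send the NMI that sometimes revives the guest
   tail -f /var/log/messages    # watch for "Guest Rx ready" on the vif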
On Thu, 13 Feb 2020 12:44:12 -0800, Glen <glenbarney@gmail.com> wrote:

TL;DR:
Is it possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that?
There are various variants of unmodified xen.git#staging-N.M snapshots available:

   zypper ar -cf \
     http://download.opensuse.org/repositories/home:/olh:/xen-buildrequires/openS... \
     xen_buildrequires
   zypper ar -cf \
     http://download.opensuse.org/repositories/home:/olh:/xen-4.11/SLE_15 \
     xen_411
   zypper dup --allow-vendor-change --from xen_buildrequires --from xen_411

This repo also contains snapshots of libvirt.git#master, qemu.git#master and qemu.git#stable-N.M. The extra qemu packages are not strictly required, because xen.git already includes a private copy of some qemu.git snapshot. The included libvirt snapshot may not work as expected, and tools that rely on libvirt may not work with it either.

Olaf
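A quick sketch of verifying the result after the vendor change; the rpm and xl invocations are standard, and xl info reflects the running hypervisor only after a reboot into it:

   rpm -q xen xen-tools           # confirm the installed package versions
   xl info | grep xen_version     # confirm the hypervisor version actually running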
On Mon, Feb 17, 2020 at 4:28 AM Olaf Hering <olaf@aepfle.de> wrote:
There are various variants of unmodified xen.git#staging-N.M snapshots available: [...]

Olaf
Olaf - Thank you so much! I will look into these right away!

All - There has been a bit of discussion on the Xen list about this. No developers have looked at it yet, but the current thinking seems to be that there is a bug that was introduced in 4.12. Another user reported that rolling back to 4.11 has fixed things.

I have a dedicated host/guest pair that I'm using for testing. Under a fresh host and guest load of OpenSuse 15.1/Xen 4.12, the guest ran under my contrived stress testing at a higher load average, and would stall in 24-48 hours. I reloaded just the host to OpenSuse 15.0/Xen 4.10 and repeated the same test. The guest (still at 15.1) is showing a lower load average, *subjectively* seems to be performing much better than it did under 15.1/4.12... and has survived into day 3 so far without stalling. I have no idea what this means yet, but I hope I can get some insight from the Xen team as testing continues.

Meanwhile, Olaf, thank you for this.

Glen
Dear OpenSuse Team: On Thu, Feb 13, 2020 at 12:44 PM Glen <glenbarney@gmail.com> wrote:
This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this.
Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.
I wanted to report back here and let you all know what we've found so far. After I raised this on the xen-users list, a number of other people stepped in and said that they were having similar problems. Guided by members of their community, we've done a bit of poking and testing. You can see all the details in their archive ( https://lists.xenproject.org/archives/html/xen-users/2020-02/ ), but the short of it is:

1. Several people had the same problem, where guests randomly stall/freeze.
2. The problem seems NOT to be related to OpenSuse itself, or OpenSuse version, or Linux Kernel version.
3. The problem DOES seem to be related to Xen version, and to a specific module, the "credit-scheduler-2".

Reverting to any Xen prior to Xen 4.12 fixes the problem (thank you Olaf!) but that's suboptimal in terms of wanting to run the latest software versions (or, more to the point, the production versions that come with the Leap releases.)

With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:

   GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"

Adding that last parameter causes Xen to boot with the older "credit scheduler" instead of the newer "credit2 scheduler", and that seems to resolve the problem for everyone. (I'm still running longer stress tests on my guests, but the results are encouraging so far.) A sketch of the full switch-and-verify procedure follows at the end of this message.

Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important.

I'm now inquiring of their users list when and how to file a bug report for this, and I'll continue to try to work with them, but I wanted to get this back to this group and list in case anyone else needs this info, and/or in case anyone here has any comments or additional guidance.

Thank you again to all of you who have helped me during this extended incident. I am very grateful to this community for all of your help!

Glen
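For anyone applying this on Leap 15.1, a minimal sketch of the full procedure, assuming the standard openSUSE grub2 layout; the expected scheduler strings in the comments come from Xen's boot log:

   # 1. Append sched=credit to GRUB_CMDLINE_XEN in /etc/default/grub, then:
   grub2-mkconfig -o /boot/grub2/grub.cfg
   reboot

   # 2. After the reboot, confirm the hypervisor booted with the credit scheduler:
   xl dmesg | grep -i scheduler   # expect "SMP Credit Scheduler" (credit), not credit2
   xl cpupool-list                # the Sched column should read "credit"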
On Mon, 2020-02-17 at 18:39 -0800, Glen wrote:
Dear OpenSuse Team:
Hello here as well, :-)
1. Several people had the same problem, where guests randomly stall/freeze.
2. The problem seems NOT to be related to OpenSuse itself, or OpenSuse version, or Linux Kernel version.
3. The problem DOES seem to be related to Xen version, and to a specific module, the "credit-scheduler-2".
Reverting to any Xen prior to Xen 4.12 fixes the problem (thank you Olaf!) but that's suboptimal in terms of wanting to run the latest software versions (or, more to the point, the production versions that come with the Leap releases.)
With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:
GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"
Right.
Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important.
"Us" being? openSUSE? Well, I guess that if upstream changes the default, we'll do the same, unless there are very good reasons not to. But I don't think this is what we should focus on at this stage...
I'm now inquiring of their users list when and how to file a bug report for this, and I'll continue to try to work with them, but I wanted to get this back to this group and list in case anyone else needs this info, and/or in case anyone here has any comments or additional guidance.
You have done a good job at reporting a bug on the Xen developer mailing list. The issue managed (although after a little while, but not at all through any fault of yours) to catch the Xen scheduler developers' and maintainers' attention (Juergen Gross and myself :-)). So, nothing much more to say than <<Thanks! Keep up the good work of reporting nasty issues!>> :-)

Speaking of that (I mean, of keeping up the good work): as asked on the upstream ML already, if/when you still have the chance to reproduce the problematic situation when running on Credit2, we'll be very happy to see some more logs.

Also, it's more than ok to continue this conversation here, but since it is an upstream issue, please do report any updates and logs that you can capture directly upstream (i.e., on the xen-devel mailing list).

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Hello Dario - Thank you for this email! On Wed, Feb 19, 2020 at 5:36 AM Dario Faggioli <dfaggioli@suse.com> wrote:
Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important. "Us" being? openSUSE? Well, I guess that if upstream changes the default, we'll do the same, unless there are very good reasons not to.
Yes. Please pardon my imprecise wording... this has been a months-long journey for me, and my fatigue was showing.
You have done a good job at reporting a bug on the Xen developer mailing list. The issue managed (although after a little while, but not at all through any fault of yours) to catch the Xen scheduler developers' and maintainers' attention (Juergen Gross and myself :-)). So, nothing much more to say than <<Thanks! Keep up the good work of reporting nasty issues!>> :-)
Thank you!
Speaking of that (I mean, of keeping up the good work): as asked on the upstream ML already, if/when you still have the chance to reproduce the problematic situation when running on Credit2, we'll be very happy to see some more logs. Also, it's more than ok to continue this conversation here, but since it is an upstream issue, please do report any updates and logs that you can capture directly upstream (i.e., on the xen-devel mailing list).
Will do on both counts.

Best regards,

Glen
On Mon, Feb 17, 2020 at 6:39 PM Glen <glenbarney@gmail.com> wrote:
Dear OpenSuse Team: On Thu, Feb 13, 2020 at 12:44 PM Glen <glenbarney@gmail.com> wrote:
Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.

With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:

   GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"

Adding that last parameter causes Xen to boot with the older "credit scheduler" instead of the newer "credit2 scheduler", and that seems to resolve the problem for everyone. (I'm still running longer stress tests on my guests, but the results are encouraging so far.)
It's been a week, during which I've stress-tested a number of guests on OpenSuse 15.1 + Xen 4.12.1 + the boot option sched=credit. All guests survived without any issue at all. Others on the Xen lists are reporting similar successes. I've completed my client migration, and those guests are also surviving nicely. So I'm tagging this [SOLVED] and tossing in a few subject line keywords for the archives, to help others find this. I've now brought one of my (no longer in use) hosts back to credit2, so I can start crashing that guest and hopefully capture debugging information from it. I'll report that data directly to xen-devel, since this seems not to be an OpenSuse problem.

Apart from OpenSuse possibly making "sched=credit" a Xen boot default now, via an update or whatever, there is one other small item of interest for this group. It has been suggested ( https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00052.html ) that Xen 4.12.2 contains fixes to these and other problems. I do not know how these things work within OpenSuse, but I wonder if bringing in Xen 4.12.2 as an update to OpenSuse 15.1 might be a useful thing to do? I throw that out there for all of you who know much more about these things than I. A sketch of how to watch for that update follows at the end of this message.

Anyway, I expect this to be my last message on this topic to this group, but I wanted to thank all of you who responded to me over the past months, again, for your help and patience. Everyone is busy, and I was and am very grateful for your time!

Best regards,

Glen
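A small sketch for watching whether a newer Xen (e.g. 4.12.2) lands in the 15.1 update repositories; the zypper syntax is standard, but whether and when the update appears depends on the maintenance schedule:

   zypper refresh
   zypper se -s -x xen    # exact-match search listing every available xen version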
Hi again Glen, On Mon, 2020-02-24 at 08:06 -0800, Glen wrote:
I've completed my client migration, and those guests are also surviving nicely. So I'm tagging this [SOLVED] and tossing in a few subject line keywords for the archives, to help others find this. I've now brought one of my (no longer in use) hosts back to credit2, so I can start crashing that guest and hopefully capture debugging information from it.
That is great to hear... looking forward to some more info and logs on the subject.
I'll report that data directly to xen-devel since that seems to not be an OpenSuse problem.
Yep, this is definitely an upstream issue, and we definitely should talk on upstream MLs. It would be useful, though, to have a bug opened for it for better tracking (see below).
Apart from OpenSuse possibly making "sched=credit" a Xen boot default now, via an update or whatever, there is one other small item of interest for this group. It has been suggested ( https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00052.html ) that Xen 4.12.2 contains fixes to these and other problems. I do not know how these things work within OpenSuse, but I wonder if bringing in Xen 4.12.2 as an update to OpenSuse 15.1 might be a useful thing to do? I throw that out there for all of you who know much more about these things than I.
In general, fixes are applied (backported). Whatever the fix for this turns out to be - really reverting to "sched=credit", or actual changes to Credit2 (or whatever else) - I think you can expect it to be backported to any supported distribution.
Anyway, I expect this to be my last message on this topic to this group, but wanted to thank all of you who responded to me over the past months again for your help and patience.
Actually, thank you again for reporting the problem and for your willingness to help us find the solution.

As I mentioned above, would you be willing to open a bug on https://bugzilla.opensuse.org/ about this? It is indeed something we'll deal with upstream, but I think it would still be useful. In the description of the problem, you can mention the fact that the problem is being investigated, and link the threads on the xen-devel and xen-users mailing lists.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
On Thu, Feb 27, 2020 at 2:57 AM Dario Faggioli <dfaggioli@suse.com> wrote:
Hi again Glen,
Hello! Thank you so much for your email!
That is great to hear... looking forward for some more info and logs on the subject.
I shall be sure to include you.
Yep, this is definitely an upstream issue, and we definitely should talk on upstream MLs. It would be useful, though, to have a bug opened for it for better tracking (see below).
Acknowledged.
In general, fixes are applied (backported). Whatever the fix for this would be to really revert to "sched=credit" or some actual changes to Credit2 (or whatever else), I think you can expect for it to be backported to any supported distribution.
Thank you!
Actually, thank you again for reporting the problem and for your willingness to help us find the solution.
Thank you for your kindness! I've been using OpenSuse forever, and it always just works. This is the first time I've had to deal with something this serious, and I am so grateful to all of the members of the OpenSuse community for their kindness and patience as I worked through this!
As I mentioned above, would you be willing to open a bug on https://bugzilla.opensuse.org/ about this? It is indeed something we'll deal with upstream, but I think it would still be useful. In the description of the problem, you can mention the fact that the problem is being investigated, and link the threads on the xen-devel and xen-users mailing lists.
I have done as you have requested, and the link to the bug is here: https://bugzilla.opensuse.org/show_bug.cgi?id=1165206

Thank you!

Glen