Greetings all:

I have a number of Xen hosts, and Xen guests on those hosts, all of which have run reliably for users under 42.3 (and earlier 42.x releases) for years. Until recently, all hosts and guests were at 42.3, with all normal zypper updates applied, and running fine.

When the time came to upgrade to 15.1, I proceeded by upgrading the physical hosts to 15.1 first. Following that step, two of my largest and highest-volume 42.3 guests - on two entirely different physical hosts - started crashing every few days. The largest one crashes the most frequently, so I'll focus on it.

The physical host is a Dell R520 with (Xen showing) 32 CPUs and 128GB of RAM.

Linux php1 4.12.14-lp151.28.32-default #1 SMP Wed Nov 13 07:50:15 UTC 2019 (6e1aaad) x86_64 x86_64 x86_64 GNU/Linux

(XEN) Xen version 4.12.1_04-lp151.2.6 (abuild@suse.de) (gcc (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Tue Nov 5 15:20:06 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin

The guest is the only guest on this host. (For legacy reasons, it uses physical partitions on the host directly rather than file-backed storage, but I don't believe that should be an issue...)

name="ghv1"
description="ghv1"
uuid="c77f49c6-1f72-9ade-4242-8f18e72cbb32"
memory=124000
maxmem=124000
vcpus=24
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
extra="elevator=noop"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
disk=[ '/dev/sda3,,xvda1,w', '/dev/sda5,,xvda2,w', '/dev/sda6,,xvda3,w', '/dev/sdb1,,xvdb1,w', ]
vif=[ 'mac=00:16:3e:75:92:4a,bridge=br0', 'mac=00:16:3e:75:92:4b,bridge=br1', ]
vfb=['type=vnc,vncunused=1']

It runs:

Linux ghv1 4.4.180-102-default #1 SMP Mon Jun 17 13:11:23 UTC 2019 (7cfa20a) x86_64 x86_64 x86_64 GNU/Linux

A typical xentop looks like this:

xentop - 07:13:03   Xen 4.12.1_04-lp151.2.6
3 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 134171184k total, 132922412k used, 1248772k free    CPUs: 32 @ 2100MHz
      NAME  STATE  CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS  VBD_OO   VBD_RD  VBD_WR VBD_RSECT VBD_WSECT SSID
  Domain-0  -----r      607   12.9    4194304    3.1   no limit       n/a     4    0        0        0    0       0        0       0         0         0    0
      ghv1  -----r    18351  246.5  126976000   94.6  126977024      94.6    24    2   319108  3240011    4       0  1132578  205040  31572906   8389002    0
  Xenstore  --b---        0    0.0      32760    0.0    1341440       1.0     1    0        0        0    0       0        0       0         0         0    0

This guest is high volume. It runs web servers, mailing list servers, databases, and Docker containers, and it is regularly and constantly backed up via rsync over ssh. It is still at 42.3. As mentioned above, when its host was also at 42.3, it ran flawlessly. Only after upgrading the host to 15.1 did these problems start.

What happens is this: after between 2 and 10 days of uptime, the guest will start to malfunction, with the following symptoms:

1. All network interfaces (there are two: one main, and one local 192.168.x.x) will disconnect.

2. The guest will exhibit a number of sshd processes apparently running at high CPU. These processes cannot be killed.

3. The guest console will be filled with messages like this:

kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]

These messages print in groups of 2-3 every 1-2 seconds. There is no pattern to the CPU IDs; all CPUs appear to be involved.

4. It will become impossible to log in to the guest console.

5. If I already have a high-priority shell logged in on the console, I can run some commands (like sync), but I cannot cause the guest to shut down (init 0, for example, hangs the console, but the guest does not exit). I can issue kill commands as hinted above, but they are ignored.

6. xl shutdown is also ineffective. I must xl destroy the guest and re-create it.
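For completeness, the recovery I end up doing from Dom0 amounts to roughly the following (the config file path is only illustrative of where a guest definition might live):

  # "xl shutdown ghv1" never completes, so forcibly tear the wedged guest down
  xl destroy ghv1

  # then bring it back up from its config file (path shown is illustrative)
  xl create /etc/xen/vm/ghv1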
The guest logs show things like the following (I've removed the "kernel:" prefix and the timestamps just to make this clearer):

INFO: rcu_sched self-detected stall on CPU
 8-...: (15000 ticks this GP) idle=b99/140000000000001/0 softirq=12292658/12292658 fqs=13805
 (t=15001 jiffies g=8219341 c=8219340 q=139284)
Task dump for CPU 8:
sshd            R  running task        0   886      1 0x0000008c
 ffffffff81e79100 ffffffff810f10c5 ffff881dae01b300 ffffffff81e79100
 0000000000000000 ffffffff81f67e60 ffffffff810f8575 ffffffff81105d2a
 ffff88125e810280 ffff881dae003d40 0000000000000008 ffff881dae003d08
Call Trace:
 [<ffffffff8101b0c9>] dump_trace+0x59/0x350
 [<ffffffff8101b4ba>] show_stack_log_lvl+0xfa/0x180
 [<ffffffff8101c2b1>] show_stack+0x21/0x40
 [<ffffffff810f10c5>] rcu_dump_cpu_stacks+0x75/0xa0
 [<ffffffff810f8575>] rcu_check_callbacks+0x535/0x7f0
 [<ffffffff811010c2>] update_process_times+0x32/0x60
 [<ffffffff8110fd00>] tick_sched_handle.isra.17+0x20/0x50
 [<ffffffff8110ff78>] tick_sched_timer+0x38/0x60
 [<ffffffff81101cf3>] __hrtimer_run_queues+0xf3/0x2a0
 [<ffffffff81102179>] hrtimer_interrupt+0x99/0x1a0
 [<ffffffff8100d1dc>] xen_timer_interrupt+0x2c/0x170
 [<ffffffff810e39ec>] __handle_irq_event_percpu+0x4c/0x1d0
 [<ffffffff810e3b90>] handle_irq_event_percpu+0x20/0x50
 [<ffffffff810e7407>] handle_percpu_irq+0x37/0x50
 [<ffffffff810e3174>] generic_handle_irq+0x24/0x30
 [<ffffffff8142dce8>] __evtchn_fifo_handle_events+0x168/0x180
 [<ffffffff8142aec9>] __xen_evtchn_do_upcall+0x49/0x80
 [<ffffffff8142cb4c>] xen_evtchn_do_upcall+0x2c/0x50
 [<ffffffff81655c6e>] xen_do_hypervisor_callback+0x1e/0x40
DWARF2 unwinder stuck at xen_do_hypervisor_callback+0x1e/0x40
Leftover inexact backtrace:
 <IRQ> <EOI>
 [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
 [<ffffffff81115e63>] ? smp_call_function_many+0x203/0x260
 [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
 [<ffffffff81115f26>] ? on_each_cpu+0x36/0x70
 [<ffffffff81074078>] ? flush_tlb_kernel_range+0x38/0x60
 [<ffffffff811a8c17>] ? __alloc_pages_nodemask+0x117/0xbf0
 [<ffffffff811fd14a>] ? kmem_cache_alloc_node_trace+0xaa/0x4d0
 [<ffffffff811df823>] ? __purge_vmap_area_lazy+0x313/0x390
 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670
 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
 [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed

The above comes in all at once. Then, every second or two thereafter, I see this:

NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
Modules linked in: ipt_REJECT nf_reject_ipv4 binfmt_misc veth nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype ip6table_filter ip6_tables xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay af_packet iscsi_ibft iscsi_boot_sysfs intel_rapl sb_edac edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev xen_fbfront drbg fb_sys_fops syscopyarea sysfillrect xen_kbdfront ansi_cprng sysimgblt xen_netfront aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper cryptd nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
task: ffff881a44554ac0 ti: ffff8807b7d34000 task.ti: ffff8807b7d34000
RIP: e030:[<ffffffff810013ac>]  [<ffffffff810013ac>] xen_hypercall_sched_op+0xc/0x20
RSP: e02b:ffff8807b7d37c10  EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810013ac
RDX: 0000000000000000 RSI: ffff8807b7d37c30 RDI: 0000000000000003
RBP: 0000000000000071 R08: 0000000000000000 R09: ffff880191804908
R10: ffff880191804ab8 R11: 0000000000000206 R12: ffffffff8237c178
R13: 0000000000440000 R14: 0000000000000100 R15: 0000000000000000
FS:  00007ff9142bd700(0000) GS:ffff881dae200000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffedcb82f56 CR3: 0000001a1d860000 CR4: 0000000000040660
Stack:
 0000000000000000 00000000fffffffa ffffffff8142bd40 0000007400000003
 ffff8807b7d37c2c ffffffff00000001 0000000000000000 ffff881dae2120d0
 ffffffff81015b07 00000003810d34e4 ffffffff8237c178 ffff881dae21afc0
Call Trace:
Inexact backtrace:
 [<ffffffff8142bd40>] ? xen_poll_irq_timeout+0x40/0x50
 [<ffffffff81015b07>] ? xen_qlock_wait+0x77/0x80
 [<ffffffff810d3637>] ? __pv_queued_spin_lock_slowpath+0x227/0x260
 [<ffffffff8119edb4>] ? queued_spin_lock_slowpath+0x7/0xa
 [<ffffffff811df626>] ? __purge_vmap_area_lazy+0x116/0x390
 [<ffffffff810ac942>] ? ___might_sleep+0xe2/0x120
 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670
 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
 [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
Code: 41 53 48 c7 c0 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 48 c7 c0 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51

After about 30 seconds or so, I note that there is a slight shift, in that this line:

CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1

changes to something like:

CPU: 15 PID: 1357 Comm: sshd Tainted: G             L  4.4.180-102-default #1

The above log group continues to repeat, every few seconds, forever, until I kill the guest.

The physical host is not impacted. It remains up, alive, connected to its networks, and functioning properly. The only output I get on the physical host is a one-time report:

vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state

Steps I have taken:

1. I initially thought this might be a problem in openssh. There are reports on the net about a vulnerability in openssh versions prior to 7.3 (42.3 is at 7.2p2) in which a long string can be sent to sshd from the outside world and cause it to spin (and lock up) out of control. I disabled that version of sshd on the guest and installed the (then) latest version of openssh, 8.1p1. The problem persisted.

2. I have tried ifdown/ifup from within the guest to try to make the network reconnect, to no avail.

3. I have tried to unplug and replug the guest network from the host, to make the network reconnect, also to no avail.

4. Thinking that this might be related to recent reports of issues with grant tables in the blkfront driver, I checked usage on the DomU while it was spinning:

/usr/sbin/xen-diag gnttab_query_size 6
domid=6: nr_frames=15, max_nr_frames=32

So it doesn't seem to be related to that issue. (The DomID was 6 because of four crashes since the last physical host reboot, ugh.) I have raised the host's maximum grant frames to 256, as a number of people online recommended, but only did that this morning. I now see:

/usr/sbin/xen-diag gnttab_query_size 2
domid=2: nr_frames=14, max_nr_frames=256

but again, grant-frame exhaustion doesn't *seem* to have happened here... although I could be wrong.
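For reference, raising that limit on the host amounted to roughly the following, plus a host reboot (GRUB_CMDLINE_XEN_DEFAULT is, as I understand it, the variable openSUSE's grub2 setup uses for Xen hypervisor options; double-check on your own system):

  # /etc/default/grub on the Dom0: append gnttab_max_frames to the Xen options
  GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256"

  # regenerate the grub configuration; the new limit takes effect after the host reboots
  grub2-mkconfig -o /boot/grub2/grub.cfg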
Because of the nature of the problem, the Xen on_crash action isn't triggered. The host can't tell that the guest has crashed - and it really hasn't crashed; it's just spinning, eating up CPU. The only thing I can do is destroy the guest and recreate it.

So where I am now is that I'm remotely polling the machine from distant lands, every 60 seconds, and having myself paged every time there is a crash, in the hope that I can try something else... but I am now out of "something elses" to try.
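The polling itself is nothing sophisticated - conceptually it is just a loop along these lines (the hostname and alert address are placeholders, not the real ones):

  # crude remote watchdog: page me if the guest stops answering
  # (ghv1.example.com and pager@example.com are placeholders)
  while true; do
      if ! ping -c 3 -W 5 ghv1.example.com >/dev/null 2>&1; then
          echo "ghv1 not responding at $(date)" | mail -s "ghv1 DOWN" pager@example.com
      fi
      sleep 60
  done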
The guest in question is a high-profile, high-usage guest for a client that expects 24/7 uptime... so this is, to me, a rather serious problem.

I realize that the solution here may be "just upgrade the guest to 15.1"; however, I have two problems:

1. I cannot upgrade the guest until I have support from my customer's staff, who can address their software compatibility issues pertaining to the differences in Python, PHP, etc., between 42.3 and 15.1... so I'm stuck here for a while.

2. In the process of running a new 15.1 guest on yet a third, different 15.1 host, I experienced a lockup on that guest as well - one which had no log entries at all and may be unrelated; however, it, too, was only running network/disk-intensive rsyncs at the time. I may need to post a separate thread about that later; I'm not done taking debugging steps there yet.

In short, I'm out of options. It seems to me that running a 42.3 guest on a 15.1 host should work, yet I am having these crashes.

Thank you in advance for any help/guidance/pointers/cluebats.

Glen

--
To unsubscribe, e-mail: opensuse-virtual+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-virtual+owner@opensuse.org