On Fri, Dec 20, 2019 at 8:27 AM Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, 20 Dec 2019 08:05:12 -0800, Glen <glenbarney@gmail.com> wrote:
>
> > It seems to me that running a 42.3 guest on a 15.1 host should work.
>
> This is likely true, just like SLE12-SP3-LTSS domUs ought to run fine on
> SLE15 hosts. If the domU crashes anyway, try either the SLE12-SP4 or the
> SLE12-SP5 kernel in the domU:
>
>   zypper ar -cf https://download.opensuse.org/repositories/Kernel:/SLE12-SP5/standard SLE12-SP5
>   zypper ref -r SLE12-SP5
>   zypper dup --from SLE12-SP5
>
> Olaf
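For reference, a quick way to confirm after the dup and a reboot that the domU really picked up the Kernel:SLE12-SP5 build (generic zypper/rpm/uname checks, not part of Olaf's instructions):

  zypper lr -u | grep SLE12-SP5    # the add-on repo is registered
  rpm -q kernel-default            # the SLE12-SP5 kernel package is installed
  uname -r                         # after the reboot, should report the 4.12.14-based SLE12-SP5 kernel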
Greetings:

Well, we did much better this time, in that we got about 14 days of
operational uptime before there was a problem. However, this morning my
guest went into the weeds again.

Interestingly, under the SLE12-SP5 kernel, the behavior was somewhere
between different and better, in that the network did not disconnect, and
the guest was able to get most of the way through a shutdown (whereas under
the older stock kernel, the network would just drop off and the guest would
fail).

Here is the dump I am getting this time on the guest:

NMI watchdog: BUG: soft lockup - CPU#4 stuck for 67s! [sshd:17376]
Modules linked in: ipt_REJECT binfmt_misc xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo nf_conntrack_ipv6 nf_defrag_ipv6 xt_addrtype br_netfilter bridge stp llc xt_tcpudp xt_pkttype xt_conntrack ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c overlay iptable_filter ip_tables x_tables af_packet iscsi_ibft iscsi_boot_sysfs xen_fbfront syscopyarea sysfillrect joydev intel_rapl sysimgblt xen_netfront fb_sys_fops xen_kbdfront sb_edac crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unreleased kernel
CPU: 4 PID: 17376 Comm: sshd Not tainted 4.12.14-88.g6c5578e-default #1 SLE12-SP5 (unreleased)
task: ffff881d9dc89400 task.stack: ffffc9004bfb4000
RIP: e030:smp_call_function_many+0x224/0x280
RSP: e02b:ffffc9004bfb7bf8 EFLAGS: 00000202
RAX: 0000000000000003 RBX: ffffffff8107c4a0 RCX: ffffe8ffffd02d40
RDX: 0000000000000014 RSI: 0000000000000018 RDI: ffff8801916890c0
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880191689e40
R10: ffff8801916890c0 R11: 0000000000000000 R12: 0000000000000018
R13: ffff881dadf243c0 R14: 0000000000024380 R15: 0000000000000018
FS:  00007ff40a59d700(0000) GS:ffff881dadf00000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe1a0b0f56 CR3: 0000001ab440a000 CR4: 0000000000040660
Call Trace:
 ? do_kernel_range_flush+0x50/0x50
 on_each_cpu+0x28/0x60
 flush_tlb_kernel_range+0x38/0x60
 __purge_vmap_area_lazy+0x49/0xb0
 vm_unmap_aliases+0x105/0x140
 change_page_attr_set_clr+0xb1/0x2e0
 set_memory_ro+0x2d/0x40
 bpf_int_jit_compile+0x2d4/0x3b4
 bpf_prog_select_runtime+0xb9/0xf0
 bpf_prepare_filter+0x56d/0x5f0
 ? kmemdup+0x32/0x40
 ? watchdog_nmi_disable+0x70/0x70
 bpf_prog_create_from_user+0xaa/0xf0
 do_seccomp+0xe1/0x610
 ? security_task_prctl+0x52/0x90
 SyS_prctl+0x502/0x5c0
 do_syscall_64+0x74/0x160
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7ff40902f6da
RSP: 002b:00007ffe1a0afcd8 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00005629e96d5b50 RCX: 00007ff40902f6da
RDX: 00005629e9919720 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007ffe1a0afd20 R08: 0000000000000000 R09: 0000000000000005
R10: 00007ff40902f6da R11: 0000000000000246 R12: 00005629e96d5b50
R13: 00005629e96d5b50 R14: 0000000000000000 R15: 0000000000000000
Code: 93 2a 00 39 05 b2 80 10 01 89 c2 0f 86 61 fe ff ff 48 98 49 8b 4d 00 48 03 0c c5 40 c5 ed 81 8b 41 18 a8 01 74 09 f3 90 8b 41 18 <a8> 01 75 f7 eb bd 0f b6 4c 24 04 48 83 c4 08 48 89 de 5b 48 89

In addition to this, under the SLE kernel, I was also seeing groups of the
following messages on the guest:
BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!
Showing busy workqueues and worker pools:
workqueue writeback: flags=0x4e
  pwq 48: cpus=0-23 flags=0x4 nice=0 active=1/256
    in-flight: 24247:wb_workfn
workqueue kblockd: flags=0x18
  pwq 41: cpus=20 node=0 flags=0x0 nice=-20 active=1/256
    pending: blk_mq_run_work_fn
  pwq 9: cpus=4 node=0 flags=0x0 nice=-20 active=1/256
    pending: blk_mq_run_work_fn
pool 48: cpus=0-23 flags=0x4 nice=0 hung=13s workers=4 idle: 19861 19862 14403

Both of the above groups repeat every 15-30 seconds.

I tried sending the NMI trigger to the guest from the host, which I
mentioned in my previous thread:

  xl trigger 2 nmi

And while the kernel did respond, the system did not "recover" - presumably
because it wasn't "down" in the same way. Because the network didn't go
down, there was no detection of any problem on the host, and the host
wasn't even aware that the guest was having issues.

Because the upgraded 15.1 machine which I mentioned in my other thread was
also having the occasional issue, I am working on building a new guest on
15.1 from scratch, but that is going to take time, and I am worried that I
keep losing this production guest in the meantime.

I would be most grateful if anyone could point me to something useful in
the above output, or suggest any next steps.

Thank you!

Glen
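On the "next steps" question, one generic way to capture more data the next time the guest wedges is sketched below. It assumes the SUSE kdump tooling is available in the guest and that the domain ID is still 2, as in the xl trigger command above; none of this is confirmed against this exact setup.

  # Inside the guest: turn soft lockups and hung tasks into a panic so kdump
  # can write a vmcore instead of the guest limping along half-alive.
  # Assumes the kdump package is installed and crashkernel= is set on the
  # guest kernel command line.
  echo 'kernel.softlockup_panic = 1' >> /etc/sysctl.d/99-lockup-debug.conf
  echo 'kernel.hung_task_panic = 1'  >> /etc/sysctl.d/99-lockup-debug.conf
  sysctl --system
  systemctl enable --now kdump

  # On the dom0, while the guest is stuck: ask the guest kernel for task and
  # CPU backtraces via magic SysRq; the output lands on the guest console.
  xl sysrq 2 w     # show blocked (D-state) tasks
  xl sysrq 2 l     # show backtraces of all active CPUs
  xl console 2     # watch the guest console output (Ctrl-] detaches)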