On Fri, Dec 20, 2019 at 8:27 AM Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, 20 Dec 2019 08:05:12 -0800, Glen <glenbarney@gmail.com> wrote:
>
> > It seems to me that running a 42.3 guest on a 15.1 host should work.
>
> This is likely true, just like SLE12-SP3-LTSS domUs ought to run fine on
> SLE15 hosts. If the domU crashes anyway, try either the SLE12-SP4 or the
> SLE12-SP5 kernel in the domU:
>
>   zypper ar -cf https://download.opensuse.org/repositories/Kernel:/SLE12-SP5/standard SLE12-SP5
>   zypper ref -r SLE12-SP5
>   zypper dup --from SLE12-SP5
>
> Olaf
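For reference, a quick way to confirm after the dup and a reboot that the domU really picked up the Kernel:SLE12-SP5 build (generic zypper/rpm/uname checks, not part of Olaf's instructions):

  zypper lr -u | grep SLE12-SP5    # the add-on repo is registered
  rpm -q kernel-default            # the SLE12-SP5 kernel package is installed
  uname -r                         # after the reboot, should report the 4.12.14-based SLE12-SP5 kernel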
Greetings:

Well, we did much better this time, in that we got about 14 days of
operational uptime before there was a problem. However, this morning my
guest went into the weeds again.

Interestingly, under the SLE12-SP5 kernel, the behavior was somewhere
between different and better, in that the network did not disconnect, and
the guest was able to get most of the way through a shutdown (whereas under
the older stock kernel, the network would just drop off and the guest would
fail).

Here is the dump I am getting this time on the guest:

NMI watchdog: BUG: soft lockup - CPU#4 stuck for 67s! [sshd:17376]
Modules linked in: ipt_REJECT binfmt_misc xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo nf_conntrack_ipv6 nf_defrag_ipv6 xt_addrtype br_netfilter bridge stp llc xt_tcpudp xt_pkttype xt_conntrack ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c overlay iptable_filter ip_tables x_tables af_packet iscsi_ibft iscsi_boot_sysfs xen_fbfront syscopyarea sysfillrect joydev intel_rapl sysimgblt xen_netfront fb_sys_fops xen_kbdfront sb_edac crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unreleased kernel
CPU: 4 PID: 17376 Comm: sshd Not tainted 4.12.14-88.g6c5578e-default #1 SLE12-SP5 (unreleased)
task: ffff881d9dc89400 task.stack: ffffc9004bfb4000
RIP: e030:smp_call_function_many+0x224/0x280
RSP: e02b:ffffc9004bfb7bf8 EFLAGS: 00000202
RAX: 0000000000000003 RBX: ffffffff8107c4a0 RCX: ffffe8ffffd02d40
RDX: 0000000000000014 RSI: 0000000000000018 RDI: ffff8801916890c0
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880191689e40
R10: ffff8801916890c0 R11: 0000000000000000 R12: 0000000000000018
R13: ffff881dadf243c0 R14: 0000000000024380 R15: 0000000000000018
FS:  00007ff40a59d700(0000) GS:ffff881dadf00000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe1a0b0f56 CR3: 0000001ab440a000 CR4: 0000000000040660
Call Trace:
 ? do_kernel_range_flush+0x50/0x50
 on_each_cpu+0x28/0x60
 flush_tlb_kernel_range+0x38/0x60
 __purge_vmap_area_lazy+0x49/0xb0
 vm_unmap_aliases+0x105/0x140
 change_page_attr_set_clr+0xb1/0x2e0
 set_memory_ro+0x2d/0x40
 bpf_int_jit_compile+0x2d4/0x3b4
 bpf_prog_select_runtime+0xb9/0xf0
 bpf_prepare_filter+0x56d/0x5f0
 ? kmemdup+0x32/0x40
 ? watchdog_nmi_disable+0x70/0x70
 bpf_prog_create_from_user+0xaa/0xf0
 do_seccomp+0xe1/0x610
 ? security_task_prctl+0x52/0x90
 SyS_prctl+0x502/0x5c0
 do_syscall_64+0x74/0x160
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7ff40902f6da
RSP: 002b:00007ffe1a0afcd8 EFLAGS: 00000246 ORIG_RAX: 000000000000009d
RAX: ffffffffffffffda RBX: 00005629e96d5b50 RCX: 00007ff40902f6da
RDX: 00005629e9919720 RSI: 0000000000000002 RDI: 0000000000000016
RBP: 00007ffe1a0afd20 R08: 0000000000000000 R09: 0000000000000005
R10: 00007ff40902f6da R11: 0000000000000246 R12: 00005629e96d5b50
R13: 00005629e96d5b50 R14: 0000000000000000 R15: 0000000000000000
Code: 93 2a 00 39 05 b2 80 10 01 89 c2 0f 86 61 fe ff ff 48 98 49 8b 4d 00 48 03 0c c5 40 c5 ed 81 8b 41 18 a8 01 74 09 f3 90 8b 41 18 <a8> 01 75 f7 eb bd 0f b6 4c 24 04 48 83 c4 08 48 89 de 5b 48 89

In addition to this, under the SLE kernel, I was also seeing groups of the
following messages on the guest:
BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!
Showing busy workqueues and worker pools:
workqueue writeback: flags=0x4e
  pwq 48: cpus=0-23 flags=0x4 nice=0 active=1/256
    in-flight: 24247:wb_workfn
workqueue kblockd: flags=0x18
  pwq 41: cpus=20 node=0 flags=0x0 nice=-20 active=1/256
    pending: blk_mq_run_work_fn
  pwq 9: cpus=4 node=0 flags=0x0 nice=-20 active=1/256
    pending: blk_mq_run_work_fn
pool 48: cpus=0-23 flags=0x4 nice=0 hung=13s workers=4 idle: 19861 19862 14403

Both of the above groups repeat every 15-30 seconds.

I tried sending the NMI trigger to the guest from the host, which I
mentioned in my previous thread:

  xl trigger 2 nmi

And while the kernel did respond, the system did not "recover" - presumably
because it wasn't "down" in the same way. Because the network didn't go
down, there was no detection of any problem on the host, and the host
wasn't even aware that the guest was having issues.

Because the upgraded 15.1 machine which I mentioned in my other thread was
also having the occasional issue, I am working on building a new guest on
15.1 from scratch, but that is going to take time, and I am worried that I
keep losing this production guest in the meantime.

I would be most grateful if anyone could point me to something useful in
the above output, or suggest any next steps.

Thank you!

Glen
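On the "next steps" question, one generic way to capture more data the next time the guest wedges is sketched below. It assumes the SUSE kdump tooling is available in the guest and that the domain ID is still 2, as in the xl trigger command above; none of this is confirmed against this exact setup.

  # Inside the guest: turn soft lockups and hung tasks into a panic so kdump
  # can write a vmcore instead of the guest limping along half-alive.
  # Assumes the kdump package is installed and crashkernel= is set on the
  # guest kernel command line.
  echo 'kernel.softlockup_panic = 1' >> /etc/sysctl.d/99-lockup-debug.conf
  echo 'kernel.hung_task_panic = 1'  >> /etc/sysctl.d/99-lockup-debug.conf
  sysctl --system
  systemctl enable --now kdump

  # On the dom0, while the guest is stuck: ask the guest kernel for task and
  # CPU backtraces via magic SysRq; the output lands on the guest console.
  xl sysrq 2 w     # show blocked (D-state) tasks
  xl sysrq 2 l     # show backtraces of all active CPUs
  xl console 2     # watch the guest console output (Ctrl-] detaches)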