Greetings all:

I have a number of Xen hosts, and Xen guests on those hosts, all of which have run reliably for users under 42.3 (and earlier 42.x releases) for years. Until recently, all hosts and guests were at 42.3, with all normal zypper updates applied, and running fine.

When the time came to upgrade to 15.1, I proceeded by upgrading the physical hosts to 15.1 first. Following that step, two of my largest and highest-volume 42.3 guests - on two entirely different physical hosts - started crashing every few days. The largest one crashes the most frequently, so I'll focus on it.

The physical host is a Dell R520 with (Xen showing) 32 CPUs and 128GB of RAM.

Linux php1 4.12.14-lp151.28.32-default #1 SMP Wed Nov 13 07:50:15 UTC 2019 (6e1aaad) x86_64 x86_64 x86_64 GNU/Linux

(XEN) Xen version 4.12.1_04-lp151.2.6 (abuild@suse.de) (gcc (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Tue Nov 5 15:20:06 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin

The guest is the only guest on this host. (For legacy reasons, it uses physical partitions on the host directly rather than file-backed storage, but I don't believe that should be an issue...)

name="ghv1"
description="ghv1"
uuid="c77f49c6-1f72-9ade-4242-8f18e72cbb32"
memory=124000
maxmem=124000
vcpus=24
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
extra="elevator=noop"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
disk=[ '/dev/sda3,,xvda1,w', '/dev/sda5,,xvda2,w', '/dev/sda6,,xvda3,w', '/dev/sdb1,,xvdb1,w', ]
vif=[ 'mac=00:16:3e:75:92:4a,bridge=br0', 'mac=00:16:3e:75:92:4b,bridge=br1', ]
vfb=['type=vnc,vncunused=1']

It runs:

Linux ghv1 4.4.180-102-default #1 SMP Mon Jun 17 13:11:23 UTC 2019 (7cfa20a) x86_64 x86_64 x86_64 GNU/Linux

A typical xentop looks like this:

xentop - 07:13:03   Xen 4.12.1_04-lp151.2.6
3 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 134171184k total, 132922412k used, 1248772k free    CPUs: 32 @ 2100MHz
      NAME  STATE  CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS  VBD_OO   VBD_RD  VBD_WR VBD_RSECT VBD_WSECT SSID
  Domain-0  -----r      607   12.9    4194304    3.1   no limit       n/a     4    0        0        0    0       0        0       0         0         0    0
      ghv1  -----r    18351  246.5  126976000   94.6  126977024      94.6    24    2   319108  3240011    4       0  1132578  205040  31572906   8389002    0
  Xenstore  --b---        0    0.0      32760    0.0    1341440       1.0     1    0        0        0    0       0        0       0         0         0    0

This guest is high volume. It runs web servers, mailing list servers, databases, and Docker containers, and it is regularly and constantly backed up via rsync over ssh. It is still at 42.3. As mentioned above, when its host was also at 42.3, it ran flawlessly. Only after upgrading the host to 15.1 did these problems start.

What happens is this: after between 2 and 10 days of uptime, the guest will start to malfunction, with the following symptoms:

1. All network interfaces (there are two: one main, and one local 192.168.x.x) will disconnect.

2. The guest will exhibit a number of sshd processes apparently running at high CPU. These processes cannot be killed.

3. The guest console will be filled with messages like this:

kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]

These messages print in groups of 2-3 every 1-2 seconds. There is no pattern to the CPU IDs; all CPUs appear to be involved.

4. It will become impossible to log in to the guest console.

5. If I already have a high-priority shell logged in on the console, I can run some commands (like sync), but I cannot cause the guest to shut down (init 0, for example, hangs the console, but the guest does not exit). I can issue kill commands as hinted above, but they are ignored.

6. xl shutdown is also ineffective. I must xl destroy the guest and re-create it.
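For completeness, the recovery I end up doing from Dom0 amounts to roughly the following (the config file path is only illustrative of where a guest definition might live):

  # "xl shutdown ghv1" never completes, so forcibly tear the wedged guest down
  xl destroy ghv1

  # then bring it back up from its config file (path shown is illustrative)
  xl create /etc/xen/vm/ghv1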
The guest logs show things like the following (I've removed the "kernel:" prefix and the timestamps just to make this clearer):

INFO: rcu_sched self-detected stall on CPU
 8-...: (15000 ticks this GP) idle=b99/140000000000001/0 softirq=12292658/12292658 fqs=13805
 (t=15001 jiffies g=8219341 c=8219340 q=139284)
Task dump for CPU 8:
sshd            R  running task        0   886      1 0x0000008c
 ffffffff81e79100 ffffffff810f10c5 ffff881dae01b300 ffffffff81e79100
 0000000000000000 ffffffff81f67e60 ffffffff810f8575 ffffffff81105d2a
 ffff88125e810280 ffff881dae003d40 0000000000000008 ffff881dae003d08
Call Trace:
 [<ffffffff8101b0c9>] dump_trace+0x59/0x350
 [<ffffffff8101b4ba>] show_stack_log_lvl+0xfa/0x180
 [<ffffffff8101c2b1>] show_stack+0x21/0x40
 [<ffffffff810f10c5>] rcu_dump_cpu_stacks+0x75/0xa0
 [<ffffffff810f8575>] rcu_check_callbacks+0x535/0x7f0
 [<ffffffff811010c2>] update_process_times+0x32/0x60
 [<ffffffff8110fd00>] tick_sched_handle.isra.17+0x20/0x50
 [<ffffffff8110ff78>] tick_sched_timer+0x38/0x60
 [<ffffffff81101cf3>] __hrtimer_run_queues+0xf3/0x2a0
 [<ffffffff81102179>] hrtimer_interrupt+0x99/0x1a0
 [<ffffffff8100d1dc>] xen_timer_interrupt+0x2c/0x170
 [<ffffffff810e39ec>] __handle_irq_event_percpu+0x4c/0x1d0
 [<ffffffff810e3b90>] handle_irq_event_percpu+0x20/0x50
 [<ffffffff810e7407>] handle_percpu_irq+0x37/0x50
 [<ffffffff810e3174>] generic_handle_irq+0x24/0x30
 [<ffffffff8142dce8>] __evtchn_fifo_handle_events+0x168/0x180
 [<ffffffff8142aec9>] __xen_evtchn_do_upcall+0x49/0x80
 [<ffffffff8142cb4c>] xen_evtchn_do_upcall+0x2c/0x50
 [<ffffffff81655c6e>] xen_do_hypervisor_callback+0x1e/0x40
DWARF2 unwinder stuck at xen_do_hypervisor_callback+0x1e/0x40
Leftover inexact backtrace:
 <IRQ> <EOI>
 [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
 [<ffffffff81115e63>] ? smp_call_function_many+0x203/0x260
 [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
 [<ffffffff81115f26>] ? on_each_cpu+0x36/0x70
 [<ffffffff81074078>] ? flush_tlb_kernel_range+0x38/0x60
 [<ffffffff811a8c17>] ? __alloc_pages_nodemask+0x117/0xbf0
 [<ffffffff811fd14a>] ? kmem_cache_alloc_node_trace+0xaa/0x4d0
 [<ffffffff811df823>] ? __purge_vmap_area_lazy+0x313/0x390
 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670
 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
 [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed

The above comes in all at once. Then, every second or two thereafter, I see this:

NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
Modules linked in: ipt_REJECT nf_reject_ipv4 binfmt_misc veth nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype ip6table_filter ip6_tables xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay af_packet iscsi_ibft iscsi_boot_sysfs intel_rapl sb_edac edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev xen_fbfront drbg fb_sys_fops syscopyarea sysfillrect xen_kbdfront ansi_cprng sysimgblt xen_netfront aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper cryptd nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
task: ffff881a44554ac0 ti: ffff8807b7d34000 task.ti: ffff8807b7d34000
RIP: e030:[<ffffffff810013ac>]  [<ffffffff810013ac>] xen_hypercall_sched_op+0xc/0x20
RSP: e02b:ffff8807b7d37c10  EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810013ac
RDX: 0000000000000000 RSI: ffff8807b7d37c30 RDI: 0000000000000003
RBP: 0000000000000071 R08: 0000000000000000 R09: ffff880191804908
R10: ffff880191804ab8 R11: 0000000000000206 R12: ffffffff8237c178
R13: 0000000000440000 R14: 0000000000000100 R15: 0000000000000000
FS:  00007ff9142bd700(0000) GS:ffff881dae200000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffedcb82f56 CR3: 0000001a1d860000 CR4: 0000000000040660
Stack:
 0000000000000000 00000000fffffffa ffffffff8142bd40 0000007400000003
 ffff8807b7d37c2c ffffffff00000001 0000000000000000 ffff881dae2120d0
 ffffffff81015b07 00000003810d34e4 ffffffff8237c178 ffff881dae21afc0
Call Trace:
Inexact backtrace:
 [<ffffffff8142bd40>] ? xen_poll_irq_timeout+0x40/0x50
 [<ffffffff81015b07>] ? xen_qlock_wait+0x77/0x80
 [<ffffffff810d3637>] ? __pv_queued_spin_lock_slowpath+0x227/0x260
 [<ffffffff8119edb4>] ? queued_spin_lock_slowpath+0x7/0xa
 [<ffffffff811df626>] ? __purge_vmap_area_lazy+0x116/0x390
 [<ffffffff810ac942>] ? ___might_sleep+0xe2/0x120
 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670
 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
 [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
Code: 41 53 48 c7 c0 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 48 c7 c0 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51

After about 30 seconds or so, I note that there is a slight shift, in that this line:

CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1

changes to something like:

CPU: 15 PID: 1357 Comm: sshd Tainted: G             L  4.4.180-102-default #1

The above log group continues to repeat, every few seconds, forever, until I kill the guest.

The physical host is not impacted. It remains up, alive, connected to its networks, and functioning properly. The only output I get on the physical host is a one-time report:

vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state

Steps I have taken:

1. I initially thought this might be a problem in openssh. There are reports on the net about a vulnerability in openssh versions prior to 7.3 (42.3 is at 7.2p2) in which a long string can be sent to sshd from the outside world and cause it to spin (and lock up) out of control. I disabled that version of sshd on the guest and installed the (then) latest version of openssh, 8.1p1. The problem persisted.

2. I have tried ifdown/ifup from within the guest to try to make the network reconnect, to no avail.

3. I have tried to unplug and replug the guest network from the host, to make the network reconnect, also to no avail.

4. Thinking that this might be related to recent reports of issues with grant tables in the blkfront driver, I checked usage on the DomU while it was spinning:

/usr/sbin/xen-diag gnttab_query_size 6
domid=6: nr_frames=15, max_nr_frames=32

So it doesn't seem to be related to that issue. (The DomID was 6 because of four crashes since the last physical host reboot, ugh.) I have raised the host's maximum grant frames to 256, as a number of people online recommended, but only did that this morning. I now see:

/usr/sbin/xen-diag gnttab_query_size 2
domid=2: nr_frames=14, max_nr_frames=256

but again, grant-frame exhaustion doesn't *seem* to have happened here... although I could be wrong.
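For reference, raising that limit on the host amounted to roughly the following, plus a host reboot (GRUB_CMDLINE_XEN_DEFAULT is, as I understand it, the variable openSUSE's grub2 setup uses for Xen hypervisor options; double-check on your own system):

  # /etc/default/grub on the Dom0: append gnttab_max_frames to the Xen options
  GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256"

  # regenerate the grub configuration; the new limit takes effect after the host reboots
  grub2-mkconfig -o /boot/grub2/grub.cfg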
Because of the nature of the problem, the Xen on_crash action isn't triggered. The host can't tell that the guest has crashed - and it really hasn't crashed; it's just spinning, eating up CPU. The only thing I can do is destroy the guest and recreate it.

So where I am now is that I'm remotely polling the machine from distant lands, every 60 seconds, and having myself paged every time there is a crash, in the hope that I can try something else... but I am now out of "something elses" to try.
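The polling itself is nothing sophisticated - conceptually it is just a loop along these lines (the hostname and alert address are placeholders, not the real ones):

  # crude remote watchdog: page me if the guest stops answering
  # (ghv1.example.com and pager@example.com are placeholders)
  while true; do
      if ! ping -c 3 -W 5 ghv1.example.com >/dev/null 2>&1; then
          echo "ghv1 not responding at $(date)" | mail -s "ghv1 DOWN" pager@example.com
      fi
      sleep 60
  done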
The guest in question is a high-profile, high-usage guest for a client that expects 24/7 uptime... so this is, to me, a rather serious problem.

I realize that the solution here may be "just upgrade the guest to 15.1"; however, I have two problems:

1. I cannot upgrade the guest until I have support from my customer's staff, who can address their software compatibility issues pertaining to the differences in Python, PHP, etc., between 42.3 and 15.1... so I'm stuck here for a while.

2. In the process of running a new 15.1 guest on yet a third, different 15.1 host, I experienced a lockup on that guest as well - one which had no log entries at all and may be unrelated; however, it, too, was only running network/disk-intensive rsyncs at the time. I may need to post a separate thread about that later; I'm not done taking debugging steps there yet.

In short, I'm out of options. It seems to me that running a 42.3 guest on a 15.1 host should work, yet I am having these crashes.

Thank you in advance for any help/guidance/pointers/cluebats.

Glen

--
To unsubscribe, e-mail: opensuse-virtual+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-virtual+owner@opensuse.org