Dear OpenSuse Team:

This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this.

What follows below is an update/new report on this problem, which is now no longer limited to just me, and which I can now duplicate even on fresh loads. Here we go:

Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.

Symptoms on the DomU Guest:

1. Guest machine performs normally until the moment of failure. No abnormal log/console entries exist.
2. At the moment of failure, the guest's network goes offline. No abnormal log/console entries are written at that moment.
3. Processes which were trying to connect to the network start to consume increasing amounts of CPU.
4. Load average of the guest starts to increase, continuing upward without apparent bound.
5. If a high-priority bash shell is left logged in on the guest hvc0 console, some commands might still be runnable; most are not.
6. If the guest console is not logged in, the console is frozen and doesn't even echo characters.
7. Some guests will output messages on the console like this:

   kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s!

8. On some others, I will also see output like:

   BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!

9. Sometimes there is no output at all on the console.

Symptoms on the Dom0 Host:

The host is unaffected. The only indications that anything is happening on the host are two log entries in /var/log/messages:

   vif vif-6-0 vif6.0: Guest Rx stalled
   br0: port 2(vif6.0) entered disabled state

Circumstances when the problem first occurred:

1. All hosts and guests were previously on OpenSuse 42.3 (Linux 4.4.180, Xen 4.9.4).
2. I upgraded one physical host to OpenSuse 15.1 (Linux 4.12.14, Xen 4.12.1).
3. The guest(s) started malfunctioning at that point.

Immediate steps taken while the guest was stalled, which did not help:

1. Tried to use high-priority shell on guest console to kill high-CPU processes; they were unkillable.
2. Tried to use guest console to stop and restart network; commands were unresponsive.
3. Tried to use guest console to shutdown/init 0. This caused the console to be terminated, but the guest would not otherwise shut down.
4. Tried to use host xl interface to unplug/replug network bridges. This appeared to work from the host side, but the guest was unaffected.

One thing which I accidentally discovered that *did* help:

1. Tried sending xl trigger nmi from the host to the guest.

When I trigger the stalled guest with an NMI, I get its attention. The guest will print the following on the console:

   Uhhuh. NMI received for unknown reason 00 on CPU 0.
   Do you have a strange power saving mode enabled?
   Dazed and confused, but trying to continue

In some cases (pattern not yet known), the guest will then immediately come back online: the network comes back, all processes slowly stop consuming CPU, and things return to normal. Existing network connections are obviously terminated, but new connections are accepted. In that case, it's like the guest just magically comes back to life.

When this works, the host log shows:

   vif vif-6-0 vif6.0: Guest Rx ready
   br0: port 2(vif6.0) entered blocking state
   br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled.

However, this is not reliable. In some cases, the guest will print those messages, but the processes will NOT recover, and the network will come back impaired, or not at all. When that happens, repeated NMIs do not help: if the guest doesn't recover the first time, it doesn't recover at all.
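For anyone who wants to try the NMI workaround without watching the log by hand, here is a minimal sketch of a Dom0-side watcher. It assumes the usual Xen naming convention in which interface vifN.M belongs to domain ID N, and that the stall message appears in /var/log/messages in the form quoted above. The function names and the "stall-watch" log tag are hypothetical, chosen only for illustration. Since the NMI does not always revive the guest, this is at best a stopgap for testing, not a fix.

```sh
#!/bin/sh
# Sketch only: react to "Guest Rx stalled" lines in the Dom0 log.
# Assumption: vifN.M belongs to domain ID N (standard Xen vif naming).

# Print the domain ID for a stalled-vif log line; print nothing otherwise.
stalled_domid() {
    printf '%s\n' "$1" |
        sed -n 's/.*vif\([0-9][0-9]*\)\.[0-9][0-9]*: Guest Rx stalled.*/\1/p'
}

# Follow the log (default /var/log/messages) and send one NMI per stall.
# Must run as root on the Dom0, where xl is available.
watch_for_stalls() {
    logfile=${1:-/var/log/messages}
    tail -F -n 0 "$logfile" | while IFS= read -r line; do
        domid=$(stalled_domid "$line")
        if [ -n "$domid" ]; then
            # "stall-watch" is a made-up syslog tag for this sketch.
            logger -t stall-watch "Guest Rx stalled on domain $domid, sending NMI"
            xl trigger "$domid" nmi
        fi
    done
}
```

To use it, source the file in a root shell on the Dom0 and call watch_for_stalls. The parsing is kept in its own function so it can be checked against sample log lines without touching a real guest.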
The *only* reliable way to fix this is to destroy the guest completely, and recreate it. The guest will then run fine... until the next stall. But of course a hard-destroy can't be a healthy thing for a guest machine, and that's really not a solution.

Long-term mitigation steps which were tried, and which did not help:

1. Thought this was an SSH bug (since sshd processes were consuming high CPU), installed latest OpenSSH.
2. Thought maybe this was a PV problem, tried under HVM instead of PV.
3. Noted a problem with grant frames, applied the recommended fix for that. My config now looks like:

   # xen-diag gnttab_query_size 0 # Domain-0
   domid=0: nr_frames=1, max_nr_frames=64
   # xen-diag gnttab_query_size 1 # Xenstore
   domid=1: nr_frames=4, max_nr_frames=4
   # xen-diag gnttab_query_size 6 # My guest
   domid=6: nr_frames=17, max_nr_frames=256

4. Thought maybe a kernel module might be at issue, reviewed list with OpenSuse team.
5. Thought this might be a kernel mismatch, was referred to a new kernel by OpenSuse team (4.12.13 for OpenSuse 42.3). That changed some of the console output behavior and logging, but did not solve the problem.
6. Thought this might be a general OS mismatch, tried upgrading the guest to OpenSuse 15.1/Linux 4.12.14/Xen 4.12.1. In this configuration, no console or log output is generated on the guest at all, it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just "wrong" with my guest, tried a fresh load of host, and a fresh guest. I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug.

Steps to reproduce:

1. Get a server. I'm using a Dell PowerEdge R720, but this has happened on several different Dell models. My current server has two 16-core CPUs, and 128GB of RAM.
2. Load OpenSuse 15.1 (which includes Xen 4.12.1) on the server. Boot it up in Xen Dom0/host mode.
3. Create a new guest machine, also with 15.1/4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end into some other Linux machine.
7. From that other machine, start pounding the guest. An rsync of the entire data partition is a great way to trigger this. If I run several outbound rsyncs together, I can crash my guest in under 48 hours. If I run 4 or 5, I can often crash the guest in just 2 hours.

If you don't want to damage your SSDs on your other machine, here's my current command. My host is 192.168.1.10 and my guest is 192.168.1.11, so I plug in some other machine, make it, say, 192.168.1.12, and then run:

   nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

Where /a is my directory full of user data. 4-6 of these running simultaneously will bring the guest to its knees in short order.

On my most recent test, I did the NMI trigger thing, and found this in the guest's /var/log/messages after sending the trigger (I've removed tagging and timestamping for clarity):

   Uhhuh. NMI received for unknown reason 00 on CPU 0.
   Do you have a strange powersaving mode enabled?
   Dazed and confused, but trying to continue
   clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
   clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask: ffffffffffffffff
   clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff
   tsc: Marking TSC unstable due to clocksource watchdog
   BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
   Showing busy workqueues and worker pools:
   workqueue events: flags=0x0
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
       pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
   workqueue mm_percpu_wq: flags=0x8
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
       pending: vmstat_update
   workqueue writeback: flags=0x4e
     pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
       in-flight: 28593:wb_workfn
   workqueue kblockd: flags=0x18
     pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
       pending: blk_mq_run_work_fn, blk_mq_timeout_work
   pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this: https://wiki.debian.org/Xen/Clocksource , which describes a guest hanging with the message "clocksource/0: Time went backwards". Although I did not see this message, and this is not directly on point with OpenSuse (since our /proc structure doesn't include some of the switches mentioned), I did notice clocksource references in the logs (see above), and that led me back to: https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/ch..., and specifically the tsc_mode setting. I have no idea if it's relevant, but since I'm out of ideas and have nothing better to try, I have now booted my guest into tsc_mode=1 and am stress testing it to see if it fares any better this way.

I had originally thought that I was the only person with this problem, and that's why I thought a fresh guest would fix it - the problem followed me around different servers, so that made sense. Over the past weeks I've set up a fresh guest on my fresh host, and, just on a whim, did the above stress testing on it... it lasted for 36 hours.
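For anyone repeating the stress test, the parallel tar-over-ssh streams from the reproduction steps could be started with a small launcher like the sketch below. The function name start_streams and the DRY_RUN switch are hypothetical additions for illustration; the address 192.168.1.11 and the /a directory are the ones from the steps above. The sketch redirects straight to /dev/null instead of piping through cat, which should behave the same here because tar runs on the guest and writes into the ssh connection.

```sh
#!/bin/sh
# Sketch only: start COUNT parallel read-only streams against the guest.
# Usage: start_streams GUEST_ADDRESS DIRECTORY COUNT
# With DRY_RUN=1 the commands are printed instead of being executed.
start_streams() {
    guest=$1 dir=$2 count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh $guest tar cf - --one-file-system $dir > /dev/null"
        else
            # Read everything under $dir on the guest and discard it locally,
            # so the test machine's disks are never written to.
            nohup ssh "$guest" tar cf - --one-file-system "$dir" > /dev/null 2>&1 &
        fi
        i=$((i + 1))
    done
}
```

For example, `start_streams 192.168.1.11 /a 5` would start five streams in the background, and `DRY_RUN=1 start_streams 192.168.1.11 /a 5` only prints what would be run.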
That led me to start searching the net again, and I found that, just in the past few weeks, another person has reported what seems to be the same problem, only he reported it to Xen-users (so I assume he's on a different distro - see https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html for the overall message). Nobody has responded to him yet... but I'm about to... I'm going to send this report there too. He states that the problem is limited to Xen 4.12 and 4.13, and that rolling back to Xen 4.11 solves the problem.

Meanwhile, I'm hoping that these updated details and history spark something new for some of you here. Do any of you have any ideas on this? Any thoughts, guidance, musings, etc., anything at all would be appreciated.

As I said, I'm sending this to Xen now too. But since it never occurred to me that I could upgrade 42.3 to a 4.12 kernel - that was a surprise to me! - I guess there is another question I'd like to ask this group: Is it possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that?

Again, thank you all for your patience and help, I am very grateful!

Glen