[opensuse-virtual] Xen 4.12 DomU hang / freeze / stall under high network/disk load (single thread now, and updates)
Dear OpenSuse Team: This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this. What follows below is an update/new report on this problem, which is now no longer limited to just me, and which I can now duplicate even on fresh loads. Here we go: Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity. Symptoms on the DomU Guest: 1. Guest machine performs normally until the moment of failure. No abnormal log/console entries exist. 2. At the moment of failure, the guest's network goes offline. No abnormal log/console entries are written at that moment. 3. Processes which were trying to connect to the network start to consume increasing amounts of CPU. 4. Load average of the guest starts to increase, continuing upward without apparent bound. 5. If a high-priority bash shell is left logged in on the guest hvc0 console, some commands might still be runnable; most are not. 6. If the guest console is not logged in, the console is frozen and doesn't even echo characters. 7. Some guests will output messages on the console like this: kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! 8. On some others, I will also see output like: BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s! 9. Sometimes there is no output at all on the console. Symptoms on the Dom0 Host: The host is unaffected. The only indication anything is happening on the host are two log entries in /var/log/messages: vif vif-6-0 vif6.0: Guest Rx stalled br0: port 2(vif6.0) entered disabled state Circumstances when the problem first occurred: 1. All hosts and guests were previously on OpenSuse 42.3 (Linux 4.4.180, Xen 4.9.4) 2. I upgraded one physical host to OpenSuse 15.1 (Linux 4.12.14, Xen 4.12.1). 3. The guest(s) started malfunctioning at that point. Immediate steps taken while the guest was stalled, which did not help: 1. Tried to use high-priority shell on guest console to kill high-CPU processes; they were unkillable. 2. Tried to use guest console to stop and restart network; commands were unresponsive. 3. Tried to use guest console to shutdown/init 0. This caused console to be terminated, but guest would not otherwise shutdown. 4. Tried to use host xl interface to unplug/replug network bridges. This appeared to work from host side, but guest was unaffected. One thing which I accidentally discovered that *did* help: 1. Tried ending xl trigger nmi from the host to the guest. When I trigger the stalled guest with an NMI, I get its attention. The guest will print the following on the console: Uhhuh. NMI received for unknown reason 00 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue In some cases (pattern not yet known), the guest will then immediately come back online: The network will come back online, and all processes will slowly stop consuming CPU, and things will return to normal. Existing network connections were obviously terminated, but new connections are accepted. In that case, it's like the guest just magically comes back to life. 
When this works, the host log shows:

   vif vif-6-0 vif6.0: Guest Rx ready
   br0: port 2(vif6.0) entered blocking state
   br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled. However, this is not reliable. In some cases, the guest will print those messages, but the processes will NOT recover, and the network will come back impaired, or not at all. When that happens, repeated NMIs do not help: if the guest doesn't recover the first time, it doesn't recover at all. The *only* reliable way to fix this is to destroy the guest completely and recreate it. The guest will then run fine... until the next stall. But of course a hard destroy can't be a healthy thing for a guest machine, and that's really not a solution.

Long-term mitigation steps which were tried, and which did not help:

1. Thought this was an SSH bug (since sshd processes were consuming high CPU); installed the latest OpenSSH.
2. Thought maybe this was a PV problem; tried running under HVM instead of PV.
3. Noted a problem with grant frames and applied the recommended fix for that; my config now looks like:

   # xen-diag gnttab_query_size 0      # Domain-0
   domid=0: nr_frames=1, max_nr_frames=64
   # xen-diag gnttab_query_size 1      # Xenstore
   domid=1: nr_frames=4, max_nr_frames=4
   # xen-diag gnttab_query_size 6      # My guest
   domid=6: nr_frames=17, max_nr_frames=256

4. Thought maybe a kernel module might be at issue; reviewed the module list with the OpenSuse team.
5. Thought this might be a kernel mismatch; was referred to a newer kernel by the OpenSuse team (4.12.13 for OpenSuse 42.3). That changed some of the console output behavior and logging, but did not solve the problem.
6. Thought this might be a general OS mismatch; tried upgrading the guest to OpenSuse 15.1/Linux 4.12.14/Xen 4.12.1. In this configuration, no console or log output is generated on the guest at all; it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just "wrong" with my guest; tried a fresh load of the host, and a fresh guest. I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug. Steps to reproduce (a condensed sketch follows at the end of this message):

1. Get a server. I'm using a Dell PowerEdge R720, but this has happened on several different Dell models. My current server has two 16-core CPUs and 128GB of RAM.
2. Load OpenSuse 15.1 (which includes Xen 4.12.1) on the server. Boot it up in Xen Dom0/host mode.
3. Create a new guest machine, also with 15.1/4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end into some other Linux machine.
7. From that other machine, start pounding the guest. An rsync of the entire data partition is a great way to trigger this. If I run several outbound rsyncs together, I can crash my guest in under 48 hours. If I run 4 or 5, I can often crash the guest in just 2 hours. If you don't want to damage the SSDs on your other machine, here's my current command (my host is 192.168.1.10 and my guest is 192.168.1.11, so I plug in some other machine, make it, say, 192.168.1.12, and then run):

   nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

   where /a is my directory full of user data. 4-6 of these running simultaneously will bring the guest to its knees in short order.

On my most recent test, I did the NMI trigger thing, and found this in the guest's /var/log/messages after sending the trigger (I've removed tagging and timestamps for clarity):

   Uhhuh. NMI received for unknown reason 00 on CPU 0.
   Do you have a strange power saving mode enabled?
   Dazed and confused, but trying to continue
   clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
   clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask: ffffffffffffffff
   clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff
   tsc: Marking TSC unstable due to clocksource watchdog
   BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
   Showing busy workqueues and worker pools:
   workqueue events: flags=0x0
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
       pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
   workqueue mm_percpu_wq: flags=0x8
     pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
       pending: vmstat_update
   workqueue writeback: flags=0x4e
     pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
       in-flight: 28593:wb_workfn
   workqueue kblockd: flags=0x18
     pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
       pending: blk_mq_run_work_fn, blk_mq_timeout_work
   pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this: https://wiki.debian.org/Xen/Clocksource , which describes a guest hanging with the message "clocksource/0: Time went backwards". Although I did not see this message, and this is not directly on point for OpenSuse (since our /proc structure doesn't include some of the switches mentioned), I did notice the clocksource references in the logs (see above), and that led me back to https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/ch..., and specifically the tsc_mode setting. I have no idea if it's relevant, but since I'm out of ideas and have nothing better to try, I have now booted my guest with tsc_mode=1 and am stress testing it to see if it fares any better this way.

I had originally thought that I was the only person with this problem, and that's why I thought a fresh guest would fix it - the problem followed me around different servers, so that made sense. Over the past weeks I've set up a fresh guest on my fresh host, and, just on a whim, did the above stress testing on it... it lasted for 36 hours. That led me to start searching the net again, and I found that, just in the past few weeks, another person has reported what seems to be the same problem, only he reported it to xen-users (so I assume he's on a different distro - see https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html for the overall message). Nobody has responded to him yet... but I'm about to... I'm going to send this report there too. He states that the problem is limited to Xen 4.12 and 4.13, and that rolling back to Xen 4.11 solves the problem.

Meanwhile, I'm hoping that these updated details and history spark something new for some of you here. Do any of you have any ideas on this? Any thoughts, guidance, musings, etc. - anything at all would be appreciated. As I said, I'm sending this to Xen now too. But since it never occurred to me that I could upgrade 42.3 to a 4.12 kernel - that was a surprise to me! - I guess there is another question I'd like to ask this group: Is it possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that?

Again, thank you all for your patience and help, I am very grateful!

Glen
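A condensed sketch of the stress test and recovery trigger described above, for anyone trying to reproduce this. The IP addresses, the /a data path, and domid 6 are from my setup and are only placeholders; xl list and xl trigger are standard xl subcommands:

   # On the helper machine (192.168.1.12): launch 5 parallel full-tree pulls
   # of the guest's data partition, discarding the bytes.
   for i in 1 2 3 4 5; do
       nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
   done

   # On the Dom0 host, once the guest has stalled:
   xl list                      # confirm the domid of the stalled guest (6 here)
   xl trigger 6 nmi             # send the NMI that sometimes revives the guest
   tail -f /var/log/messages    # watch for "Guest Rx ready" on the vif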
On Thu, 13 Feb 2020 12:44:12 -0800, Glen <glenbarney@gmail.com> wrote:

TL;DR:
Is it possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that?
There are various variants of unmodified xen.git#staging-N.M snapshots available:

   zypper ar -cf \
     http://download.opensuse.org/repositories/home:/olh:/xen-buildrequires/openS... \
     xen_buildrequires
   zypper ar -cf \
     http://download.opensuse.org/repositories/home:/olh:/xen-4.11/SLE_15 \
     xen_411
   zypper dup --allow-vendor-change --from xen_buildrequires --from xen_411

This repo also contains snapshots of libvirt.git#master, qemu.git#master and qemu.git#stable-N.M. The extra qemu packages are not strictly required, because xen.git already includes a private copy of some qemu.git snapshot. The included libvirt snapshot may not work as expected, and tools that rely on libvirt may not work with it either.

Olaf
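A quick sketch of verifying the result after the vendor change; the rpm and xl invocations are standard, and xl info reflects the running hypervisor only after a reboot into it:

   rpm -q xen xen-tools           # confirm the installed package versions
   xl info | grep xen_version     # confirm the hypervisor version actually running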
On Mon, Feb 17, 2020 at 4:28 AM Olaf Hering <olaf@aepfle.de> wrote:
There are various variants of unmodified xen.git#staging-N.M snapshots available: [...]

Olaf
Olaf - Thank you so much! I will look into these right away!

All - There has been a bit of discussion on the Xen list about this. No developers have looked at it yet, but the current thinking seems to be that there is a bug that was introduced in 4.12. Another user reported that rolling back to 4.11 has fixed things.

I have a dedicated host/guest pair that I'm using for testing. Under a fresh host and guest load of OpenSuse 15.1/Xen 4.12, the guest ran under my contrived stress testing at a higher load average, and would stall in 24-48 hours. I reloaded just the host to OpenSuse 15.0/Xen 4.10 and repeated the same test. The guest (still at 15.1) is showing a lower load average, *subjectively* seems to be performing much better than it did under 15.1/4.12... and has survived into day 3 so far without stalling. I have no idea what this means yet, but I hope I can get some insight from the Xen team as testing continues.

Meanwhile, Olaf, thank you for this.

Glen
Dear OpenSuse Team: On Thu, Feb 13, 2020 at 12:44 PM Glen <glenbarney@gmail.com> wrote:
This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this.
Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.
I wanted to report back here and let you all know what we've found so far. After I raised this on the xen-users list, a number of other people stepped in and said that they were having similar problems. Guided by members of their community, we've done a bit of poking and testing. You can see all the details in their archive ( https://lists.xenproject.org/archives/html/xen-users/2020-02/ ), but the short of it is:

1. Several people had the same problem, where guests randomly stall/freeze.
2. The problem seems NOT to be related to OpenSuse itself, or OpenSuse version, or Linux Kernel version.
3. The problem DOES seem to be related to Xen version, and to a specific module, the "credit-scheduler-2".

Reverting to any Xen prior to Xen 4.12 fixes the problem (thank you Olaf!) but that's suboptimal in terms of wanting to run the latest software versions (or, more to the point, the production versions that come with the Leap releases.)

With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:

   GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"

Adding that last parameter causes Xen to boot with the older "credit scheduler" instead of the newer "credit2 scheduler", and that seems to resolve the problem for everyone. (I'm still running longer stress tests on my guests, but the results are encouraging so far.) A sketch of the full switch-and-verify procedure follows at the end of this message.

Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important.

I'm now inquiring of their users list when and how to file a bug report for this, and I'll continue to try to work with them, but I wanted to get this back to this group and list in case anyone else needs this info, and/or in case anyone here has any comments or additional guidance.

Thank you again to all of you who have helped me during this extended incident. I am very grateful to this community for all of your help!

Glen
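For anyone applying this on Leap 15.1, a minimal sketch of the full procedure, assuming the standard openSUSE grub2 layout; the expected scheduler strings in the comments come from Xen's boot log:

   # 1. Append sched=credit to GRUB_CMDLINE_XEN in /etc/default/grub, then:
   grub2-mkconfig -o /boot/grub2/grub.cfg
   reboot

   # 2. After the reboot, confirm the hypervisor booted with the credit scheduler:
   xl dmesg | grep -i scheduler   # expect "SMP Credit Scheduler" (credit), not credit2
   xl cpupool-list                # the Sched column should read "credit"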
On Mon, 2020-02-17 at 18:39 -0800, Glen wrote:
Dear OpenSuse Team:
Hello here as well, :-)
1. Several people had the same problem, where guests randomly stall/freeze.
2. The problem seems NOT to be related to OpenSuse itself, or OpenSuse version, or Linux Kernel version.
3. The problem DOES seem to be related to Xen version, and to a specific module, the "credit-scheduler-2".
Reverting to any Xen prior to Xen 4.12 fixes the problem (thank you Olaf!) but that's suboptimal in terms of wanting to run the latest software versions (or, more to the point, the production versions that come with the Leap releases.)
With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:
GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"
Right.
Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important.
"Us" being? openSUSE? Well, I guess that if upstream changes the default, we'll do the same, unless there are very good reasons not to. But I don't think this is what we should focus on at this stage...
I'm now inquiring of their users list when and how to file a bug report for this, and I'll continue to try to work with them, but I wanted to get this back to this group and list in case anyone else needs this info, and/or in case anyone here has any comments or additional guidance.
You have done a good job at reporting a bug on the Xen developer mailing list. The issue managed (although after a little while, but not at all through any fault of yours) to catch the Xen scheduler developers' and maintainers' attention (Juergen Gross and myself :-)). So, nothing much more to say than <<Thanks! Keep up the good work of reporting nasty issues!>> :-)

Speaking of that (I mean, of keeping up the good work): as asked on the upstream ML already, if/when you still have the chance to reproduce the problematic situation when running on Credit2, we'll be very happy to see some more logs.

Also, it's more than ok to continue this conversation here, but since it is an upstream issue, please do report any updates and logs that you can capture directly upstream (i.e., on the xen-devel mailing list).

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Hello Dario - Thank you for this email! On Wed, Feb 19, 2020 at 5:36 AM Dario Faggioli <dfaggioli@suse.com> wrote:
Members of the Xen community have suggested making sched=credit the default until problems with credit-scheduler-2 are fixed. I have no idea how that would apply to us, but felt I should mention that, as it seems important. "Us" being? openSUSE? Well, I guess that if upstream changes the default, we'll do the same, unless there are very good reasons not to.
Yes. Please pardon my imprecise wording... this has been a months-long journey for me, and my fatigue was showing.
You have done a good job at reporting a bug on the Xen developer mailing list. The issue managed (although after a little while, but not at all through any fault of yours) to catch the Xen scheduler developers' and maintainers' attention (Juergen Gross and myself :-)). So, nothing much more to say than <<Thanks! Keep up the good work of reporting nasty issues!>> :-)
Thank you!
Speaking of that (I mean, of keeping up the good work): as asked on the upstream ML already, if/when you still have the chance to reproduce the problematic situation when running on Credit2, we'll be very happy to see some more logs. Also, it's more than ok to continue this conversation here, but since it is an upstream issue, please do report any updates and logs that you can capture directly upstream (i.e., on the xen-devel mailing list).
Will do on both counts.

Best regards,

Glen
On Mon, Feb 17, 2020 at 6:39 PM Glen <glenbarney@gmail.com> wrote:
Dear OpenSuse Team: On Thu, Feb 13, 2020 at 12:44 PM Glen <glenbarney@gmail.com> wrote:
Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.

With that in mind, the best fix so far seems to be to add "sched=credit" to GRUB_CMDLINE_XEN in /etc/default/grub, as in:

   GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"

Adding that last parameter causes Xen to boot with the older "credit scheduler" instead of the newer "credit2 scheduler", and that seems to resolve the problem for everyone. (I'm still running longer stress tests on my guests, but the results are encouraging so far.)
It's been a week, during which I've stress-tested a number of guests on OpenSuse 15.1 + Xen 4.12.1 + the boot option sched=credit. All guests survived without any issue at all. Others on the Xen lists are reporting similar successes. I've completed my client migration, and those guests are also surviving nicely. So I'm tagging this [SOLVED] and tossing in a few subject line keywords for the archives, to help others find this. I've now brought one of my (no longer in use) hosts back to credit2, so I can start crashing that guest and hopefully capture debugging information from it. I'll report that data directly to xen-devel, since this seems not to be an OpenSuse problem.

Apart from OpenSuse possibly making "sched=credit" a Xen boot default now, via an update or whatever, there is one other small item of interest for this group. It has been suggested ( https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00052.html ) that Xen 4.12.2 contains fixes to these and other problems. I do not know how these things work within OpenSuse, but I wonder if bringing in Xen 4.12.2 as an update to OpenSuse 15.1 might be a useful thing to do? I throw that out there for all of you who know much more about these things than I. A sketch of how to watch for that update follows at the end of this message.

Anyway, I expect this to be my last message on this topic to this group, but I wanted to thank all of you who responded to me over the past months, again, for your help and patience. Everyone is busy, and I was and am very grateful for your time!

Best regards,

Glen
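A small sketch for watching whether a newer Xen (e.g. 4.12.2) lands in the 15.1 update repositories; the zypper syntax is standard, but whether and when the update appears depends on the maintenance schedule:

   zypper refresh
   zypper se -s -x xen    # exact-match search listing every available xen version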
Hi again Glen, On Mon, 2020-02-24 at 08:06 -0800, Glen wrote:
I've completed my client migration, and those guests are also surviving nicely. So I'm tagging this [SOLVED] and tossing in a few subject line keywords for the archives, to help others find this. I've now brought one of my (no longer in use) hosts back to credit2, so I can start crashing that guest and hopefully capture debugging information from it.
That is great to hear... looking forward to some more info and logs on the subject.
I'll report that data directly to xen-devel since that seems to not be an OpenSuse problem.
Yep, this is definitely an upstream issue, and we definitely should talk on upstream MLs. It would be useful, though, to have a bug opened for it for better tracking (see below).
Apart from OpenSuse possibly making "sched=credit" a Xen boot default now, via an update or whatever, there is one other small item of interest for this group. It has been suggested ( https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00052.html ) that Xen 4.12.2 contains fixes to these and other problems. I do not know how these things work within OpenSuse, but I wonder if bringing in Xen 4.12.2 as an update to OpenSuse 15.1 might be a useful thing to do? I throw that out there for all of you who know much more about these things than I.
In general, fixes are applied (backported). Whatever the fix for this turns out to be - really reverting to "sched=credit", or actual changes to Credit2 (or whatever else) - I think you can expect it to be backported to any supported distribution.
Anyway, I expect this to be my last message on this topic to this group, but wanted to thank all of you who responded to me over the past months again for your help and patience.
Actually, thank you again for reporting the problem and for your willingness to help us find the solution.

As I mentioned above, would you be willing to open a bug on https://bugzilla.opensuse.org/ about this? It is indeed something we'll deal with upstream, but I think it would still be useful. In the description of the problem, you can mention the fact that the problem is being investigated, and link the threads on the xen-devel and xen-users mailing lists.

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
On Thu, Feb 27, 2020 at 2:57 AM Dario Faggioli <dfaggioli@suse.com> wrote:
Hi again Glen,
Hello! Thank you so much for your email!
That is great to hear... looking forward for some more info and logs on the subject.
I shall be sure to include you.
Yep, this is definitely an upstream issue, and we definitely should talk on upstream MLs. It would be useful, though, to have a bug opened for it for better tracking (see below).
Acknowledged.
In general, fixes are applied (backported). Whatever the fix for this would be to really revert to "sched=credit" or some actual changes to Credit2 (or whatever else), I think you can expect for it to be backported to any supported distribution.
Thank you!
Actually, thank you again for reporting the problem and for your willingness to help us find the solution.
Thank you for your kindness! I've been using OpenSuse forever, and it always just works. This is the first time I've had to deal with something this serious, and I am so grateful to all of the members of the OpenSuse community for their kindness and patience as I worked through this!
As I mentioned above, would you be willing to open a bug on https://bugzilla.opensuse.org/ about this? It is indeed something we'll deal with upstream, but I think it would still be useful. In the description of the problem, you can mention the fact that the problem is being investigated, and link the threads on the xen-devel and xen-users mailing lists.
I have done as you have requested, and the link to the bug is here: https://bugzilla.opensuse.org/show_bug.cgi?id=1165206

Thank you!

Glen