- openSUSE Virtual

Repository signing key of Backport repository missing in Vagrant box for Leap 15.5
by Gregor Dschung 03 Apr '24

03 Apr '24

Hi, I'm using this box with Vagrant and VirtualBox: https://download.opensuse.org/repositories/Virtualization:/Appliances:/Imag… Unfortunately, the image doesn't include the current repository signing key of http://download.opensuse.org/update/leap/15.5/backports/ IMHO `zypper ref` should be called even on the first run without any interaction. Only the key of the backport repository is missing. localhost:/etc/zypp # zypper ref New repository or package signing key received: Repository: Update repository of openSUSE Backports Key Fingerprint: F044 C2C5 07A1 262B 538A AADD 8A49 EB03 25DB 7AE0 Key Name: openSUSE:Backports OBS Project <openSUSE:Backports@build.opensuse.org> Key Algorithm: RSA 4096 Key Created: Wed May 10 14:46:12 2023 Key Expires: Sun May 9 14:46:12 2027 Rpm Name: gpg-pubkey-25db7ae0-645bae34 Note: Signing data enables the recipient to verify that no modifications occurred after the data were signed. Accepting data with no, wrong or unknown signature can lead to a corrupted system and in extreme cases even to a system compromise. Note: A GPG pubkey is clearly identified by its fingerprint. Do not rely on the key's name. If you are not sure whether the presented key is authentic, ask the repository provider or check their web site. Many providers maintain a web page showing the fingerprints of the GPG keys they are using. Do you want to reject the key, trust temporarily, or trust always? [r/t/a/?] (r): a Retrieving repository 'Update repository of openSUSE Backports' metadata .......................................................................................................................[done] Building repository 'Update repository of openSUSE Backports' cache ............................................................................................................................[done] Retrieving repository 'Non-OSS Repository' metadata ............................................................................................................................................[done] Building repository 'Non-OSS Repository' cache .................................................................................................................................................[done] Retrieving repository 'Open H.264 Codec (openSUSE Leap)' metadata ..............................................................................................................................[done] Building repository 'Open H.264 Codec (openSUSE Leap)' cache ...................................................................................................................................[done] Retrieving repository 'Main Repository' metadata ...............................................................................................................................................[done] Building repository 'Main Repository' cache ....................................................................................................................................................[done] Retrieving repository 'Update repository with updates from SUSE Linux Enterprise 15' metadata ..................................................................................................[done] Building repository 'Update repository with updates from SUSE Linux Enterprise 15' cache .......................................................................................................[done] Retrieving repository 'Main Update Repository' metadata ........................................................................................................................................[done] Building repository 'Main Update Repository' cache .............................................................................................................................................[done] Retrieving repository 'Update Repository (Non-Oss)' metadata ...................................................................................................................................[done] Building repository 'Update Repository (Non-Oss)' cache ........................................................................................................................................[done] All repositories have been refreshed. With Leap 15.4, I do not have this issue: localhost:/home/vagrant # zypper ref Warning: The gpg key signing file 'repomd.xml' has expired. Repository: Update repository of openSUSE Backports Key Fingerprint: 637B 32FF 3D83 F07A 7AE1 C40A 9C21 4D40 6517 6565 Key Name: openSUSE:Backports OBS Project <openSUSE:Backports@build.opensuse.org> Key Algorithm: RSA 2048 Key Created: Fri Nov 26 14:26:23 2021 Key Expires: Sun Feb 4 14:26:23 2024 (EXPIRED) Rpm Name: gpg-pubkey-65176565-61a0ee8f Retrieving repository 'Update repository of openSUSE Backports' metadata ........................................................................[done] Building repository 'Update repository of openSUSE Backports' cache .............................................................................[done] Retrieving repository 'Non-OSS Repository' metadata .............................................................................................[done] Building repository 'Non-OSS Repository' cache ..................................................................................................[done] Retrieving repository 'Main Repository' metadata ................................................................................................[done] Building repository 'Main Repository' cache .....................................................................................................[done] Retrieving repository 'Update repository with updates from SUSE Linux Enterprise 15' metadata ...................................................[done] Building repository 'Update repository with updates from SUSE Linux Enterprise 15' cache ........................................................[done] Retrieving repository 'Main Update Repository' metadata .........................................................................................[done] Building repository 'Main Update Repository' cache ..............................................................................................[done] Retrieving repository 'Update Repository (Non-Oss)' metadata ....................................................................................[done] Building repository 'Update Repository (Non-Oss)' cache .........................................................................................[done] All repositories have been refreshed. Btw., where are the repository keys located in the filesystem? "The internet" is suggesting "/var/cache/zypp/pubkeys", but there is no such directory under /var/cache/zypp. Regards, Gregor

3 2

Re: reducing the number of repositories in Virtualization project
by Olaf Hering 09 Dec '21

09 Dec '21

Am Thu, 9 Dec 2021 12:40:20 -0600 schrieb Larry Finger <Larry.Finger(a)lwfinger.net>: > The package contains some files that are flagged as forbidden in the Backports > project. Yes, the maintainers of that project want it that way. I added the workaround to prjconf after I added the repositories, so the first iteration failed. Olaf

1 0

reducing the number of repositories in Virtualization project
by Olaf Hering 09 Dec '21

09 Dec '21

https://build.opensuse.org/repositories/Virtualization is a development project, which means it may need just the 'openSUSE_Tumbleweed' and the 'Assist_repo_checker' repository. But over time this prj got many more repositories. I think the main reason is virtualbox, which provides a KMP. Therefore I propose to move all the repositories to a new project, Virtualization:virtualbox or Virtualization:VirtualBox. This new prj will contain a single pkg container 'virtualbox', which is just a _link back to Virtualization/virtualbox. That way the history is preserved, and it will continue to be the development pkg for Factory. Once this is move is done, unneeded repositories in Virtualization can be removed. This is supposed to reduce the burden for individual other packages which may target just Tumbleweed. Olaf

2 3

Re: reducing the number of repositories in Virtualization project
by Olaf Hering 09 Dec '21

09 Dec '21

Am Wed, 8 Dec 2021 20:16:05 -0600 schrieb Larry Finger <Larry.Finger(a)lwfinger.net>: > Please use Virtualization:virtualbox. This project is now yours: https://build.opensuse.org/project/show/Virtualization:virtualbox I think the kbuild pkg can be removed again, unless SLE_12_SP5 is supposed to be a target. If it is, gsoup and byacc need to be added, and a few tweaks in prjconf need to be done to deal with the X11 package names. Olaf

1 0

[opensuse-virtual] qemu 4.2.0 builds with GCC10 -> fail; GCC9 ok ?
by PGNet Dev 24 Mar '20

24 Mar '20

just starting exploring builds with GCC-10 building qemu 4.2.0, linked from Virtualization:qemu @ my obs, https://build.opensuse.org/project/show/home:pgnd:Virtualization:qemu with gcc --version gcc (SUSE Linux) 10.0.1 20200320 (experimental) [revision 7d4549b2cd209eb621453ce13be7ffd84ffa720a] build on OBS currently fails ... [ 244s ]cc -iquote /home/abuild/rpmbuild/BUILD/mybuilddir/contrib/vhost-user-gpu -iquote contrib/vhost-user-gpu -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/tcg -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/tcg/i386-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/linux-headers -I/home/abuild/rpmbuild/BUILD/mybuilddir/linux-headers -iquote . -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0-iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/accel/tcg -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/include -I/usr/include/pixman-1-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/dtc/libfdt -Werror -pthread -I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -pthread -I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64-mcx16-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64-D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -std=gnu99-fmessage-length=0-grecord-gcc-switches -O2-Wall -D_FORTIFY_SOURCE=2-fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -I/usr/include/p11-kit-1-I/usr/include/libpng16-I/usr/include/libdrm -pthread -DOPENSSL_LOAD_CONF -I/usr/include/spice-server -I/usr/include/cacard -I/usr/include/nss3-I/usr/include/nspr4-I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1-I/usr/include/spice-1-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/tests -MMD -MP -MT contrib/vhost-user-gpu/virgl.o -MF contrib/vhost-user-gpu/virgl.d -O2-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2-g -I/usr/include/virgl -c -o contrib/vhost-user-gpu/virgl.o /home/abuild/rpmbuild/BUILD/qemu-4.2.0/contrib/vhost-user-gpu/virgl.c [ 244s ]/home/abuild/rpmbuild/BUILD/qemu-4.2.0/scsi/qemu-pr-helper.c: In function 'multipath_pr_out': [ 244s ]/home/abuild/rpmbuild/BUILD/qemu-4.2.0/scsi/qemu-pr-helper.c:523:32: error: array subscript <unknown> is outside array bounds of 'struct transportid *[ 0 ]' [ -Werror=array-bounds ][ 244s ]523| paramp.trnptid_list[ paramp.num_transportid++ ]= id; [ 244s ]| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~ [ 244s ]In file included from /home/abuild/rpmbuild/BUILD/qemu-4.2.0/scsi/qemu-pr-helper.c:36: [ 244s ]/usr/include/mpath_persist.h:168:22: note: while referencing 'trnptid_list' [ 244s ]168| struct transportid *trnptid_list[]; [ 244s ]| ^~~~~~~~~~~~ [ 244s ]/home/abuild/rpmbuild/BUILD/qemu-4.2.0/scsi/qemu-pr-helper.c:424:35: note: defined here 'paramp' [ 244s ]424| struct prout_param_descriptor paramp; [ 244s ]| ^~~~~~ [ 245s ]cc -iquote /home/abuild/rpmbuild/BUILD/mybuilddir/contrib/vhost-user-gpu -iquote contrib/vhost-user-gpu -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/tcg -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/tcg/i386-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/linux-headers -I/home/abuild/rpmbuild/BUILD/mybuilddir/linux-headers -iquote . -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0-iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/accel/tcg -iquote /home/abuild/rpmbuild/BUILD/qemu-4.2.0/include -I/usr/include/pixman-1-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/dtc/libfdt -Werror -pthread -I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -pthread -I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64-mcx16-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64-D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -std=gnu99-fmessage-length=0-grecord-gcc-switches -O2-Wall -D_FORTIFY_SOURCE=2-fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -I/usr/include/p11-kit-1-I/usr/include/libpng16-I/usr/include/libdrm -pthread -DOPENSSL_LOAD_CONF -I/usr/include/spice-server -I/usr/include/cacard -I/usr/include/nss3-I/usr/include/nspr4-I/usr/include/glib-2.0-I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1-I/usr/include/spice-1-I/home/abuild/rpmbuild/BUILD/qemu-4.2.0/tests -MMD -MP -MT contrib/vhost-user-gpu/vugbm.o -MF contrib/vhost-user-gpu/vugbm.d -O2-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2-g -c -o contrib/vhost-user-gpu/vugbm.o /home/abuild/rpmbuild/BUILD/qemu-4.2.0/contrib/vhost-user-gpu/vugbm.c [ 245s ]cc1: all warnings being treated as errors [ 245s ]make: *** [ /home/abuild/rpmbuild/BUILD/qemu-4.2.0/rules.mak:69: scsi/qemu-pr-helper.o ]Error 1[ 245s ]make: *** Waiting for unfinished jobs.... [ 245s ]error: Bad exit status from /var/tmp/rpm-tmp.Z5bWqs (%build) ... with gcc*9*, all's good. there are certainly GCC10 bugs that are affecting kernel builds; not surprising that there might be others. _completely_ unclear, yet, if THIS^^ is GCC10- or qemu- related ... or perhaps prjconf on obs ... anyone had luck with a qemu pkg build wigh GCC 10? or off-the-bat recognize the issue?

2 2

[opensuse-virtual] Xen 4.12 DomU hang / freeze / stall under high network/disk load (single thread now, and updates)
by Glen 27 Feb '20

27 Feb '20

Dear OpenSuse Team: This is a followup to my two previous threads about 42.3 and 15.1 DomU machines hanging under high disk load. I repeat my thanks to all of you who responded to me and tried to help me with this. What follows below is an update/new report on this problem, which is now no longer limited to just me, and which I can now duplicate even on fresh loads. Here we go: Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity. Symptoms on the DomU Guest: 1. Guest machine performs normally until the moment of failure. No abnormal log/console entries exist. 2. At the moment of failure, the guest's network goes offline. No abnormal log/console entries are written at that moment. 3. Processes which were trying to connect to the network start to consume increasing amounts of CPU. 4. Load average of the guest starts to increase, continuing upward without apparent bound. 5. If a high-priority bash shell is left logged in on the guest hvc0 console, some commands might still be runnable; most are not. 6. If the guest console is not logged in, the console is frozen and doesn't even echo characters. 7. Some guests will output messages on the console like this: kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! 8. On some others, I will also see output like: BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s! 9. Sometimes there is no output at all on the console. Symptoms on the Dom0 Host: The host is unaffected. The only indication anything is happening on the host are two log entries in /var/log/messages: vif vif-6-0 vif6.0: Guest Rx stalled br0: port 2(vif6.0) entered disabled state Circumstances when the problem first occurred: 1. All hosts and guests were previously on OpenSuse 42.3 (Linux 4.4.180, Xen 4.9.4) 2. I upgraded one physical host to OpenSuse 15.1 (Linux 4.12.14, Xen 4.12.1). 3. The guest(s) started malfunctioning at that point. Immediate steps taken while the guest was stalled, which did not help: 1. Tried to use high-priority shell on guest console to kill high-CPU processes; they were unkillable. 2. Tried to use guest console to stop and restart network; commands were unresponsive. 3. Tried to use guest console to shutdown/init 0. This caused console to be terminated, but guest would not otherwise shutdown. 4. Tried to use host xl interface to unplug/replug network bridges. This appeared to work from host side, but guest was unaffected. One thing which I accidentally discovered that *did* help: 1. Tried ending xl trigger nmi from the host to the guest. When I trigger the stalled guest with an NMI, I get its attention. The guest will print the following on the console: Uhhuh. NMI received for unknown reason 00 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue In some cases (pattern not yet known), the guest will then immediately come back online: The network will come back online, and all processes will slowly stop consuming CPU, and things will return to normal. Existing network connections were obviously terminated, but new connections are accepted. In that case, it's like the guest just magically comes back to life. When this works, the host log shows: vif vif-6-0 vif6.0: Guest Rx ready br0: port 2(vif6.0) entered blocking state br0: port 2(vif6.0) entered forwarding state And all seems well... as if the guest had never stalled. However, this is not reliable. In some cases, the guest will print those messages, but the processes will NOT recover, and the network will come back impaired, or not at all. When that happens, repeated NMIs do not help: If the guest doesn't recover the first time, it doesn't recover at all. The *only* reliable way to fix this is to destroy the guest completely, and recreate it. The guest will then run fine... until the next stall. But of course a hard-destroy can't be a healthy thing for a guest machine, and that's really not a solution. Long-term mitigation steps which were tried which did not help. 1. Thought this was an SSH bug (since sshd processes were consuming high CPU), installed latest OpenSSH. 2. Though maybe a PV problem, tried under HVM instead of PV. 3. Noted a problem with grant frames, applied the recommended fix for that, my config now looks like: # xen-diag gnttab_query_size 0 # Domain-0 domid=0: nr_frames=1, max_nr_frames=64 # xen-diag gnttab_query_size 1 # Xenstore domid=1: nr_frames=4, max_nr_frames=4 # xen-diag gnttab_query_size 6 # My guest domid=6: nr_frames=17, max_nr_frames=256 4. Thought maybe a kernel module might be at issue, reviewed list with OpenSuse team. 5. Thought this might be a kernel mismatch, was referred to a new kernel by OpenSuse team (4.12.13 for OpenSuse 42.3). That changed some of the console output behavior and logging, but did not solve the problem. 6. Thought this might be a general OS mismatch, tried upgrading the guest to OpenSuse 15.1/Linux 4.12.14/Xen 4.12.1. In this configuration, no console or log output is generated on the guest at all, it just stalls. 7. Assumed (incorrectly, it now turns out) that something was just "wrong" with my guest, tried a fresh load of host, and a fresh guest. I thought that would solve it, but to my sadness, it did not. Which means that this is now a reproducible bug. Steps to reproduce: 1. Get a server. I'm using a Dell PowerEdge R720, but this has happened on several different Dell models. My current server has two 16-core CPUs, and 128GB of RAM. 2. Load OpenSuse 15.1 (which includes Xen 4.12.1) on the server. Boot it up in Xen Dom0/host mode. 3. Create a new guest machine, also with 15.1/4.12.1. 4. Fire up the guest. 5. Put a lot of data on the guest (my guest has 3 TB of files and data). 6. Plug a crossover cable into your server, and plug the other end into some other Linux machine. 7. From that other machine, start pounding the guest. An rsync of the entire data partition is a great way to trigger this. If I run several outbound rsyncs together, I can crash my guest in under 48 hours. If I run 4 or 5, I can often crash the guest in just 2 hours. If you don't want to damage your SSDs on your other machine, here's my current command (my host is 192.168.1.10, and my guest is 192.168.1.11, so I plug in some other machine and make it, say, 192.168.1.12, and then run: nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null & Where /a is my directory full of user data. 4-6 of these running simultaneously will bring the guest to its knees in short order. On my most recent test, I did the NMI trigger thing, and found this in the guest's /var/log/messages after sending the trigger (i've removed tagging and timestamping for clarity:) Uhhuh. NMI received for unknown reason 00 on CPU 0. Do you have a strange powersaving mode enabled? Dazed and confused, but trying to continue clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask: ffffffffffffffff clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff tsc: Marking TSC unstable due to clocksource watchdog BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256 pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap workqueue mm_percpu_wq: flags=0x8 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 pending: vmstat_update workqueue writeback: flags=0x4e pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256 in-flight: 28593:wb_workfn workqueue kblockd: flags=0x18 pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256 pending: blk_mq_run_work_fn, blk_mq_timeout_work pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125 That led me to search around, and I tripped over this: https://wiki.debian.org/Xen/Clocksource , which describes a guest hanging with the message "clocksource/0: Time went backwards/" Although I did not see this message, and this is not directly on point with OpenSuse (since our /proc structure doesn't include some of the switches mentioned), I did notice clocksource references in the logs (see above), and that led me back to: https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/c…, and specifically the tsc_mode setting. I have no idea if it's relevant, but I since I'm out of ideas and have nothing better to try, I have now booted my guest into tsc_mode=1 and am stress testing it to see if it fares any better this way. I had originally thought that I was the only person with this problem, and that's why I thought a fresh guest would fix it - the problem followed me around different servers, so that made sense. Over the past weeks I've set up a fresh guest on my fresh host, and, just on a whim, did the above stress testing on it... it lasted for 36 hours. That led me to start searching the net again, and I found that, just in the past few weeks, another person has reported what seems to be the same problem, only he reported it to Xen-users (so I assume he's on a different distro - see https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html for the overall message.) Nobody has responded to him yet... but I'm about to... I'm going to send this report there too. He states that the problem is limited to Xen 4.12 and 4.13, and that rolling back to Xen 4.11 solves the problem. Meanwhile, I'm hoping that these updated details and history spark something new for some of you here. Do any of you have any ideas on this? Any thoughts, guidance, musings, etc., anything at all would be appreciated. As I said I'm sending this to Xen now too. But since it never occurred to me that I could upgrade 42.3 to a 4.12 kernel - that was a surprise to me! - I guess there is another question I'd like to ask this group: It is possible, somehow, under OpenSuse 15.1, to "downgrade" Xen to 4.11, without harming anything else? And if so, what procedure would I follow for that? Again, thank you all for your patience and help, I am very grateful! Glen -- To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org

3 8

[opensuse-virtual] full cstate/cpufreq/cpupower support withOUT Xen 4.13 on kernel 5.4.14; WITH xen, none at all. bug or config?
by PGNet Dev 28 Jan '20

28 Jan '20

( I'd already posted this at xen-users; no traction to date ) I'm running linux kernel lsb_release -rd Description: openSUSE Leap 15.1 Release: 15.1 uname -rm 5.4.14-24.gfc4ea7a-default x86_64 dmesg | grep DMI: [ 0.000000] DMI: Supermicro X10SAT/X10SAT, BIOS 3.0 05/26/2015 cat /proc/cpuinfo | grep "model name" | head -n 1 model name : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz kernel & xen are pkg-installed from my KernelStable and Virtualization-Xen repos @ OBS. BIOS *is* setup for max cstate support. Xeon E3-1220 does support intel_pstate driver. Testing first, (1) boot, NO XEN pstate driver's init'd dmesg | egrep -i "intel_pstate" [ 6.132964] intel_pstate: Intel P-state driver initializing pstate/cstate info cat /sys/module/intel_idle/parameters/max_cstate 9 cd /sys/devices/system/cpu/cpu0/cpuidle for state in state{0..9} do echo c-$state `cat $state/name` `cat $state/latency` done c-state0 POLL 0 c-state1 C1 2 c-state2 C1E 10 c-state3 C3 33 c-state4 C6 133 c-state5 C7s 166 cat: state6/name: No such file or directory cat: state6/latency: No such file or directory c-state6 cat: state7/name: No such file or directory cat: state7/latency: No such file or directory c-state7 cat: state8/name: No such file or directory cat: state8/latency: No such file or directory c-state8 cat: state9/name: No such file or directory cat: state9/latency: No such file or directory c-state9 cpufreq scaling info's available, cpupower frequency-info analyzing CPU 0: driver: intel_pstate CPUs which run at the same hardware frequency: 0 CPUs which need to have their frequency coordinated by software: 0 maximum transition latency: Cannot determine or is not supported. hardware limits: 800 MHz - 3.50 GHz available cpufreq governors: performance powersave current policy: frequency should be within 800 MHz and 3.50 GHz. The governor "powersave" may decide which speed to use within this range. current CPU frequency: Unable to call hardware current CPU frequency: 799 MHz (asserted by call to kernel) boost state support: Supported: yes Active: yes & scaling is in effect, cat /proc/cpuinfo | grep MHz cpu MHz : 798.106 cpu MHz : 798.129 cpu MHz : 798.964 cpu MHz : 798.154 (2) boot, WITH Xen 4.13 rpm -qa | grep -i xen | sort grub2-x86_64-xen-2.04-lp151.6.5.noarch xen-4.13.0_04-lp151.688.2.x86_64 xen-libs-4.13.0_04-lp151.688.2.x86_64 xen-tools-4.13.0_04-lp151.688.2.x86_64 Xen cmd line includes, grep options= /boot/grub2/xen-4.13.0_04-lp151.688.cfg [config.1] options=dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M dom0_max_vcpus=4 cpufreq=xen cpuidle ucode=scan ... intel_pstate support is now DISABLED for this cpu xl dmesg | grep pstate [ 6.851121] intel_pstate: CPU model not supported c-states report, xenpm get-cpuidle-states 0 All C-states allowed cpu id : 0 total C-states : 6 idle time(ms) : 45780911 C0 : transition [ 3204855] residency [ 160769 ms] C1 : transition [ 9204] residency [ 1018 ms] C2 : transition [ 10181] residency [ 2848 ms] C3 : transition [ 22784] residency [ 17236 ms] C4 : transition [ 7181] residency [ 11793 ms] C5 : transition [ 3155504] residency [ 45668846 ms] pc2 : [ 1685 ms] pc3 : [ 30695 ms] cc3 : [ 16858 ms] cc6 : [ 11640 ms] cc7 : [ 45602872 ms] NO cpupower frequency-info is available cpupower frequency-info analyzing CPU 0: no or unknown cpufreq driver is active on this CPU CPUs which run at the same hardware frequency: Not Available CPUs which need to have their frequency coordinated by software: Not Available maximum transition latency: Cannot determine or is not supported. Not Available available cpufreq governors: Not Available Unable to determine current policy current CPU frequency: Unable to call hardware current CPU frequency: Unable to call to kernel boost state support: Supported: no Active: no and scaling is NOT in effect cat /proc/cpuinfo | grep MHz cpu MHz : 3092.828 cpu MHz : 3092.828 cpu MHz : 3092.828 cpu MHz : 3092.828 attempt to add acpi-cpufreq module fails lsmod | grep acpi-cpufreq (empty) find /lib/modules/ | grep acpi-cpu /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko modprobe acpi-cpufreq modprobe: ERROR: could not insert 'acpi_cpufreq': No such device insmod /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko insmod: ERROR: could not insert module /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko: No such device Is this bug, or config? -- To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org

2 7

[opensuse-virtual] xen 4.13 + kernel 5.4.11 'APIC Error ... FATAL PAGE FAULT' on reboot? non-Xen reboot's ok.
by PGNet Dev 16 Jan '20

16 Jan '20

I've a recently upgraded (pkgs via zypper), running Xen 4.13.0_04 server, on EFI hardware + Intel Xeon E3 CPU, with kernel 5.4.11-24.g2d02eb4-default on lsb_release -rd Description: openSUSE Leap 15.1 Release: 15.1 It boots as always, with no issue Welcome to GRUB! Please press t to show the boot menu on this console Xen 4.13.0_04-lp151.688 (c/s ) EFI loader Using configuration file 'xen-4.13.0_04-lp151.688.cfg' vmlinuz-5.4.11-24.g2d02eb4-default: 0x000000008b7c0000-0x000000008c04efb8 initrd-5.4.11-24.g2d02eb4-default: 0x000000008a4a5000-0x000000008b7bfe28 0x0000:0x00:0x19.0x0: ROM: 0x10000 bytes at 0x928a9018 0x0000:0x04:0x00.0x0: ROM: 0x8000 bytes at 0x928a0018 0x0000:0x10:0x00.0x0: ROM: 0x10800 bytes at 0x92885018 __ __ \ \/ /___ _ __ \ // _ \ '_ \ / \ __/ | | | /_/\_\___|_| |_| _ _ _ _____ ___ ___ _ _ _ _ ____ _ __ ___ ___ | || | / |___ / / _ \ / _ \| || | | |_ __ / | ___|/ | / /_ ( _ ) ( _ ) | || |_ | | |_ \| | | | | | | | || |_ __| | '_ \| |___ \| || '_ \ / _ \ / _ \ |__ _|| |___) | |_| | | |_| |__ _|__| | |_) | |___) | || (_) | (_) | (_) | |_|(_)_|____(_)___/___\___/ |_| |_| .__/|_|____/|_(_)___/ \___/ \___/ |_____| |_| (XEN) [00000026c8dc8909] Xen version 4.13.0_04-lp151.688 (abuild(a)suse.de) (gcc (SUSE Linux) 9.2.1 20200109 [gcc-9-branch revi sion 280039]) debug=n Wed Jan 8 11:43:04 UTC 2020 (XEN) [00000026cbd609dc] Latest ChangeSet: (XEN) [00000026cc9505ea] Bootloader: EFI (XEN) [00000026cd46f20f] Command line: dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M bootscrub=false dom0_max_vcp us=4 vga=gfx-1920x1080x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 sched=credit2 ucode=scan log_buf_len=16M loglvl=warning guest_loglvl=none/warning noreboot=false iommu=verbose sync_console=false ... on exec of cmdline shutdown from shell, shutdown -r now the system DOES reboot, but first throws an APIC error -- only if running Xen, reboot with no-hypervisor has not probs 1st step, here's the current, relevant _log_ trace ... [ OK ] Reached target Shutdown. [ 343.932856] watchdog: watchdog0: watchdog did not stop! [ 346.871303] watchdog: watchdog0: watchdog did not stop! dracut Warning: Killing all remaining processes mdadm: stopped /dev/md4 mdadm: stopped /dev/md3 mdadm: stopped /dev/md2 mdadm: stopped /dev/md1 mdadm: stopped /dev/md0 Rebooting. [ 352.396918] reboot: Restarting system (XEN) [2020-01-15 15:01:26] Hardware Dom0 shutdown: rebooting machine (XEN) [2020-01-15 15:01:26] APIC error on CPU0: 40(00) (XEN) [2020-01-15 15:01:26] ----[ Xen-4.13.0_04-lp151.688 x86_64 debug=n Not tainted ]---- (XEN) [2020-01-15 15:01:26] CPU: 0 (XEN) [2020-01-15 15:01:26] RIP: e008:[<0000000000000000>] 0000000000000000 (XEN) [2020-01-15 15:01:26] RFLAGS: 0000000000010202 CONTEXT: hypervisor (XEN) [2020-01-15 15:01:26] rax: 0000000000000286 rbx: 0000000000000000 rcx: 0000000000000000 (XEN) [2020-01-15 15:01:26] rdx: 000000009e5ca7a0 rsi: 0000000000000000 rdi: 0000000000000000 (XEN) [2020-01-15 15:01:26] rbp: 0000000000000000 rsp: ffff83008ca2fa48 r8: ffff83008ca2fa90 (XEN) [2020-01-15 15:01:26] r9: ffff83008ca2fa80 r10: 0000000000000000 r11: 0000000000000000 (XEN) [2020-01-15 15:01:26] r12: 0000000000000000 r13: ffff83008ca2fb00 r14: ffff83008ca2ffff (XEN) [2020-01-15 15:01:26] r15: 0000000000000000 cr0: 0000000080050033 cr4: 00000000001526e0 (XEN) [2020-01-15 15:01:26] cr3: 00000008492ed000 cr2: ffffffffeef3f286 (XEN) [2020-01-15 15:01:26] fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000 (XEN) [2020-01-15 15:01:26] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 (XEN) [2020-01-15 15:01:26] Xen code around <0000000000000000> (0000000000000000) [fault on access]: (XEN) [2020-01-15 15:01:26] -- -- -- -- -- -- -- -- <00> 80 00 f0 f3 ee 00 f0 c3 e2 00 f0 f3 ee 00 f0 (XEN) [2020-01-15 15:01:26] Xen stack trace from rsp=ffff83008ca2fa48: (XEN) [2020-01-15 15:01:26] 000000009e5ca3c9 ffff82d08036681f ffff82d08036682b 0000000000000000 (XEN) [2020-01-15 15:01:26] 0000000000000000 ffff83008ca2fa88 0000000000000000 00000000001526e0 (XEN) [2020-01-15 15:01:26] ffff82d0802758cd 0000000000000286 0000000000000286 0000000000000000 (XEN) [2020-01-15 15:01:26] 000000009efe42f6 0000000000000000 0000000000000000 ffff83008ca2fb00 (XEN) [2020-01-15 15:01:26] ffff82d08036331b 0000000000152660 ffff82d0803636ae 0000000000000000 (XEN) [2020-01-15 15:01:26] ffff83008ca2fb48 0000000000000000 ffff82d080363688 000000008ca1f000 (XEN) [2020-01-15 15:01:26] ffff82d080937a98 000000fe00000000 ffff82d08029e41a 000000000000e008 (XEN) [2020-01-15 15:01:26] 0000000000000287 ffff830000000000 0000000000000000 0000000000000065 (XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029dd3c 000000008036682b 000082d08036681f (XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08093dd00 0000000000000000 0000000000000000 (XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029de17 ffff82d08023a742 ffff82d0809378c8 (XEN) [2020-01-15 15:01:26] ffff82d08093dd00 ffff82d08027ff48 ffff82d080000000 ffff83008ca2fd98 (XEN) [2020-01-15 15:01:26] ffff82d0000000fb ffff82d08036681f ffff82d08036682b ffff82d08036681f (XEN) [2020-01-15 15:01:27] ffff82d08036682b ffff82d08036681f ffff82d08036682b 0000000000000000 (XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2ffff (XEN) [2020-01-15 15:01:27] 0000000000000000 ffff82d080366894 ffff82d08095e860 ffff830849340424 (XEN) [2020-01-15 15:01:27] ffff82d08095e820 ffff83008ca2fd98 ffff82d080823460 0000000000000002 (XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2fd98 (XEN) [2020-01-15 15:01:27] 00000000000000c1 00000000000003f8 00000000000003fa ffff82d080823460 (XEN) [2020-01-15 15:01:27] 0000000000000004 000000fb00000000 ffff82d08024b590 000000000000e008 (XEN) [2020-01-15 15:01:27] Xen call trace: (XEN) [2020-01-15 15:01:27] [<0000000000000000>] R 0000000000000000 (XEN) [2020-01-15 15:01:27] [<000000009e5ca3c9>] S 000000009e5ca3c9 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d0802758cd>] S arch/x86/flushtlb.c#pre_flush+0x3d/0x70 (XEN) [2020-01-15 15:01:27] [<ffff82d08036331b>] S arch/x86/efi/runtime.c#efi_rs_enter.part.0+0xfb/0x130 (XEN) [2020-01-15 15:01:27] [<ffff82d0803636ae>] S efi_reset_system+0x4e/0x90 (XEN) [2020-01-15 15:01:27] [<ffff82d080363688>] S efi_reset_system+0x28/0x90 (XEN) [2020-01-15 15:01:27] [<ffff82d08029e41a>] S smp_send_stop+0xba/0xc0 (XEN) [2020-01-15 15:01:27] [<ffff82d08029dd3c>] S machine_restart+0x1fc/0x2d0 (XEN) [2020-01-15 15:01:27] [<ffff82d08029de17>] S arch/x86/shutdown.c#__machine_restart+0x7/0x10 (XEN) [2020-01-15 15:01:27] [<ffff82d08023a742>] S smp_call_function_interrupt+0x52/0x90 (XEN) [2020-01-15 15:01:27] [<ffff82d08027ff48>] S do_IRQ+0x2d8/0x760 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08024b590>] S drivers/char/ns16550.c#ns16550_interrupt+0xc0/0xe0 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d080280107>] S do_IRQ+0x497/0x760 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120 (XEN) [2020-01-15 15:01:27] [<ffff82d0802d74dd>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x25d/0x3c0 (XEN) [2020-01-15 15:01:27] [<ffff82d0802d74d8>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x258/0x3c0 (XEN) [2020-01-15 15:01:27] [<ffff82d08023cca9>] S common/tasklet.c#tasklet_softirq_action+0x39/0x60 (XEN) [2020-01-15 15:01:27] [<ffff82d0802700ec>] S arch/x86/domain.c#idle_loop+0x8c/0xa0 (XEN) [2020-01-15 15:01:27] (XEN) [2020-01-15 15:01:27] Pagetable walk from ffffffffeef3f286: (XEN) [2020-01-15 15:01:27] L4[0x1ff] = 0000000000000000 ffffffffffffffff (XEN) [2020-01-15 15:01:27] (XEN) [2020-01-15 15:01:27] **************************************** (XEN) [2020-01-15 15:01:27] Panic on CPU 0: (XEN) [2020-01-15 15:01:27] FATAL PAGE FAULT (XEN) [2020-01-15 15:01:27] [error_code=0002] (XEN) [2020-01-15 15:01:27] Faulting linear address: ffffffffeef3f286 (XEN) [2020-01-15 15:01:27] **************************************** (XEN) [2020-01-15 15:01:27] (XEN) [2020-01-15 15:01:27] Reboot in five seconds... ... Is this a known/fixable issue? If more, specific info is needed here, pls let me know what to provide. -- To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org

2 1

[opensuse-virtual] 15.1 Xen DomU freezing under high network/disk load, recoverable with NMI trigger
by Glen 08 Jan '20

08 Jan '20

Dear OpenSuse Team: Earlier today I sent a request to the list about a 42.3 DomU crashing. Olaf replied, and I've installed the new kernel, and I'll watch and see. I'm very grateful for the help. I'm sorry to post a second question, but I'm having a simliar-but-different problem on a different host and guest, and have reached an impasse. A few weeks ago, I took a copy of our crashy 42.3 DomU guest, and copied it to a new guest, just making a copy of the disk, and changing the name and IP address and booting it on a different physical host. I then did zypper dup from 42.3->15.0->15.1. This was intended as a "test run", if you like, to predict how client software would react to the upgrade. So now I have an upgraded *copy* of my machine, running 15.1. All patches applied. And it's running on a different host, which was a fresh load of 15.1, also with all patches applied. Linux host 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64 x86_64 x86_64 GNU/Linux This guest has a problem as well, in that, under sustained high network/disk loads, the guest freezes up completely. This happened twice today - I can pretty much *make* it happen just by starting a local rsync (i.e. on a crossover cable) of it's main big data partition (3TB).... about every other attempt to copy the entire partition via rsync over ssh will freeze the guest. I get the same annoyingly terse message on the physical host: [92630.531549] vif vif-6-0 vif6.0: Guest Rx stalled [92630.531613] br0: port 2(vif6.0) entered disabled state but, unlike my 42.3 guest, this one gives *no* log outputs or data at all on the guest. No BUG, no CPU lockup, no kernel traceback, nothing. I left a high priority shell on the hvc0 console, which, when the 42.3 guest had its problem, was still sort of responsive, and I left "top -n 1; sleep 15" running in a while true loop on it... but it was completely frozen. I could see the final top before the hang, and there was nothing to suggest a problem. The guest just... hangs. Unlike the frozen 42.3 guest, which showed pretty much continuous "run" state, the 15.1 guest seems to do the more-or-less "normal" behavior in xentop - switching between "b" and "r" modes, and showing normal utilization patterns. But the guest itself is stuck tight. I have seen mentions about the grant frames issue, and I did apply the higher value to the host and guests: # xen-diag gnttab_query_size 0 # Domain-0 domid=0: nr_frames=1, max_nr_frames=64 # xen-diag gnttab_query_size 1 # Xenstore domid=1: nr_frames=4, max_nr_frames=4 # xen-diag gnttab_query_size 6 # My guest domid=6: nr_frames=17, max_nr_frames=256 but this is still happening. Now here's the crazy part: I sat around trying to poke at the frozen guest and try different things before destroying it, and, skimming down my "xl" choices, I found "xl trigger". I had already tried pausing and unpausing the guest - that did nothing. But when I tried xl trigger (at random I tried the first option, so: xl trigger 6 nmi), the guest CAME BACK ONLINE! It said this: Uhhuh. NMI received for unknown reason 00 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue on the console. I also saw it in /var/log/messages, followed by: clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'xen' wd_now: 554c072567f2 wd_last: 54137c19cb3c mask: ffffffffffffffff clocksource: 'tsc' cs_now: 2d696bb78816d4 cs_last: 2d6640097d695e mask: ffffffffffffffff tsc: Marking TSC unstable due to clocksource watchdog On the host in /var/log/messages, I saw: [93760.637546] vif vif-6-0 vif6.0: Guest Rx ready [93760.637595] br0: port 2(vif6.0) entered blocking state [93760.637598] br0: port 2(vif6.0) entered forwarding state And, apart from the rsync/sshd processes (which I suspect the remote side had given up), everything else came right back online. MySQL, for example, was still running on the guest without issue, in fact apart from the log entries I cite above, there was no indication that the machine had even been broken. The 5- and 10-minute load averages were way up in the 30s... but everything else was fine. Prior to the freeze, the guest was continuously showing a load average of about 3.0 - with the rsync and sshd processes in run mode, and that's it - just as I'd expect. The guest is provisioned thusly: name="gggv" description="gggv" uuid="13289776-1c74-9ade-4242-8f7453249832" memory=90112 maxmem=90112 vcpus=26 cpus="4-31" on_poweroff="destroy" on_reboot="restart" on_crash="restart" on_watchdog="restart" localtime=0 keymap="en-us" type="pv" kernel="/usr/lib/grub2/x86_64-xen/grub.xen" extra="elevator=noop" disk=[ '/b/xen/gggv/gggv.root,raw,xvda1,w', '/b/xen/gggv/gggv.swap,raw,xvda2,w', '/b/xen/gggv/gggv.xa,raw,xvdb1,w', ] vif=[ 'mac=00:16:3f:04:05:41,bridge=br0', 'mac=00:16:3f:04:05:42,bridge=br1', ] vfb=['type=vnc,vncunused=1'] and is also the only guest running on its host. The host has: GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256" and is in every other respect an essentially fresh 15.1 load. I'm thinking that this is a different problem than my 42.3 guest problem, but I don't know what to do with it. My next move was to make sure my hardware (and data, and OS!) were okay. So I moved the root filesystem of my upgraded guest aside, and did a fresh load of 15.1 onto a new root filesystem. When I use *that* to boot my guest, it seems to be stable. High network activity does not appear to stop it - I've done 5 or 6 copies of my huge filesystem in that mode without issue. Of course I'd like to do more cycles to be sure, but it seems stable compared to when the upgraded root is in place, when I can make the machine freeze up on almost every (or every other) copy attempt. The only thing I can think of that is different here, then, would be that, maybe, since the guest has been zypper dup'ed over time all the way back from 13.2 (the last time it was built fresh), that maybe it's inherited some old garbage that could be causing this. It seems to me that a zypper dup'ed guest "should" work properly, especially when it is the same version and kernel as the physical host; but, again (sorry) I have these freezes. So just for laughs, I ran an lsmod in both modes, and sorted and diffed them: The "clean" guest (which appears to be stable), has these four kernel modules not present on the upgraded guest: iptable_raw nf_conntrack_ftp nf_nat_ftp xt_CT The "dup'ped" guest (which seems to be crashable on a large local rsync) has these modules not present on a clean install: auth_rpcgss br_netfilter bridge grace intel_rapl ipt_MASQUERADE llc lockd nf_conntrack_netlink nf_nat_masquerade_ipv4 nfnetlink nfs_acl nfsd overlay sb_edac stp sunrpc veth xfrm_algo xfrm_user xt_addrtype xt_nat Both guests share these additional sysctl.conf settings: kernel.panic = 5 vm.panic_on_oom = 2 vm.swappiness = 0 net.ipv6.conf.all.autoconf = 0 net.ipv6.conf.default.autoconf = 0 net.ipv6.conf.eth0.autoconf = 0 net.ipv4.tcp_fin_timeout = 10 net.ipv4.tcp_tw_reuse = 0 The dup'ped guest has these additional sysctl.conf settings: net.ipv4.tcp_tw_recycle = 0 net.core.netdev_max_backlog=300000 net.core.somaxconn = 2048 net.core.rmem_max=67108864 net.core.wmem_max=67108864 net.ipv4.ip_local_port_range=15000 65000 net.ipv4.tcp_sack=0 net.ipv4.tcp_rmem=4096 87380 67108864 net.ipv4.tcp_wmem=4096 65536 67108864 all of which have, more or less, worked well in the past (when everything was on 42.3) and may or may not be relevant here. I'm sorry, I feel like I'm missing something obvious here, but I can't see it. I would be grateful for any guidance or insights into this. Yes, in addition to trying to upgrade my client in place to 15.1, I could just build a new guest by hand, but that would be even more time-consuming and seems like it should not be necessary. If I might quote from the kernel, "Dazed and confused, but trying to continue" is exactly how I'm feeling here. Why could this guest be hanging? Why does an NMI bring it back? What should I do next? Anything anyone would be willing to point me to or suggest would be gratefully appreciated. Glen -- To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org

4 12

[opensuse-virtual] 4.23 Xen DomU's crashing/hanging after upgrading Dom0 to 15.1
by Glen 07 Jan '20

07 Jan '20

Greetings all: I have a number of Xen hosts, and Xen guests on those hosts, all of which have been running reliably for users under 42.3 (and earlier 42.x versions) forever. Up until recently all hosts and guests were at 42.3, with all normal zypper updates applied, and running fine. Recently, the time came to upgrade to 15.1. I proceeded by upgrading the physical hosts to 15.1 first. Following that step, two of my largest and most high-volume 42.3 guests - on two entirely different physical hosts - started crashing every few days. The largest one crashes the most frequently, I'll focus on that. The physical host is a Dell R520 with (Xen showing) 32 CPUs and 128GB of RAM. Linux php1 4.12.14-lp151.28.32-default #1 SMP Wed Nov 13 07:50:15 UTC 2019 (6e1aaad) x86_64 x86_64 x86_64 GNU/Linux (XEN) Xen version 4.12.1_04-lp151.2.6 (abuild(a)suse.de) (gcc (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Tue Nov 5 15:20:06 UTC 2019 (XEN) Latest ChangeSet: (XEN) Bootloader: GRUB2 2.02 (XEN) Command line: dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin The guest is the only guest on this host. (For legacy reasons, it uses physical partitions on the host directly, rather than file-backed storage, but I don't feel like that should be an issue...) name="ghv1" description="ghv1" uuid="c77f49c6-1f72-9ade-4242-8f18e72cbb32" memory=124000 maxmem=124000 vcpus=24 on_poweroff="destroy" on_reboot="restart" on_crash="restart" on_watchdog="restart" localtime=0 keymap="en-us" type="pv" extra="elevator=noop" kernel="/usr/lib/grub2/x86_64-xen/grub.xen" disk=[ '/dev/sda3,,xvda1,w', '/dev/sda5,,xvda2,w', '/dev/sda6,,xvda3,w', '/dev/sdb1,,xvdb1,w', ] vif=[ 'mac=00:16:3e:75:92:4a,bridge=br0', 'mac=00:16:3e:75:92:4b,bridge=br1', ] vfb=['type=vnc,vncunused=1'] It runs: Linux ghv1 4.4.180-102-default #1 SMP Mon Jun 17 13:11:23 UTC 2019 (7cfa20a) x86_64 x86_64 x86_64 GNU/Linux A typical xentop looks like this: xentop - 07:13:03 Xen 4.12.1_04-lp151.2.6 3 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown Mem: 134171184k total, 132922412k used, 1248772k free CPUs: 32 @ 2100MHz NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETT X(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID Domain-0 -----r 607 12.9 4194304 3.1 no limit n/a 4 0 0 0 0 0 0 0 0 0 0 ghv1 -----r 18351 246.5 126976000 94.6 126977024 94.6 24 2 31 9108 3240011 4 0 1132578 205040 31572906 8389002 0 Xenstore --b--- 0 0.0 32760 0.0 1341440 1.0 1 0 0 0 0 0 0 0 0 0 0 This guest is high volume. It runs web servers, mail list servers, databases, docker containers, and is regularly and constantly backed up via rsync over ssh. It is still at 42.3. As mentioned above, when its host was also at 42.3, it ran flawlessly. Only after upgrading the host to 15.1 did these problems start. What happens is this: After between 2 and 10 days of uptime, the guest will start to malfunction, with the following symptoms: 1. All network interfaces (there are two, one main, and one local 192.168.x.x) will disconnect. 2. Guest will exhibit a number of sshd processes apparently running at high CPU. These processes cannot be killed. 3. Guest console will be filled with messages like this: kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303] These messages print 2-3 times in groups every 1-2 seconds. There is no pattern to the CPU IDs, all CPUs appear to be involved. 4. It will become impossible to log in to the guest console. 5. If I already have a high-priority shell logged in on the console, I can run some commands, (like sync), but I cannot cause the guest to shut down (init 0, for example, hangs the console, but the guest does not exit.) I can issue kill commands as hinted above, but they are ignored. 6. xl shutdown is also ineffective. I must xl destroy the guest and re-create it. The guest logs show things like the following (I've removed the "kernel: and timestamps just to make this more clear"): INFO: rcu_sched self-detected stall on CPU 8-...: (15000 ticks this GP) idle=b99/140000000000001/0 softirq=12292658/12292658 fqs=13805 (t=15001 jiffies g=8219341 c=8219340 q=139284) Task dump for CPU 8: sshd R running task 0 886 1 0x0000008c ffffffff81e79100 ffffffff810f10c5 ffff881dae01b300 ffffffff81e79100 0000000000000000 ffffffff81f67e60 ffffffff810f8575 ffffffff81105d2a ffff88125e810280 ffff881dae003d40 0000000000000008 ffff881dae003d08 Call Trace: [<ffffffff8101b0c9>] dump_trace+0x59/0x350 [<ffffffff8101b4ba>] show_stack_log_lvl+0xfa/0x180 [<ffffffff8101c2b1>] show_stack+0x21/0x40 [<ffffffff810f10c5>] rcu_dump_cpu_stacks+0x75/0xa0 [<ffffffff810f8575>] rcu_check_callbacks+0x535/0x7f0 [<ffffffff811010c2>] update_process_times+0x32/0x60 [<ffffffff8110fd00>] tick_sched_handle.isra.17+0x20/0x50 [<ffffffff8110ff78>] tick_sched_timer+0x38/0x60 [<ffffffff81101cf3>] __hrtimer_run_queues+0xf3/0x2a0 [<ffffffff81102179>] hrtimer_interrupt+0x99/0x1a0 [<ffffffff8100d1dc>] xen_timer_interrupt+0x2c/0x170 [<ffffffff810e39ec>] __handle_irq_event_percpu+0x4c/0x1d0 [<ffffffff810e3b90>] handle_irq_event_percpu+0x20/0x50 [<ffffffff810e7407>] handle_percpu_irq+0x37/0x50 [<ffffffff810e3174>] generic_handle_irq+0x24/0x30 [<ffffffff8142dce8>] __evtchn_fifo_handle_events+0x168/0x180 [<ffffffff8142aec9>] __xen_evtchn_do_upcall+0x49/0x80 [<ffffffff8142cb4c>] xen_evtchn_do_upcall+0x2c/0x50 [<ffffffff81655c6e>] xen_do_hypervisor_callback+0x1e/0x40 DWARF2 unwinder stuck at xen_do_hypervisor_callback+0x1e/0x40 Leftover inexact backtrace: <IRQ> <EOI> [<ffffffff81073840>] ? leave_mm+0xc0/0xc0 [<ffffffff81115e63>] ? smp_call_function_many+0x203/0x260 [<ffffffff81073840>] ? leave_mm+0xc0/0xc0 [<ffffffff81115f26>] ? on_each_cpu+0x36/0x70 [<ffffffff81074078>] ? flush_tlb_kernel_range+0x38/0x60 [<ffffffff811a8c17>] ? __alloc_pages_nodemask+0x117/0xbf0 [<ffffffff811fd14a>] ? kmem_cache_alloc_node_trace+0xaa/0x4d0 [<ffffffff811df823>] ? __purge_vmap_area_lazy+0x313/0x390 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed The above comes in all at once. Then every second or two thereafter, I see this: NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303] Modules linked in: ipt_REJECT nf_reject_ipv4 binfmt_misc veth nf_conntrack_ipv6 nf_defrag_ ipv6 xt_pkttype ip6table_filter ip6_tables xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquera de_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 n f_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_na t nf_conntrack br_netfilter bridge stp llc overlay af_packet iscsi_ibft iscsi_boot_sysfs i ntel_rapl sb_edac edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev xen_fbfront drbg fb_sys_fops syscopyarea sysfillrect xen_kbdfront ansi_cprng sysim gblt xen_netfront aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper crypt d nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_m ultipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1 task: ffff881a44554ac0 ti: ffff8807b7d34000 task.ti: ffff8807b7d34000 RIP: e030:[<ffffffff810013ac>] [<ffffffff810013ac>] xen_hypercall_sched_op+0xc/0x20 RSP: e02b:ffff8807b7d37c10 EFLAGS: 00000206 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810013ac RDX: 0000000000000000 RSI: ffff8807b7d37c30 RDI: 0000000000000003 RBP: 0000000000000071 R08: 0000000000000000 R09: ffff880191804908 R10: ffff880191804ab8 R11: 0000000000000206 R12: ffffffff8237c178 R13: 0000000000440000 R14: 0000000000000100 R15: 0000000000000000 FS: 00007ff9142bd700(0000) GS:ffff881dae200000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffedcb82f56 CR3: 0000001a1d860000 CR4: 0000000000040660 Stack: 0000000000000000 00000000fffffffa ffffffff8142bd40 0000007400000003 ffff8807b7d37c2c ffffffff00000001 0000000000000000 ffff881dae2120d0 ffffffff81015b07 00000003810d34e4 ffffffff8237c178 ffff881dae21afc0 Call Trace: Inexact backtrace: [<ffffffff8142bd40>] ? xen_poll_irq_timeout+0x40/0x50 [<ffffffff81015b07>] ? xen_qlock_wait+0x77/0x80 [<ffffffff810d3637>] ? __pv_queued_spin_lock_slowpath+0x227/0x260 [<ffffffff8119edb4>] ? queued_spin_lock_slowpath+0x7/0xa [<ffffffff811df626>] ? __purge_vmap_area_lazy+0x116/0x390 [<ffffffff810ac942>] ? ___might_sleep+0xe2/0x120 [<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140 [<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420 [<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40 [<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0 [<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0 [<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0 [<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110 [<ffffffff811504a2>] ? do_seccomp+0x112/0x670 [<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90 [<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0 [<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc [<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed Code: 41 53 48 c7 c0 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 48 c7 c0 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc c c cc cc cc cc cc 51 After about 30 seconds or so, I note that there is a slight shift, in that this line: CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1 changes to something like: CPU: 15 PID: 1357 Comm: sshd Tainted: G L 4.4.180-102-default #1 The above log group continues to log, every few seconds, forever, until I kill the guest. The physical host is not impacted. It remains up, alive, connected to its networks, and functioning properly. The only output I get on the physical host is a one-time report: vif vif-6-0 vif6.0: Guest Rx stalled br0: port 2(vif6.0) entered disabled state Steps I have taken: 1, I initially thought this might be a problem in openssh. There are reports on the net about a vulnerability in openssh versions prior to 7.3 (42.3 is at 7.2p2) in which a long string can be sent to sshd from the outside world and cause it to spin (and lock) out of control. I disabled that version of sshd on the guest, and installed the (then) latest version of openssh: 8.1p1. The problem persisted. 2. I have tried ifdown/ifup from within the guest to try to make the network reconnect, to no avail. 3. I have tried to unplug and replug the guest network from the host, to make the network reconnect, also to no avail. 4. Thinking that this might be related to recent reports of issues with grant tables in the blkfront driver, I checked usage on the DomU when it was spinning: /usr/sbin/xen-diag gnttab_query_size 6 domid=6: nr_frames=15, max_nr_frames=32 So it doesn't seem to be related to that issue. (DomID was 6 because four crashes since last physical host reboot, ugh.) I have adjusted the physical host to 256 as a number of people online recommended, but just did that this morning. I now see: /usr/sbin/xen-diag gnttab_query_size 2 domid=2: nr_frames=14, max_nr_frames=256 but again the exhaustion issue doesn't *seem* to have happened here... although I could be wrong. Because of the nature of the problem, the Xen oncrash action isn't triggered. The host can't tell that the guest has crashed, and it really hasn't crashed, it's just spinning, eating up CPU. The only thing I can do is destroy the guest, and recreate it. So where I am now is I'm remotely polling the machine from distant lands, every 60 seconds, and having myself paged out every time there is a crash in the hope I can try something else... but I am now out of something elses to try. The guest in question is a high-profile, high-usage guest for a client that expects 24/7 uptime... so this is, to me, rather a serious problem. I realize that the solution here may be "just upgrade the guest to 15.1"; however, I have two problems: 1. I cannot upgrade the guest until I have support from my customer's staff who can address their software compatibility issues pertaining to the differences in Python, PHP, etc., between 42.3 and 15.1... so I'm stuck here for a while. 2. In the process of running a new 15.1 guest on yet a third, different 15.1 host, I experienced a lockup on the guest there - which had no log entries at all and may be unrelated; however, it, too, was only running network/disk-intensive rsyncs at the time. I may need to post a seprate thread about that later; I'm not done taking debugging steps there yet. In short, I'm out of options. It seems to me that running a 42.3 guest on a 15.1 host shoud work, yet I am having these crashes. Thank you in advance for any help/guidance/pointers/cluebats. Glen -- To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org

2 4