openSUSE Virtual
January 2020: 6 participants, 4 discussions
[opensuse-virtual] full cstate/cpufreq/cpupower support withOUT Xen 4.13 on kernel 5.4.14; WITH xen, none at all. bug or config?
by PGNet Dev, 28 Jan '20
(I'd already posted this at xen-users; no traction to date.)
I'm running
lsb_release -rd
Description: openSUSE Leap 15.1
Release: 15.1
uname -rm
5.4.14-24.gfc4ea7a-default x86_64
dmesg | grep DMI:
[ 0.000000] DMI: Supermicro X10SAT/X10SAT, BIOS 3.0 05/26/2015
cat /proc/cpuinfo | grep "model name" | head -n 1
model name : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
kernel & xen are pkg-installed from my KernelStable and Virtualization-Xen repos @ OBS.
BIOS *is* set up for max c-state support.
The Xeon E3-1220 does support the intel_pstate driver.
Testing first:
(1) boot, NO XEN
pstate driver's init'd
dmesg | egrep -i "intel_pstate"
[ 6.132964] intel_pstate: Intel P-state driver initializing
pstate/cstate info
cat /sys/module/intel_idle/parameters/max_cstate
9
cd /sys/devices/system/cpu/cpu0/cpuidle
for state in state{0..9}
do echo c-$state `cat $state/name` `cat $state/latency`
done
c-state0 POLL 0
c-state1 C1 2
c-state2 C1E 10
c-state3 C3 33
c-state4 C6 133
c-state5 C7s 166
cat: state6/name: No such file or directory
cat: state6/latency: No such file or directory
c-state6
cat: state7/name: No such file or directory
cat: state7/latency: No such file or directory
c-state7
cat: state8/name: No such file or directory
cat: state8/latency: No such file or directory
c-state8
cat: state9/name: No such file or directory
cat: state9/latency: No such file or directory
c-state9
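(For reference, a slightly tidier variant of that loop -- same idea, assuming the same sysfs layout -- that only visits the state directories which actually exist:)
cd /sys/devices/system/cpu/cpu0/cpuidle
for state in state*   # the glob only matches stateN directories that are present
do echo c-$state `cat $state/name` `cat $state/latency`
done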
cpufreq scaling info's available,
cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 800 MHz - 3.50 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 800 MHz and 3.50 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 799 MHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
& scaling is in effect,
cat /proc/cpuinfo | grep MHz
cpu MHz : 798.106
cpu MHz : 798.129
cpu MHz : 798.964
cpu MHz : 798.154
(2) boot, WITH Xen 4.13
rpm -qa | grep -i xen | sort
grub2-x86_64-xen-2.04-lp151.6.5.noarch
xen-4.13.0_04-lp151.688.2.x86_64
xen-libs-4.13.0_04-lp151.688.2.x86_64
xen-tools-4.13.0_04-lp151.688.2.x86_64
Xen cmd line includes,
grep options= /boot/grub2/xen-4.13.0_04-lp151.688.cfg
[config.1]
options=dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M dom0_max_vcpus=4 cpufreq=xen cpuidle ucode=scan ...
intel_pstate support is now DISABLED for this cpu
xl dmesg | grep pstate
[ 6.851121] intel_pstate: CPU model not supported
c-states report,
xenpm get-cpuidle-states 0
All C-states allowed
cpu id : 0
total C-states : 6
idle time(ms) : 45780911
C0 : transition [ 3204855]
residency [ 160769 ms]
C1 : transition [ 9204]
residency [ 1018 ms]
C2 : transition [ 10181]
residency [ 2848 ms]
C3 : transition [ 22784]
residency [ 17236 ms]
C4 : transition [ 7181]
residency [ 11793 ms]
C5 : transition [ 3155504]
residency [ 45668846 ms]
pc2 : [ 1685 ms]
pc3 : [ 30695 ms]
cc3 : [ 16858 ms]
cc6 : [ 11640 ms]
cc7 : [ 45602872 ms]
NO cpupower frequency-info is available
cpupower frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
CPUs which run at the same hardware frequency: Not Available
CPUs which need to have their frequency coordinated by software: Not Available
maximum transition latency: Cannot determine or is not supported.
Not Available
available cpufreq governors: Not Available
Unable to determine current policy
current CPU frequency: Unable to call hardware
current CPU frequency: Unable to call to kernel
boost state support:
Supported: no
Active: no
and scaling is NOT in effect
cat /proc/cpuinfo | grep MHz
cpu MHz : 3092.828
cpu MHz : 3092.828
cpu MHz : 3092.828
cpu MHz : 3092.828
attempt to add acpi-cpufreq module fails
lsmod | grep acpi-cpufreq
(empty)
find /lib/modules/ | grep acpi-cpu
/lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko
modprobe acpi-cpufreq
modprobe: ERROR: could not insert 'acpi_cpufreq': No such device
insmod /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko
insmod: ERROR: could not insert module /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko: No such device
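For what it's worth, with cpufreq=xen on the hypervisor command line, P-state control is meant to be owned by Xen itself rather than by dom0, so the dom0 drivers refusing to bind may be expected behaviour; in that case any frequency data should be visible from the hypervisor side instead (a hedged check, assuming the xenpm subcommands shipped with Xen 4.13):
xenpm get-cpufreq-para 0     # governor, available frequencies, current frequency for CPU 0
xenpm get-cpufreq-states 0   # P-state residency counters for CPU 0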
Is this bug, or config?
[opensuse-virtual] xen 4.13 + kernel 5.4.11 'APIC Error ... FATAL PAGE FAULT' on reboot? non-Xen reboot's ok.
by PGNet Dev, 16 Jan '20
I've a recently upgraded (pkgs via zypper) server, running
Xen 4.13.0_04
on EFI hardware + Intel Xeon E3 CPU, with kernel
5.4.11-24.g2d02eb4-default
on
lsb_release -rd
Description: openSUSE Leap 15.1
Release: 15.1
It boots as always, with no issue
Welcome to GRUB!
Please press t to show the boot menu on this console
Xen 4.13.0_04-lp151.688 (c/s ) EFI loader
Using configuration file 'xen-4.13.0_04-lp151.688.cfg'
vmlinuz-5.4.11-24.g2d02eb4-default: 0x000000008b7c0000-0x000000008c04efb8
initrd-5.4.11-24.g2d02eb4-default: 0x000000008a4a5000-0x000000008b7bfe28
0x0000:0x00:0x19.0x0: ROM: 0x10000 bytes at 0x928a9018
0x0000:0x04:0x00.0x0: ROM: 0x8000 bytes at 0x928a0018
0x0000:0x10:0x00.0x0: ROM: 0x10800 bytes at 0x92885018
__ __
\ \/ /___ _ __
\ // _ \ '_ \
/ \ __/ | | |
/_/\_\___|_| |_|
_ _ _ _____ ___ ___ _ _ _ _ ____ _ __ ___ ___
| || | / |___ / / _ \ / _ \| || | | |_ __ / | ___|/ | / /_ ( _ ) ( _ )
| || |_ | | |_ \| | | | | | | | || |_ __| | '_ \| |___ \| || '_ \ / _ \ / _ \
|__ _|| |___) | |_| | | |_| |__ _|__| | |_) | |___) | || (_) | (_) | (_) |
|_|(_)_|____(_)___/___\___/ |_| |_| .__/|_|____/|_(_)___/ \___/ \___/
|_____| |_|
(XEN) [00000026c8dc8909] Xen version 4.13.0_04-lp151.688 (abuild(a)suse.de)
(gcc (SUSE Linux) 9.2.1 20200109 [gcc-9-branch revision 280039]) debug=n Wed Jan 8 11:43:04 UTC 2020
(XEN) [00000026cbd609dc] Latest ChangeSet:
(XEN) [00000026cc9505ea] Bootloader: EFI
(XEN) [00000026cd46f20f] Command line: dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M bootscrub=false
dom0_max_vcpus=4 vga=gfx-1920x1080x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring
conring_size=64 sched=credit2 ucode=scan log_buf_len=16M loglvl=warning guest_loglvl=none/warning noreboot=false iommu=verbose sync_console=false
...
on exec of cmdline shutdown from shell,
shutdown -r now
the system DOES reboot, but first throws an APIC error -- only when running Xen; a reboot with no hypervisor has no problems.
1st step, here's the current, relevant _log_ trace
...
[ OK ] Reached target Shutdown.
[ 343.932856] watchdog: watchdog0: watchdog did not stop!
[ 346.871303] watchdog: watchdog0: watchdog did not stop!
dracut Warning: Killing all remaining processes
mdadm: stopped /dev/md4
mdadm: stopped /dev/md3
mdadm: stopped /dev/md2
mdadm: stopped /dev/md1
mdadm: stopped /dev/md0
Rebooting.
[ 352.396918] reboot: Restarting system
(XEN) [2020-01-15 15:01:26] Hardware Dom0 shutdown: rebooting machine
(XEN) [2020-01-15 15:01:26] APIC error on CPU0: 40(00)
(XEN) [2020-01-15 15:01:26] ----[ Xen-4.13.0_04-lp151.688 x86_64 debug=n Not tainted ]----
(XEN) [2020-01-15 15:01:26] CPU: 0
(XEN) [2020-01-15 15:01:26] RIP: e008:[<0000000000000000>] 0000000000000000
(XEN) [2020-01-15 15:01:26] RFLAGS: 0000000000010202 CONTEXT: hypervisor
(XEN) [2020-01-15 15:01:26] rax: 0000000000000286 rbx: 0000000000000000 rcx: 0000000000000000
(XEN) [2020-01-15 15:01:26] rdx: 000000009e5ca7a0 rsi: 0000000000000000 rdi: 0000000000000000
(XEN) [2020-01-15 15:01:26] rbp: 0000000000000000 rsp: ffff83008ca2fa48 r8: ffff83008ca2fa90
(XEN) [2020-01-15 15:01:26] r9: ffff83008ca2fa80 r10: 0000000000000000 r11: 0000000000000000
(XEN) [2020-01-15 15:01:26] r12: 0000000000000000 r13: ffff83008ca2fb00 r14: ffff83008ca2ffff
(XEN) [2020-01-15 15:01:26] r15: 0000000000000000 cr0: 0000000080050033 cr4: 00000000001526e0
(XEN) [2020-01-15 15:01:26] cr3: 00000008492ed000 cr2: ffffffffeef3f286
(XEN) [2020-01-15 15:01:26] fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000
(XEN) [2020-01-15 15:01:26] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [2020-01-15 15:01:26] Xen code around <0000000000000000> (0000000000000000) [fault on access]:
(XEN) [2020-01-15 15:01:26] -- -- -- -- -- -- -- -- <00> 80 00 f0 f3 ee 00 f0 c3 e2 00 f0 f3 ee 00 f0
(XEN) [2020-01-15 15:01:26] Xen stack trace from rsp=ffff83008ca2fa48:
(XEN) [2020-01-15 15:01:26] 000000009e5ca3c9 ffff82d08036681f ffff82d08036682b 0000000000000000
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff83008ca2fa88 0000000000000000 00000000001526e0
(XEN) [2020-01-15 15:01:26] ffff82d0802758cd 0000000000000286 0000000000000286 0000000000000000
(XEN) [2020-01-15 15:01:26] 000000009efe42f6 0000000000000000 0000000000000000 ffff83008ca2fb00
(XEN) [2020-01-15 15:01:26] ffff82d08036331b 0000000000152660 ffff82d0803636ae 0000000000000000
(XEN) [2020-01-15 15:01:26] ffff83008ca2fb48 0000000000000000 ffff82d080363688 000000008ca1f000
(XEN) [2020-01-15 15:01:26] ffff82d080937a98 000000fe00000000 ffff82d08029e41a 000000000000e008
(XEN) [2020-01-15 15:01:26] 0000000000000287 ffff830000000000 0000000000000000 0000000000000065
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029dd3c 000000008036682b 000082d08036681f
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08093dd00 0000000000000000 0000000000000000
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029de17 ffff82d08023a742 ffff82d0809378c8
(XEN) [2020-01-15 15:01:26] ffff82d08093dd00 ffff82d08027ff48 ffff82d080000000 ffff83008ca2fd98
(XEN) [2020-01-15 15:01:26] ffff82d0000000fb ffff82d08036681f ffff82d08036682b ffff82d08036681f
(XEN) [2020-01-15 15:01:27] ffff82d08036682b ffff82d08036681f ffff82d08036682b 0000000000000000
(XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2ffff
(XEN) [2020-01-15 15:01:27] 0000000000000000 ffff82d080366894 ffff82d08095e860 ffff830849340424
(XEN) [2020-01-15 15:01:27] ffff82d08095e820 ffff83008ca2fd98 ffff82d080823460 0000000000000002
(XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2fd98
(XEN) [2020-01-15 15:01:27] 00000000000000c1 00000000000003f8 00000000000003fa ffff82d080823460
(XEN) [2020-01-15 15:01:27] 0000000000000004 000000fb00000000 ffff82d08024b590 000000000000e008
(XEN) [2020-01-15 15:01:27] Xen call trace:
(XEN) [2020-01-15 15:01:27] [<0000000000000000>] R 0000000000000000
(XEN) [2020-01-15 15:01:27] [<000000009e5ca3c9>] S 000000009e5ca3c9
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d0802758cd>] S arch/x86/flushtlb.c#pre_flush+0x3d/0x70
(XEN) [2020-01-15 15:01:27] [<ffff82d08036331b>] S arch/x86/efi/runtime.c#efi_rs_enter.part.0+0xfb/0x130
(XEN) [2020-01-15 15:01:27] [<ffff82d0803636ae>] S efi_reset_system+0x4e/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d080363688>] S efi_reset_system+0x28/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d08029e41a>] S smp_send_stop+0xba/0xc0
(XEN) [2020-01-15 15:01:27] [<ffff82d08029dd3c>] S machine_restart+0x1fc/0x2d0
(XEN) [2020-01-15 15:01:27] [<ffff82d08029de17>] S arch/x86/shutdown.c#__machine_restart+0x7/0x10
(XEN) [2020-01-15 15:01:27] [<ffff82d08023a742>] S smp_call_function_interrupt+0x52/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d08027ff48>] S do_IRQ+0x2d8/0x760
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08024b590>] S drivers/char/ns16550.c#ns16550_interrupt+0xc0/0xe0
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080280107>] S do_IRQ+0x497/0x760
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d0802d74dd>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x25d/0x3c0
(XEN) [2020-01-15 15:01:27] [<ffff82d0802d74d8>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x258/0x3c0
(XEN) [2020-01-15 15:01:27] [<ffff82d08023cca9>] S common/tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN) [2020-01-15 15:01:27] [<ffff82d0802700ec>] S arch/x86/domain.c#idle_loop+0x8c/0xa0
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] Pagetable walk from ffffffffeef3f286:
(XEN) [2020-01-15 15:01:27] L4[0x1ff] = 0000000000000000 ffffffffffffffff
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] ****************************************
(XEN) [2020-01-15 15:01:27] Panic on CPU 0:
(XEN) [2020-01-15 15:01:27] FATAL PAGE FAULT
(XEN) [2020-01-15 15:01:27] [error_code=0002]
(XEN) [2020-01-15 15:01:27] Faulting linear address: ffffffffeef3f286
(XEN) [2020-01-15 15:01:27] ****************************************
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] Reboot in five seconds...
...
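Since the fault is taken inside efi_reset_system(), one workaround that might be worth testing (option names assumed from the Xen command-line documentation, so treat this as a sketch) is steering the hypervisor away from the EFI runtime-services reset path, e.g. in the xen .cfg options line:
options=... efi=no-rs     # disable EFI runtime services entirely, or
options=... reboot=acpi   # keep runtime services but reset via ACPI instead of EFI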
Is this a known/fixable issue?
If more specific info is needed here, pls let me know what to provide.
[opensuse-virtual] 15.1 Xen DomU freezing under high network/disk load, recoverable with NMI trigger
by Glen, 08 Jan '20
Dear openSUSE Team:
Earlier today I sent a request to the list about a 42.3 DomU crashing.
Olaf replied, and I've installed the new kernel, and I'll watch and
see. I'm very grateful for the help. I'm sorry to post a second
question, but I'm having a similar-but-different problem on a
different host and guest, and have reached an impasse.
A few weeks ago, I took a copy of our crashy 42.3 DomU guest, and
copied it to a new guest, just making a copy of the disk, and changing
the name and IP address and booting it on a different physical host.
I then did zypper dup from 42.3->15.0->15.1. This was intended as a
"test run", if you like, to predict how client software would react to
the upgrade. So now I have an upgraded *copy* of my machine, running
15.1. All patches applied. And it's running on a different host,
which was a fresh load of 15.1, also with all patches applied.
Linux host 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC
2019 (8f4a495) x86_64 x86_64 x86_64 GNU/Linux
This guest has a problem as well, in that, under sustained high
network/disk loads, the guest freezes up completely. This happened
twice today - I can pretty much *make* it happen just by starting a
local rsync (i.e. on a crossover cable) of its main big data
partition (3TB).... about every other attempt to copy the entire
partition via rsync over ssh will freeze the guest. I get the same
annoyingly terse message on the physical host:
[92630.531549] vif vif-6-0 vif6.0: Guest Rx stalled
[92630.531613] br0: port 2(vif6.0) entered disabled state
but, unlike my 42.3 guest, this one gives *no* log outputs or data at
all on the guest. No BUG, no CPU lockup, no kernel traceback,
nothing. I left a high priority shell on the hvc0 console, which,
when the 42.3 guest had its problem, was still sort of responsive, and
I left "top -n 1; sleep 15" running in a while true loop on it... but
it was completely frozen. I could see the final top before the hang,
and there was nothing to suggest a problem. The guest just... hangs.
Unlike the frozen 42.3 guest, which showed pretty much continuous
"run" state, the 15.1 guest seems to do the more-or-less "normal"
behavior in xentop - switching between "b" and "r" modes, and showing
normal utilization patterns. But the guest itself is stuck tight.
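(One way to see where the stuck vCPUs actually are, from the host side, might be the xenctx helper shipped with xen-tools; a sketch, assuming its usual install path on Leap, with the guest's own System.map supplied for symbol names -- the path below is only an example:)
/usr/lib/xen/bin/xenctx -s /path/to/guest/System.map <domid> 0   # dump vCPU 0 registers and stack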
I have seen mentions about the grant frames issue, and I did apply the
higher value to the host and guests:
# xen-diag gnttab_query_size 0 # Domain-0
domid=0: nr_frames=1, max_nr_frames=64
# xen-diag gnttab_query_size 1 # Xenstore
domid=1: nr_frames=4, max_nr_frames=4
# xen-diag gnttab_query_size 6 # My guest
domid=6: nr_frames=17, max_nr_frames=256
but this is still happening.
Now here's the crazy part:
I sat around trying to poke at the frozen guest and try different
things before destroying it, and, skimming down my "xl" choices, I
found "xl trigger". I had already tried pausing and unpausing the
guest - that did nothing. But when I tried xl trigger (at random I
tried the first option, so: xl trigger 6 nmi), the guest CAME BACK
ONLINE! It said this:
Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
on the console. I also saw it in /var/log/messages, followed by:
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc'
as unstable because the skew is too large:
clocksource: 'xen' wd_now: 554c072567f2 wd_last: 54137c19cb3c mask:
ffffffffffffffff
clocksource: 'tsc' cs_now: 2d696bb78816d4 cs_last: 2d6640097d695e
mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
On the host in /var/log/messages, I saw:
[93760.637546] vif vif-6-0 vif6.0: Guest Rx ready
[93760.637595] br0: port 2(vif6.0) entered blocking state
[93760.637598] br0: port 2(vif6.0) entered forwarding state
And, apart from the rsync/sshd processes (which I suspect the remote
side had given up), everything else came right back online. MySQL,
for example, was still running on the guest without issue, in fact
apart from the log entries I cite above, there was no indication that
the machine had even been broken. The 5- and 10-minute load averages
were way up in the 30s... but everything else was fine.
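One possible next diagnostic step (a sketch, assuming the stock kernel.unknown_nmi_panic sysctl and a working kdump setup inside the guest) would be to make the guest treat an unexpected NMI as a panic, so that the next "xl trigger <domid> nmi" produces a crash dump to analyze instead of just waking the guest up:
# inside the guest
sysctl -w kernel.unknown_nmi_panic=1
# or persistently, in /etc/sysctl.conf
kernel.unknown_nmi_panic = 1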
Prior to the freeze, the guest was continuously showing a load average
of about 3.0 - with the rsync and sshd processes in run mode, and
that's it - just as I'd expect. The guest is provisioned thusly:
name="gggv"
description="gggv"
uuid="13289776-1c74-9ade-4242-8f7453249832"
memory=90112
maxmem=90112
vcpus=26
cpus="4-31"
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
extra="elevator=noop"
disk=[
'/b/xen/gggv/gggv.root,raw,xvda1,w',
'/b/xen/gggv/gggv.swap,raw,xvda2,w',
'/b/xen/gggv/gggv.xa,raw,xvdb1,w',
]
vif=[
'mac=00:16:3f:04:05:41,bridge=br0',
'mac=00:16:3f:04:05:42,bridge=br1',
]
vfb=['type=vnc,vncunused=1']
and is also the only guest running on its host. The host has:
GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
gnttab_max_frames=256" and is in every other respect an essentially
fresh 15.1 load.
I'm thinking that this is a different problem than my 42.3 guest
problem, but I don't know what to do with it.
My next move was to make sure my hardware (and data, and OS!) were
okay. So I moved the root filesystem of my upgraded guest aside, and
did a fresh load of 15.1 onto a new root filesystem. When I use
*that* to boot my guest, it seems to be stable. High network activity
does not appear to stop it - I've done 5 or 6 copies of my huge
filesystem in that mode without issue. Of course I'd like to do more
cycles to be sure, but it seems stable compared to when the upgraded
root is in place, when I can make the machine freeze up on almost
every (or every other) copy attempt.
The only thing I can think of that is different here, then, would be
that, since the guest has been zypper dup'ed over time all the
way back from 13.2 (the last time it was built fresh), maybe it's
inherited some old garbage that could be causing this. It seems to
me that a zypper dup'ed guest "should" work properly, especially when
it is the same version and kernel as the physical host; but, again
(sorry) I have these freezes.
So just for laughs, I ran an lsmod in both modes, and sorted and diffed them:
The "clean" guest (which appears to be stable), has these four kernel
modules not present on the upgraded guest:
iptable_raw
nf_conntrack_ftp
nf_nat_ftp
xt_CT
The "dup'ped" guest (which seems to be crashable on a large local
rsync) has these modules not present on a clean install:
auth_rpcgss
br_netfilter
bridge
grace
intel_rapl
ipt_MASQUERADE
llc
lockd
nf_conntrack_netlink
nf_nat_masquerade_ipv4
nfnetlink
nfs_acl
nfsd
overlay
sb_edac
stp
sunrpc
veth
xfrm_algo
xfrm_user
xt_addrtype
xt_nat
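(For reference, the comparison above is essentially just the sorted module-name column diffed between the two boots; the file names here are only illustrative:)
lsmod | awk 'NR>1 {print $1}' | sort > /tmp/modules-clean.txt    # booted from the fresh 15.1 root
lsmod | awk 'NR>1 {print $1}' | sort > /tmp/modules-dupped.txt   # booted from the dup'ped root
diff /tmp/modules-clean.txt /tmp/modules-dupped.txt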
Both guests share these additional sysctl.conf settings:
kernel.panic = 5
vm.panic_on_oom = 2
vm.swappiness = 0
net.ipv6.conf.all.autoconf = 0
net.ipv6.conf.default.autoconf = 0
net.ipv6.conf.eth0.autoconf = 0
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 0
The dup'ped guest has these additional sysctl.conf settings:
net.ipv4.tcp_tw_recycle = 0
net.core.netdev_max_backlog=300000
net.core.somaxconn = 2048
net.core.rmem_max=67108864
net.core.wmem_max=67108864
net.ipv4.ip_local_port_range=15000 65000
net.ipv4.tcp_sack=0
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 65536 67108864
all of which have, more or less, worked well in the past (when
everything was on 42.3) and may or may not be relevant here.
I'm sorry, I feel like I'm missing something obvious here, but I can't
see it. I would be grateful for any guidance or insights into this.
Yes, in addition to trying to upgrade my client in place to 15.1, I
could just build a new guest by hand, but that would be even more
time-consuming and seems like it should not be necessary. If I might
quote from the kernel, "Dazed and confused, but trying to continue" is
exactly how I'm feeling here. Why could this guest be hanging? Why
does an NMI bring it back? What should I do next? Anything anyone
would be willing to point me to or suggest would be gratefully
appreciated.
Glen
by Glen, 07 Jan '20
Greetings all:
I have a number of Xen hosts, and Xen guests on those hosts, all of
which have been running reliably for users under 42.3 (and earlier
42.x versions) forever. Up until recently all hosts and guests were
at 42.3, with all normal zypper updates applied, and running fine.
Recently, the time came to upgrade to 15.1. I proceeded by upgrading
the physical hosts to 15.1 first. Following that step, two of my
largest and most high-volume 42.3 guests - on two entirely different
physical hosts - started crashing every few days. The largest one
crashes the most frequently, I'll focus on that.
The physical host is a Dell R520 with (Xen showing) 32 CPUs and 128GB of RAM.
Linux php1 4.12.14-lp151.28.32-default #1 SMP Wed Nov 13 07:50:15 UTC
2019 (6e1aaad) x86_64 x86_64 x86_64 GNU/Linux
(XEN) Xen version 4.12.1_04-lp151.2.6 (abuild(a)suse.de) (gcc (SUSE
Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Tue Nov
5 15:20:06 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin
The guest is the only guest on this host. (For legacy reasons, it
uses physical partitions on the host directly, rather than file-backed
storage, but I don't feel like that should be an issue...)
name="ghv1"
description="ghv1"
uuid="c77f49c6-1f72-9ade-4242-8f18e72cbb32"
memory=124000
maxmem=124000
vcpus=24
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
extra="elevator=noop"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
disk=[
'/dev/sda3,,xvda1,w',
'/dev/sda5,,xvda2,w',
'/dev/sda6,,xvda3,w',
'/dev/sdb1,,xvdb1,w',
]
vif=[
'mac=00:16:3e:75:92:4a,bridge=br0',
'mac=00:16:3e:75:92:4b,bridge=br1',
]
vfb=['type=vnc,vncunused=1']
It runs:
Linux ghv1 4.4.180-102-default #1 SMP Mon Jun 17 13:11:23 UTC 2019
(7cfa20a) x86_64 x86_64 x86_64 GNU/Linux
A typical xentop looks like this:
xentop - 07:13:03 Xen 4.12.1_04-lp151.2.6
3 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 134171184k total, 132922412k used, 1248772k free CPUs: 32 @ 2100MHz
NAME      STATE   CPU(sec)  CPU(%)     MEM(k)  MEM(%)  MAXMEM(k)  MAXMEM(%)  VCPUS  NETS
Domain-0  -----r       607    12.9    4194304     3.1   no limit        n/a      4     0
ghv1      -----r     18351   246.5  126976000    94.6  126977024       94.6     24     2
Xenstore  --b---         0     0.0      32760     0.0    1341440        1.0      1     0
NAME      NETTX(k)  NETRX(k)  VBDS  VBD_OO   VBD_RD  VBD_WR  VBD_RSECT  VBD_WSECT  SSID
Domain-0         0         0     0       0        0       0          0          0     0
ghv1        319108   3240011     4       0  1132578  205040   31572906    8389002     0
Xenstore         0         0     0       0        0       0          0          0     0
This guest is high volume. It runs web servers, mail list servers,
databases, docker containers, and is regularly and constantly backed
up via rsync over ssh. It is still at 42.3. As mentioned above, when
its host was also at 42.3, it ran flawlessly. Only after upgrading
the host to 15.1 did these problems start.
What happens is this:
After between 2 and 10 days of uptime, the guest will start to
malfunction, with the following symptoms:
1. All network interfaces (there are two, one main, and one local
192.168.x.x) will disconnect.
2. Guest will exhibit a number of sshd processes apparently running at
high CPU. These processes cannot be killed.
3. Guest console will be filled with messages like this:
kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck
for 67s! [sshd:1303]
These messages print 2-3 times in groups every 1-2 seconds. There is
no pattern to the CPU IDs, all CPUs appear to be involved.
4. It will become impossible to log in to the guest console.
5. If I already have a high-priority shell logged in on the console, I
can run some commands (like sync), but I cannot cause the guest to
shut down (init 0, for example, hangs the console, but the guest does
not exit.) I can issue kill commands as hinted above, but they are
ignored.
6. xl shutdown is also ineffective. I must xl destroy the guest and
re-create it.
The guest logs show things like the following (I've removed the
"kernel:" prefix and timestamps just to make this more clear):
INFO: rcu_sched self-detected stall on CPU
8-...: (15000 ticks this GP) idle=b99/140000000000001/0
softirq=12292658/12292658 fqs=13805
(t=15001 jiffies g=8219341 c=8219340 q=139284)
Task dump for CPU 8:
sshd R running task 0 886 1 0x0000008c
ffffffff81e79100 ffffffff810f10c5 ffff881dae01b300 ffffffff81e79100
0000000000000000 ffffffff81f67e60 ffffffff810f8575 ffffffff81105d2a
ffff88125e810280 ffff881dae003d40 0000000000000008 ffff881dae003d08
Call Trace:
[<ffffffff8101b0c9>] dump_trace+0x59/0x350
[<ffffffff8101b4ba>] show_stack_log_lvl+0xfa/0x180
[<ffffffff8101c2b1>] show_stack+0x21/0x40
[<ffffffff810f10c5>] rcu_dump_cpu_stacks+0x75/0xa0
[<ffffffff810f8575>] rcu_check_callbacks+0x535/0x7f0
[<ffffffff811010c2>] update_process_times+0x32/0x60
[<ffffffff8110fd00>] tick_sched_handle.isra.17+0x20/0x50
[<ffffffff8110ff78>] tick_sched_timer+0x38/0x60
[<ffffffff81101cf3>] __hrtimer_run_queues+0xf3/0x2a0
[<ffffffff81102179>] hrtimer_interrupt+0x99/0x1a0
[<ffffffff8100d1dc>] xen_timer_interrupt+0x2c/0x170
[<ffffffff810e39ec>] __handle_irq_event_percpu+0x4c/0x1d0
[<ffffffff810e3b90>] handle_irq_event_percpu+0x20/0x50
[<ffffffff810e7407>] handle_percpu_irq+0x37/0x50
[<ffffffff810e3174>] generic_handle_irq+0x24/0x30
[<ffffffff8142dce8>] __evtchn_fifo_handle_events+0x168/0x180
[<ffffffff8142aec9>] __xen_evtchn_do_upcall+0x49/0x80
[<ffffffff8142cb4c>] xen_evtchn_do_upcall+0x2c/0x50
[<ffffffff81655c6e>] xen_do_hypervisor_callback+0x1e/0x40
DWARF2 unwinder stuck at xen_do_hypervisor_callback+0x1e/0x40
Leftover inexact backtrace:
<IRQ> <EOI> [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
[<ffffffff81115e63>] ? smp_call_function_many+0x203/0x260
[<ffffffff81073840>] ? leave_mm+0xc0/0xc0
[<ffffffff81115f26>] ? on_each_cpu+0x36/0x70
[<ffffffff81074078>] ? flush_tlb_kernel_range+0x38/0x60
[<ffffffff811a8c17>] ? __alloc_pages_nodemask+0x117/0xbf0
[<ffffffff811fd14a>] ? kmem_cache_alloc_node_trace+0xaa/0x4d0
[<ffffffff811df823>] ? __purge_vmap_area_lazy+0x313/0x390
[<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
[<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
[<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
[<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
[<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
[<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
[<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
[<ffffffff811504a2>] ? do_seccomp+0x112/0x670
[<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
[<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
[<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
[<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
The above comes in all at once. Then every second or two thereafter,
I see this:
NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
Modules linked in: ipt_REJECT nf_reject_ipv4 binfmt_misc veth nf_conntrack_ipv6
nf_defrag_ipv6 xt_pkttype ip6table_filter ip6_tables xt_nat xt_tcpudp ipt_MASQUERADE
nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables
xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay af_packet
iscsi_ibft iscsi_boot_sysfs intel_rapl sb_edac edac_core crct10dif_pclmul crc32_pclmul
crc32c_intel ghash_clmulni_intel joydev xen_fbfront drbg fb_sys_fops syscopyarea
sysfillrect xen_kbdfront ansi_cprng sysimgblt xen_netfront aesni_intel aes_x86_64 lrw
gf128mul glue_helper pcspkr ablk_helper cryptd nfsd auth_rpcgss nfs_acl lockd grace
sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac
scsi_dh_emc scsi_dh_alua scsi_mod autofs4
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
task: ffff881a44554ac0 ti: ffff8807b7d34000 task.ti: ffff8807b7d34000
RIP: e030:[<ffffffff810013ac>] [<ffffffff810013ac>]
xen_hypercall_sched_op+0xc/0x20
RSP: e02b:ffff8807b7d37c10 EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810013ac
RDX: 0000000000000000 RSI: ffff8807b7d37c30 RDI: 0000000000000003
RBP: 0000000000000071 R08: 0000000000000000 R09: ffff880191804908
R10: ffff880191804ab8 R11: 0000000000000206 R12: ffffffff8237c178
R13: 0000000000440000 R14: 0000000000000100 R15: 0000000000000000
FS: 00007ff9142bd700(0000) GS:ffff881dae200000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffedcb82f56 CR3: 0000001a1d860000 CR4: 0000000000040660
Stack:
0000000000000000 00000000fffffffa ffffffff8142bd40 0000007400000003
ffff8807b7d37c2c ffffffff00000001 0000000000000000 ffff881dae2120d0
ffffffff81015b07 00000003810d34e4 ffffffff8237c178 ffff881dae21afc0
Call Trace:
Inexact backtrace:
[<ffffffff8142bd40>] ? xen_poll_irq_timeout+0x40/0x50
[<ffffffff81015b07>] ? xen_qlock_wait+0x77/0x80
[<ffffffff810d3637>] ? __pv_queued_spin_lock_slowpath+0x227/0x260
[<ffffffff8119edb4>] ? queued_spin_lock_slowpath+0x7/0xa
[<ffffffff811df626>] ? __purge_vmap_area_lazy+0x116/0x390
[<ffffffff810ac942>] ? ___might_sleep+0xe2/0x120
[<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
[<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
[<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
[<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
[<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
[<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
[<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
[<ffffffff811504a2>] ? do_seccomp+0x112/0x670
[<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
[<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
[<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
[<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
Code: 41 53 48 c7 c0 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc 51 41 53 48 c7 c0 1d 00 00 00 0f 05 <41> 5b 59 c3
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51
After about 30 seconds or so, I note that there is a slight shift, in
that this line:
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
changes to something like:
CPU: 15 PID: 1357 Comm: sshd Tainted: G L 4.4.180-102-default #1
The above log group continues to log, every few seconds, forever,
until I kill the guest.
The physical host is not impacted. It remains up, alive, connected to
its networks, and functioning properly. The only output I get on the
physical host is a one-time report:
vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state
Steps I have taken:
1. I initially thought this might be a problem in openssh. There are
reports on the net about a vulnerability in openssh versions prior to
7.3 (42.3 is at 7.2p2) in which a long string can be sent to sshd from
the outside world and cause it to spin (and lock) out of control. I
disabled that version of sshd on the guest, and installed the (then)
latest version of openssh: 8.1p1. The problem persisted.
2. I have tried ifdown/ifup from within the guest to try to make the
network reconnect, to no avail.
3. I have tried to unplug and replug the guest network from the host,
to make the network reconnect, also to no avail.
4. Thinking that this might be related to recent reports of issues
with grant tables in the blkfront driver, I checked usage on the DomU
when it was spinning:
/usr/sbin/xen-diag gnttab_query_size 6
domid=6: nr_frames=15, max_nr_frames=32
So it doesn't seem to be related to that issue. (DomID was 6 because of
four crashes since the last physical host reboot, ugh.) I have adjusted
the physical host to 256 as a number of people online recommended, but
just did that this morning. I now see:
/usr/sbin/xen-diag gnttab_query_size 2
domid=2: nr_frames=14, max_nr_frames=256
but again the exhaustion issue doesn't *seem* to have happened here...
although I could be wrong.
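(For completeness, the adjustment referred to above is normally made on the host's Xen command line and, on newer Xen, can also be set per guest in the xl config; the option names are as documented upstream, so treat this as a sketch:)
# host side: /etc/default/grub, then regenerate grub.cfg with grub2-mkconfig
GRUB_CMDLINE_XEN="... gnttab_max_frames=256"
# guest side, in the xl .cfg (Xen 4.10 and later):
max_grant_frames=256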
Because of the nature of the problem, the Xen oncrash action isn't
triggered. The host can't tell that the guest has crashed, and it
really hasn't crashed, it's just spinning, eating up CPU. The only
thing I can do is destroy the guest, and recreate it. So where I am
now is I'm remotely polling the machine from distant lands, every 60
seconds, and having myself paged out every time there is a crash in
the hope I can try something else... but I am now out of something
elses to try. The guest in question is a high-profile, high-usage
guest for a client that expects 24/7 uptime... so this is, to me,
rather a serious problem.
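One thing that could make the forced restarts less of a dead end (assuming the xl dump-core subcommand shipped with Xen 4.12; the dump path below is only an example) is saving a core of the spinning guest before destroying it, so the stuck vCPUs can be examined offline later:
xl dump-core <domid> /var/lib/xen/dump/ghv1-hung.core   # snapshot the guest's memory and vCPU state
xl destroy <domid>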
I realize that the solution here may be "just upgrade the guest to
15.1"; however, I have two problems:
1. I cannot upgrade the guest until I have support from my customer's
staff who can address their software compatibility issues pertaining
to the differences in Python, PHP, etc., between 42.3 and 15.1... so
I'm stuck here for a while.
2. In the process of running a new 15.1 guest on yet a third,
different 15.1 host, I experienced a lockup on the guest there - which
had no log entries at all and may be unrelated; however, it, too, was
only running network/disk-intensive rsyncs at the time. I may need to
post a separate thread about that later; I'm not done taking debugging
steps there yet.
In short, I'm out of options. It seems to me that running a 42.3
guest on a 15.1 host should work, yet I am having these crashes.
Thank you in advance for any help/guidance/pointers/cluebats.
Glen