openSUSE Virtual
January 2020
- 6 participants
- 4 discussions
[opensuse-virtual] full cstate/cpufreq/cpupower support withOUT Xen 4.13 on kernel 5.4.14; WITH xen, none at all. bug or config?
by PGNet Dev 28 Jan '20
( I'd already posted this at xen-users; no traction to date )
I'm running linux kernel
lsb_release -rd
Description: openSUSE Leap 15.1
Release: 15.1
uname -rm
5.4.14-24.gfc4ea7a-default x86_64
dmesg | grep DMI:
[ 0.000000] DMI: Supermicro X10SAT/X10SAT, BIOS 3.0 05/26/2015
cat /proc/cpuinfo | grep "model name" | head -n 1
model name : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
kernel & xen are pkg-installed from my KernelStable and Virtualization-Xen repos @ OBS.
BIOS *is* set up for max cstate support.
Xeon E3-1220 does support intel_pstate driver.
Testing first,
(1) boot, NO XEN
pstate driver's init'd
dmesg | egrep -i "intel_pstate"
[ 6.132964] intel_pstate: Intel P-state driver initializing
pstate/cstate info
cat /sys/module/intel_idle/parameters/max_cstate
9
cd /sys/devices/system/cpu/cpu0/cpuidle
for state in state{0..9}
do echo c-$state `cat $state/name` `cat $state/latency`
done
c-state0 POLL 0
c-state1 C1 2
c-state2 C1E 10
c-state3 C3 33
c-state4 C6 133
c-state5 C7s 166
cat: state6/name: No such file or directory
cat: state6/latency: No such file or directory
c-state6
cat: state7/name: No such file or directory
cat: state7/latency: No such file or directory
c-state7
cat: state8/name: No such file or directory
cat: state8/latency: No such file or directory
c-state8
cat: state9/name: No such file or directory
cat: state9/latency: No such file or directory
c-state9
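The "No such file or directory" errors above come from looping over a fixed state{0..9} range when only state0 through state5 exist on this CPU. A minimal sketch that globs whichever state directories are present instead; the function name list_cstates and its optional directory argument are illustrative assumptions, and only the sysfs layout is taken from the output above:

```shell
#!/bin/sh
# list_cstates [DIR]: print "c-stateN NAME LATENCY" for every cpuidle state
# directory found under DIR (defaults to cpu0's cpuidle directory).
# Hypothetical helper name; not part of cpupower or any other tool.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state[0-9]*; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -d "$state" ] || continue
        printf 'c-%s %s %s\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/latency")"
    done
}
```

Calling list_cstates with no argument reads cpu0. The glob sorts lexically, so a hypothetical state10 would be listed right after state1.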
cpufreq scaling info's available,
cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 800 MHz - 3.50 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 800 MHz and 3.50 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 799 MHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
& scaling is in effect,
cat /proc/cpuinfo | grep MHz
cpu MHz : 798.106
cpu MHz : 798.129
cpu MHz : 798.964
cpu MHz : 798.154
(2) boot, WITH Xen 4.13
rpm -qa | grep -i xen | sort
grub2-x86_64-xen-2.04-lp151.6.5.noarch
xen-4.13.0_04-lp151.688.2.x86_64
xen-libs-4.13.0_04-lp151.688.2.x86_64
xen-tools-4.13.0_04-lp151.688.2.x86_64
Xen cmd line includes,
grep options= /boot/grub2/xen-4.13.0_04-lp151.688.cfg
[config.1]
options=dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M dom0_max_vcpus=4 cpufreq=xen cpuidle ucode=scan ...
intel_pstate support is now DISABLED for this cpu
xl dmesg | grep pstate
[ 6.851121] intel_pstate: CPU model not supported
c-states report,
xenpm get-cpuidle-states 0
All C-states allowed
cpu id : 0
total C-states : 6
idle time(ms) : 45780911
C0 : transition [ 3204855]
residency [ 160769 ms]
C1 : transition [ 9204]
residency [ 1018 ms]
C2 : transition [ 10181]
residency [ 2848 ms]
C3 : transition [ 22784]
residency [ 17236 ms]
C4 : transition [ 7181]
residency [ 11793 ms]
C5 : transition [ 3155504]
residency [ 45668846 ms]
pc2 : [ 1685 ms]
pc3 : [ 30695 ms]
cc3 : [ 16858 ms]
cc6 : [ 11640 ms]
cc7 : [ 45602872 ms]
NO cpupower frequency-info is available
cpupower frequency-info
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
CPUs which run at the same hardware frequency: Not Available
CPUs which need to have their frequency coordinated by software: Not Available
maximum transition latency: Cannot determine or is not supported.
Not Available
available cpufreq governors: Not Available
Unable to determine current policy
current CPU frequency: Unable to call hardware
current CPU frequency: Unable to call to kernel
boost state support:
Supported: no
Active: no
and scaling is NOT in effect
cat /proc/cpuinfo | grep MHz
cpu MHz : 3092.828
cpu MHz : 3092.828
cpu MHz : 3092.828
cpu MHz : 3092.828
attempt to add acpi-cpufreq module fails
lsmod | grep acpi_cpufreq
(empty)
find /lib/modules/ | grep acpi-cpu
/lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko
modprobe acpi-cpufreq
modprobe: ERROR: could not insert 'acpi_cpufreq': No such device
insmod /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko
insmod: ERROR: could not insert module /lib/modules/5.4.14-24.gfc4ea7a-default/kernel/drivers/cpufreq/acpi-cpufreq.ko: No such device
Is this a bug, or a config issue?
--
To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org
To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org
[opensuse-virtual] xen 4.13 + kernel 5.4.11 'APIC Error ... FATAL PAGE FAULT' on reboot? non-Xen reboot's ok.
by PGNet Dev 16 Jan '20
I've a recently upgraded (pkgs via zypper), running
Xen 4.13.0_04
server, on EFI hardware + Intel Xeon E3 CPU, with kernel
5.4.11-24.g2d02eb4-default
on
lsb_release -rd
Description: openSUSE Leap 15.1
Release: 15.1
It boots as always, with no issue
Welcome to GRUB!
Please press t to show the boot menu on this console
Xen 4.13.0_04-lp151.688 (c/s ) EFI loader
Using configuration file 'xen-4.13.0_04-lp151.688.cfg'
vmlinuz-5.4.11-24.g2d02eb4-default: 0x000000008b7c0000-0x000000008c04efb8
initrd-5.4.11-24.g2d02eb4-default: 0x000000008a4a5000-0x000000008b7bfe28
0x0000:0x00:0x19.0x0: ROM: 0x10000 bytes at 0x928a9018
0x0000:0x04:0x00.0x0: ROM: 0x8000 bytes at 0x928a0018
0x0000:0x10:0x00.0x0: ROM: 0x10800 bytes at 0x92885018
__ __
\ \/ /___ _ __
\ // _ \ '_ \
/ \ __/ | | |
/_/\_\___|_| |_|
_ _ _ _____ ___ ___ _ _ _ _ ____ _ __ ___ ___
| || | / |___ / / _ \ / _ \| || | | |_ __ / | ___|/ | / /_ ( _ ) ( _ )
| || |_ | | |_ \| | | | | | | | || |_ __| | '_ \| |___ \| || '_ \ / _ \ / _ \
|__ _|| |___) | |_| | | |_| |__ _|__| | |_) | |___) | || (_) | (_) | (_) |
|_|(_)_|____(_)___/___\___/ |_| |_| .__/|_|____/|_(_)___/ \___/ \___/
|_____| |_|
(XEN) [00000026c8dc8909] Xen version 4.13.0_04-lp151.688 (abuild(a)suse.de) (gcc (SUSE Linux) 9.2.1 20200109 [gcc-9-branch revision 280039]) debug=n Wed Jan 8 11:43:04 UTC 2020
(XEN) [00000026cbd609dc] Latest ChangeSet:
(XEN) [00000026cc9505ea] Bootloader: EFI
(XEN) [00000026cd46f20f] Command line: dom0=pvh dom0-iommu=map-reserved dom0_mem=4016M,max:4096M bootscrub=false dom0_max_vcpus=4 vga=gfx-1920x1080x16 com1=115200,8n1,pci console=com1,vga console_timestamps console_to_ring conring_size=64 sched=credit2 ucode=scan log_buf_len=16M loglvl=warning guest_loglvl=none/warning noreboot=false iommu=verbose sync_console=false
...
on exec of cmdline shutdown from shell,
shutdown -r now
the system DOES reboot, but first throws an APIC error -- only when running Xen; a reboot with no hypervisor has no probs
1st step, here's the current, relevant _log_ trace
...
[ OK ] Reached target Shutdown.
[ 343.932856] watchdog: watchdog0: watchdog did not stop!
[ 346.871303] watchdog: watchdog0: watchdog did not stop!
dracut Warning: Killing all remaining processes
mdadm: stopped /dev/md4
mdadm: stopped /dev/md3
mdadm: stopped /dev/md2
mdadm: stopped /dev/md1
mdadm: stopped /dev/md0
Rebooting.
[ 352.396918] reboot: Restarting system
(XEN) [2020-01-15 15:01:26] Hardware Dom0 shutdown: rebooting machine
(XEN) [2020-01-15 15:01:26] APIC error on CPU0: 40(00)
(XEN) [2020-01-15 15:01:26] ----[ Xen-4.13.0_04-lp151.688 x86_64 debug=n Not tainted ]----
(XEN) [2020-01-15 15:01:26] CPU: 0
(XEN) [2020-01-15 15:01:26] RIP: e008:[<0000000000000000>] 0000000000000000
(XEN) [2020-01-15 15:01:26] RFLAGS: 0000000000010202 CONTEXT: hypervisor
(XEN) [2020-01-15 15:01:26] rax: 0000000000000286 rbx: 0000000000000000 rcx: 0000000000000000
(XEN) [2020-01-15 15:01:26] rdx: 000000009e5ca7a0 rsi: 0000000000000000 rdi: 0000000000000000
(XEN) [2020-01-15 15:01:26] rbp: 0000000000000000 rsp: ffff83008ca2fa48 r8: ffff83008ca2fa90
(XEN) [2020-01-15 15:01:26] r9: ffff83008ca2fa80 r10: 0000000000000000 r11: 0000000000000000
(XEN) [2020-01-15 15:01:26] r12: 0000000000000000 r13: ffff83008ca2fb00 r14: ffff83008ca2ffff
(XEN) [2020-01-15 15:01:26] r15: 0000000000000000 cr0: 0000000080050033 cr4: 00000000001526e0
(XEN) [2020-01-15 15:01:26] cr3: 00000008492ed000 cr2: ffffffffeef3f286
(XEN) [2020-01-15 15:01:26] fsb: 0000000000000000 gsb: 0000000000000000 gss: 0000000000000000
(XEN) [2020-01-15 15:01:26] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [2020-01-15 15:01:26] Xen code around <0000000000000000> (0000000000000000) [fault on access]:
(XEN) [2020-01-15 15:01:26] -- -- -- -- -- -- -- -- <00> 80 00 f0 f3 ee 00 f0 c3 e2 00 f0 f3 ee 00 f0
(XEN) [2020-01-15 15:01:26] Xen stack trace from rsp=ffff83008ca2fa48:
(XEN) [2020-01-15 15:01:26] 000000009e5ca3c9 ffff82d08036681f ffff82d08036682b 0000000000000000
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff83008ca2fa88 0000000000000000 00000000001526e0
(XEN) [2020-01-15 15:01:26] ffff82d0802758cd 0000000000000286 0000000000000286 0000000000000000
(XEN) [2020-01-15 15:01:26] 000000009efe42f6 0000000000000000 0000000000000000 ffff83008ca2fb00
(XEN) [2020-01-15 15:01:26] ffff82d08036331b 0000000000152660 ffff82d0803636ae 0000000000000000
(XEN) [2020-01-15 15:01:26] ffff83008ca2fb48 0000000000000000 ffff82d080363688 000000008ca1f000
(XEN) [2020-01-15 15:01:26] ffff82d080937a98 000000fe00000000 ffff82d08029e41a 000000000000e008
(XEN) [2020-01-15 15:01:26] 0000000000000287 ffff830000000000 0000000000000000 0000000000000065
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029dd3c 000000008036682b 000082d08036681f
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08093dd00 0000000000000000 0000000000000000
(XEN) [2020-01-15 15:01:26] 0000000000000000 ffff82d08029de17 ffff82d08023a742 ffff82d0809378c8
(XEN) [2020-01-15 15:01:26] ffff82d08093dd00 ffff82d08027ff48 ffff82d080000000 ffff83008ca2fd98
(XEN) [2020-01-15 15:01:26] ffff82d0000000fb ffff82d08036681f ffff82d08036682b ffff82d08036681f
(XEN) [2020-01-15 15:01:27] ffff82d08036682b ffff82d08036681f ffff82d08036682b 0000000000000000
(XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2ffff
(XEN) [2020-01-15 15:01:27] 0000000000000000 ffff82d080366894 ffff82d08095e860 ffff830849340424
(XEN) [2020-01-15 15:01:27] ffff82d08095e820 ffff83008ca2fd98 ffff82d080823460 0000000000000002
(XEN) [2020-01-15 15:01:27] 0000000000000000 0000000000000000 0000000000000000 ffff83008ca2fd98
(XEN) [2020-01-15 15:01:27] 00000000000000c1 00000000000003f8 00000000000003fa ffff82d080823460
(XEN) [2020-01-15 15:01:27] 0000000000000004 000000fb00000000 ffff82d08024b590 000000000000e008
(XEN) [2020-01-15 15:01:27] Xen call trace:
(XEN) [2020-01-15 15:01:27] [<0000000000000000>] R 0000000000000000
(XEN) [2020-01-15 15:01:27] [<000000009e5ca3c9>] S 000000009e5ca3c9
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d0802758cd>] S arch/x86/flushtlb.c#pre_flush+0x3d/0x70
(XEN) [2020-01-15 15:01:27] [<ffff82d08036331b>] S arch/x86/efi/runtime.c#efi_rs_enter.part.0+0xfb/0x130
(XEN) [2020-01-15 15:01:27] [<ffff82d0803636ae>] S efi_reset_system+0x4e/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d080363688>] S efi_reset_system+0x28/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d08029e41a>] S smp_send_stop+0xba/0xc0
(XEN) [2020-01-15 15:01:27] [<ffff82d08029dd3c>] S machine_restart+0x1fc/0x2d0
(XEN) [2020-01-15 15:01:27] [<ffff82d08029de17>] S arch/x86/shutdown.c#__machine_restart+0x7/0x10
(XEN) [2020-01-15 15:01:27] [<ffff82d08023a742>] S smp_call_function_interrupt+0x52/0x90
(XEN) [2020-01-15 15:01:27] [<ffff82d08027ff48>] S do_IRQ+0x2d8/0x760
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08024b590>] S drivers/char/ns16550.c#ns16550_interrupt+0xc0/0xe0
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080280107>] S do_IRQ+0x497/0x760
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036681f>] S common_interrupt+0x8f/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d08036682b>] S common_interrupt+0x9b/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d080366894>] S common_interrupt+0x104/0x120
(XEN) [2020-01-15 15:01:27] [<ffff82d0802d74dd>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x25d/0x3c0
(XEN) [2020-01-15 15:01:27] [<ffff82d0802d74d8>] S arch/x86/cpu/mwait-idle.c#mwait_idle+0x258/0x3c0
(XEN) [2020-01-15 15:01:27] [<ffff82d08023cca9>] S common/tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN) [2020-01-15 15:01:27] [<ffff82d0802700ec>] S arch/x86/domain.c#idle_loop+0x8c/0xa0
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] Pagetable walk from ffffffffeef3f286:
(XEN) [2020-01-15 15:01:27] L4[0x1ff] = 0000000000000000 ffffffffffffffff
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] ****************************************
(XEN) [2020-01-15 15:01:27] Panic on CPU 0:
(XEN) [2020-01-15 15:01:27] FATAL PAGE FAULT
(XEN) [2020-01-15 15:01:27] [error_code=0002]
(XEN) [2020-01-15 15:01:27] Faulting linear address: ffffffffeef3f286
(XEN) [2020-01-15 15:01:27] ****************************************
(XEN) [2020-01-15 15:01:27]
(XEN) [2020-01-15 15:01:27] Reboot in five seconds...
...
Is this a known/fixable issue?
If more specific info is needed here, pls let me know what to provide.
[opensuse-virtual] 15.1 Xen DomU freezing under high network/disk load, recoverable with NMI trigger
by Glen 08 Jan '20
Dear openSUSE Team:
Earlier today I sent a request to the list about a 42.3 DomU crashing.
Olaf replied, and I've installed the new kernel, and I'll watch and
see. I'm very grateful for the help. I'm sorry to post a second
question, but I'm having a similar-but-different problem on a
different host and guest, and have reached an impasse.
A few weeks ago, I took a copy of our crashy 42.3 DomU guest, and
copied it to a new guest, just making a copy of the disk, and changing
the name and IP address and booting it on a different physical host.
I then did zypper dup from 42.3->15.0->15.1. This was intended as a
"test run", if you like, to predict how client software would react to
the upgrade. So now I have an upgraded *copy* of my machine, running
15.1. All patches applied. And it's running on a different host,
which was a fresh load of 15.1, also with all patches applied.
Linux host 4.12.14-lp151.28.36-default #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) x86_64 x86_64 x86_64 GNU/Linux
This guest has a problem as well, in that, under sustained high
network/disk loads, the guest freezes up completely. This happened
twice today - I can pretty much *make* it happen just by starting a
local rsync (i.e. on a crossover cable) of its main big data
partition (3TB).... about every other attempt to copy the entire
partition via rsync over ssh will freeze the guest. I get the same
annoyingly terse message on the physical host:
[92630.531549] vif vif-6-0 vif6.0: Guest Rx stalled
[92630.531613] br0: port 2(vif6.0) entered disabled state
but, unlike my 42.3 guest, this one gives *no* log outputs or data at
all on the guest. No BUG, no CPU lockup, no kernel traceback,
nothing. I left a high priority shell on the hvc0 console, which,
when the 42.3 guest had its problem, was still sort of responsive, and
I left "top -n 1; sleep 15" running in a while true loop on it... but
it was completely frozen. I could see the final top before the hang,
and there was nothing to suggest a problem. The guest just... hangs.
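The console loop described above can be sketched as a small function that also prints a timestamp before each snapshot, which makes it easier to pin down when the guest stopped responding. This is only an illustration: snapshot_loop and its argument order are hypothetical, and the command to run is passed in by the caller.

```shell
#!/bin/sh
# snapshot_loop INTERVAL COUNT CMD [ARGS...]
# Run CMD every INTERVAL seconds, COUNT times (COUNT=0 means forever),
# printing a timestamp before each run. Hypothetical helper name.
snapshot_loop() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date '+%F %T'
        "$@"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For the loop in the message, the call would be something like snapshot_loop 15 0 top -n 1, left running on the hvc0 console.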
Unlike the frozen 42.3 guest, which showed pretty much continuous
"run" state, the 15.1 guest seems to do the more-or-less "normal"
behavior in xentop - switching between "b" and "r" modes, and showing
normal utilization patterns. But the guest itself is stuck tight.
I have seen mentions about the grant frames issue, and I did apply the
higher value to the host and guests:
# xen-diag gnttab_query_size 0 # Domain-0
domid=0: nr_frames=1, max_nr_frames=64
# xen-diag gnttab_query_size 1 # Xenstore
domid=1: nr_frames=4, max_nr_frames=4
# xen-diag gnttab_query_size 6 # My guest
domid=6: nr_frames=17, max_nr_frames=256
but this is still happening.
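To see at a glance how close each domain is to its grant-frame limit, the xen-diag lines above can be reduced to a usage ratio. A minimal sketch, assuming the "domid=N: nr_frames=X, max_nr_frames=Y" format exactly as quoted above; gnttab_usage is a hypothetical helper name:

```shell
#!/bin/sh
# gnttab_usage: read `xen-diag gnttab_query_size` output lines on stdin and
# print "domid=N used/max percent%" for each line that matches.
# Hypothetical helper; parsing relies only on the line format quoted above.
gnttab_usage() {
    sed -n 's/^domid=\([0-9]*\): nr_frames=\([0-9]*\), max_nr_frames=\([0-9]*\).*$/\1 \2 \3/p' |
    while read -r domid used max; do
        # Avoid dividing by zero on an unexpected max of 0.
        [ "$max" -gt 0 ] || continue
        echo "domid=$domid $used/$max $((used * 100 / max))%"
    done
}
```

Usage would be along the lines of xen-diag gnttab_query_size 6 | gnttab_usage. For the values quoted above, the guest (domid 6) is at 17 of 256 frames at the time of the query, while Xenstore (domid 1) is at 4 of 4.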
Now here's the crazy part:
I sat around trying to poke at the frozen guest and try different
things before destroying it, and, skimming down my "xl" choices, I
found "xl trigger". I had already tried pausing and unpausing the
guest - that did nothing. But when I tried xl trigger (at random I
tried the first option, so: xl trigger 6 nmi), the guest CAME BACK
ONLINE! It said this:
Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
on the console. I also saw it in /var/log/messages, followed by:
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'xen' wd_now: 554c072567f2 wd_last: 54137c19cb3c mask: ffffffffffffffff
clocksource: 'tsc' cs_now: 2d696bb78816d4 cs_last: 2d6640097d695e mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
On the host in /var/log/messages, I saw:
[93760.637546] vif vif-6-0 vif6.0: Guest Rx ready
[93760.637595] br0: port 2(vif6.0) entered blocking state
[93760.637598] br0: port 2(vif6.0) entered forwarding state
And, apart from the rsync/sshd processes (which I suspect the remote
side had given up), everything else came right back online. MySQL,
for example, was still running on the guest without issue, in fact
apart from the log entries I cite above, there was no indication that
the machine had even been broken. The 5- and 10-minute load averages
were way up in the 30s... but everything else was fine.
Prior to the freeze, the guest was continuously showing a load average
of about 3.0 - with the rsync and sshd processes in run mode, and
that's it - just as I'd expect. The guest is provisioned thusly:
name="gggv"
description="gggv"
uuid="13289776-1c74-9ade-4242-8f7453249832"
memory=90112
maxmem=90112
vcpus=26
cpus="4-31"
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
extra="elevator=noop"
disk=[
'/b/xen/gggv/gggv.root,raw,xvda1,w',
'/b/xen/gggv/gggv.swap,raw,xvda2,w',
'/b/xen/gggv/gggv.xa,raw,xvdb1,w',
]
vif=[
'mac=00:16:3f:04:05:41,bridge=br0',
'mac=00:16:3f:04:05:42,bridge=br1',
]
vfb=['type=vnc,vncunused=1']
and is also the only guest running on its host. The host has:
GRUB_CMDLINE_XEN="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256"
and is in every other respect an essentially fresh 15.1 load.
I'm thinking that this is a different problem than my 42.3 guest
problem, but I don't know what to do with it.
My next move was to make sure my hardware (and data, and OS!) were
okay. So I moved the root filesystem of my upgraded guest aside, and
did a fresh load of 15.1 onto a new root filesystem. When I use
*that* to boot my guest, it seems to be stable. High network activity
does not appear to stop it - I've done 5 or 6 copies of my huge
filesystem in that mode without issue. Of course I'd like to do more
cycles to be sure, but it seems stable compared to when the upgraded
root is in place, when I can make the machine freeze up on almost
every (or every other) copy attempt.
The only thing I can think of that is different here, then, is that
the guest has been zypper dup'ed over time all the way back from 13.2
(the last time it was built fresh), so maybe it has inherited some old
garbage that could be causing this. It seems to
me that a zypper dup'ed guest "should" work properly, especially when
it is the same version and kernel as the physical host; but, again
(sorry) I have these freezes.
So just for laughs, I ran an lsmod in both modes, and sorted and diffed them:
The "clean" guest (which appears to be stable), has these four kernel
modules not present on the upgraded guest:
iptable_raw
nf_conntrack_ftp
nf_nat_ftp
xt_CT
The "dup'ed" guest (which seems to be crashable on a large local
rsync) has these modules not present on a clean install:
auth_rpcgss
br_netfilter
bridge
grace
intel_rapl
ipt_MASQUERADE
llc
lockd
nf_conntrack_netlink
nf_nat_masquerade_ipv4
nfnetlink
nfs_acl
nfsd
overlay
sb_edac
stp
sunrpc
veth
xfrm_algo
xfrm_user
xt_addrtype
xt_nat
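The sort-and-diff comparison of the two lsmod outputs can be reproduced roughly as follows. This is a sketch under assumptions: module_diff is a hypothetical helper, and it expects each guest's lsmod output to have been saved to a file first (for example with lsmod > clean.lsmod on one guest and lsmod > dup.lsmod on the other; the file names are illustrative).

```shell
#!/bin/sh
# module_diff FILE_A FILE_B: compare two saved `lsmod` outputs and list the
# module names that appear in only one of them. Hypothetical helper name.
module_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # Column 1 of lsmod is the module name; line 1 is the header.
    awk 'NR > 1 { print $1 }' "$1" | sort > "$a"
    awk 'NR > 1 { print $1 }' "$2" | sort > "$b"
    echo "only in $1:"
    comm -23 "$a" "$b"    # lines unique to the first file
    echo "only in $2:"
    comm -13 "$a" "$b"    # lines unique to the second file
    rm -f "$a" "$b"
}
```

Comparing names only, rather than whole lsmod lines, keeps differing sizes and use counts from showing up as false differences.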
Both guests share these additional sysctl.conf settings:
kernel.panic = 5
vm.panic_on_oom = 2
vm.swappiness = 0
net.ipv6.conf.all.autoconf = 0
net.ipv6.conf.default.autoconf = 0
net.ipv6.conf.eth0.autoconf = 0
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 0
The dup'ed guest has these additional sysctl.conf settings:
net.ipv4.tcp_tw_recycle = 0
net.core.netdev_max_backlog=300000
net.core.somaxconn = 2048
net.core.rmem_max=67108864
net.core.wmem_max=67108864
net.ipv4.ip_local_port_range=15000 65000
net.ipv4.tcp_sack=0
net.ipv4.tcp_rmem=4096 87380 67108864
net.ipv4.tcp_wmem=4096 65536 67108864
all of which have, more or less, worked well in the past (when
everything was on 42.3) and may or may not be relevant here.
I'm sorry, I feel like I'm missing something obvious here, but I can't
see it. I would be grateful for any guidance or insights into this.
Yes, in addition to trying to upgrade my client in place to 15.1, I
could just build a new guest by hand, but that would be even more
time-consuming and seems like it should not be necessary. If I might
quote from the kernel, "Dazed and confused, but trying to continue" is
exactly how I'm feeling here. Why could this guest be hanging? Why
does an NMI bring it back? What should I do next? Anything anyone
would be willing to point me to or suggest would be gratefully
appreciated.
Glen
07 Jan '20
Greetings all:
I have a number of Xen hosts, and Xen guests on those hosts, all of
which have been running reliably for users under 42.3 (and earlier
42.x versions) forever. Up until recently all hosts and guests were
at 42.3, with all normal zypper updates applied, and running fine.
Recently, the time came to upgrade to 15.1. I proceeded by upgrading
the physical hosts to 15.1 first. Following that step, two of my
largest and most high-volume 42.3 guests - on two entirely different
physical hosts - started crashing every few days. The largest one
crashes the most frequently, so I'll focus on that.
The physical host is a Dell R520 with (Xen showing) 32 CPUs and 128GB of RAM.
Linux php1 4.12.14-lp151.28.32-default #1 SMP Wed Nov 13 07:50:15 UTC 2019 (6e1aaad) x86_64 x86_64 x86_64 GNU/Linux
(XEN) Xen version 4.12.1_04-lp151.2.6 (abuild(a)suse.de) (gcc (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Tue Nov 5 15:20:06 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4096M dom0_max_vcpus=4 dom0_vcpus_pin
The guest is the only guest on this host. (For legacy reasons, it
uses physical partitions on the host directly, rather than file-backed
storage, but I don't feel like that should be an issue...)
name="ghv1"
description="ghv1"
uuid="c77f49c6-1f72-9ade-4242-8f18e72cbb32"
memory=124000
maxmem=124000
vcpus=24
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
extra="elevator=noop"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
disk=[
'/dev/sda3,,xvda1,w',
'/dev/sda5,,xvda2,w',
'/dev/sda6,,xvda3,w',
'/dev/sdb1,,xvdb1,w',
]
vif=[
'mac=00:16:3e:75:92:4a,bridge=br0',
'mac=00:16:3e:75:92:4b,bridge=br1',
]
vfb=['type=vnc,vncunused=1']
It runs:
Linux ghv1 4.4.180-102-default #1 SMP Mon Jun 17 13:11:23 UTC 2019 (7cfa20a) x86_64 x86_64 x86_64 GNU/Linux
A typical xentop looks like this:
xentop - 07:13:03 Xen 4.12.1_04-lp151.2.6
3 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 134171184k total, 132922412k used, 1248772k free CPUs: 32 @ 2100MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR  VBD_RSECT  VBD_WSECT SSID
  Domain-0 -----r        607   12.9    4194304    3.1   no limit       n/a     4    0        0        0    0        0        0        0          0          0    0
      ghv1 -----r      18351  246.5  126976000   94.6  126977024      94.6    24    2   319108  3240011    4        0  1132578   205040   31572906    8389002    0
  Xenstore --b---          0    0.0      32760    0.0    1341440       1.0     1    0        0        0    0        0        0        0          0          0    0
This guest is high volume. It runs web servers, mail list servers,
databases, docker containers, and is regularly and constantly backed
up via rsync over ssh. It is still at 42.3. As mentioned above, when
its host was also at 42.3, it ran flawlessly. Only after upgrading
the host to 15.1 did these problems start.
What happens is this:
After between 2 and 10 days of uptime, the guest will start to
malfunction, with the following symptoms:
1. All network interfaces (there are two, one main, and one local
192.168.x.x) will disconnect.
2. Guest will exhibit a number of sshd processes apparently running at
high CPU. These processes cannot be killed.
3. Guest console will be filled with messages like this:
kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
These messages print 2-3 times in groups every 1-2 seconds. There is
no pattern to the CPU IDs, all CPUs appear to be involved.
4. It will become impossible to log in to the guest console.
5. If I already have a high-priority shell logged in on the console, I
can run some commands, (like sync), but I cannot cause the guest to
shut down (init 0, for example, hangs the console, but the guest does
not exit.) I can issue kill commands as hinted above, but they are
ignored.
6. xl shutdown is also ineffective. I must xl destroy the guest and
re-create it.
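The observation in item 3, that the soft-lockup messages show no pattern in CPU IDs, can be checked by counting reports per CPU from a saved log. A minimal sketch, assuming the message format quoted in item 3; lockups_per_cpu is a hypothetical helper name:

```shell
#!/bin/sh
# lockups_per_cpu: read kernel log lines on stdin and print how many
# "BUG: soft lockup" reports were logged for each CPU, ordered by CPU number.
# Hypothetical helper; it matches only the message format quoted above.
lockups_per_cpu() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9]*\) stuck.*/\1/p' |
    sort -n |
    uniq -c |
    awk '{ print "CPU" $2 ": " $1 }'
}
```

It could be fed from the guest's log, for example lockups_per_cpu < /var/log/messages. An even spread across CPUs would support the "no pattern" reading, while a single dominant CPU would be worth reporting.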
The guest logs show things like the following (I've removed the
"kernel:" prefix and timestamps just to make this clearer):
INFO: rcu_sched self-detected stall on CPU
8-...: (15000 ticks this GP) idle=b99/140000000000001/0 softirq=12292658/12292658 fqs=13805
(t=15001 jiffies g=8219341 c=8219340 q=139284)
Task dump for CPU 8:
sshd R running task 0 886 1 0x0000008c
ffffffff81e79100 ffffffff810f10c5 ffff881dae01b300 ffffffff81e79100
0000000000000000 ffffffff81f67e60 ffffffff810f8575 ffffffff81105d2a
ffff88125e810280 ffff881dae003d40 0000000000000008 ffff881dae003d08
Call Trace:
[<ffffffff8101b0c9>] dump_trace+0x59/0x350
[<ffffffff8101b4ba>] show_stack_log_lvl+0xfa/0x180
[<ffffffff8101c2b1>] show_stack+0x21/0x40
[<ffffffff810f10c5>] rcu_dump_cpu_stacks+0x75/0xa0
[<ffffffff810f8575>] rcu_check_callbacks+0x535/0x7f0
[<ffffffff811010c2>] update_process_times+0x32/0x60
[<ffffffff8110fd00>] tick_sched_handle.isra.17+0x20/0x50
[<ffffffff8110ff78>] tick_sched_timer+0x38/0x60
[<ffffffff81101cf3>] __hrtimer_run_queues+0xf3/0x2a0
[<ffffffff81102179>] hrtimer_interrupt+0x99/0x1a0
[<ffffffff8100d1dc>] xen_timer_interrupt+0x2c/0x170
[<ffffffff810e39ec>] __handle_irq_event_percpu+0x4c/0x1d0
[<ffffffff810e3b90>] handle_irq_event_percpu+0x20/0x50
[<ffffffff810e7407>] handle_percpu_irq+0x37/0x50
[<ffffffff810e3174>] generic_handle_irq+0x24/0x30
[<ffffffff8142dce8>] __evtchn_fifo_handle_events+0x168/0x180
[<ffffffff8142aec9>] __xen_evtchn_do_upcall+0x49/0x80
[<ffffffff8142cb4c>] xen_evtchn_do_upcall+0x2c/0x50
[<ffffffff81655c6e>] xen_do_hypervisor_callback+0x1e/0x40
DWARF2 unwinder stuck at xen_do_hypervisor_callback+0x1e/0x40
Leftover inexact backtrace:
<IRQ> <EOI> [<ffffffff81073840>] ? leave_mm+0xc0/0xc0
[<ffffffff81115e63>] ? smp_call_function_many+0x203/0x260
[<ffffffff81073840>] ? leave_mm+0xc0/0xc0
[<ffffffff81115f26>] ? on_each_cpu+0x36/0x70
[<ffffffff81074078>] ? flush_tlb_kernel_range+0x38/0x60
[<ffffffff811a8c17>] ? __alloc_pages_nodemask+0x117/0xbf0
[<ffffffff811fd14a>] ? kmem_cache_alloc_node_trace+0xaa/0x4d0
[<ffffffff811df823>] ? __purge_vmap_area_lazy+0x313/0x390
[<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
[<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
[<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
[<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
[<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
[<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
[<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
[<ffffffff811504a2>] ? do_seccomp+0x112/0x670
[<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
[<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
[<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
[<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
The above comes in all at once. Then every second or two thereafter,
I see this:
NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
Modules linked in: ipt_REJECT nf_reject_ipv4 binfmt_misc veth nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype ip6table_filter ip6_tables xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc overlay af_packet iscsi_ibft iscsi_boot_sysfs intel_rapl sb_edac edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev xen_fbfront drbg fb_sys_fops syscopyarea sysfillrect xen_kbdfront ansi_cprng sysimgblt xen_netfront aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper cryptd nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache xen_blkfront sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
task: ffff881a44554ac0 ti: ffff8807b7d34000 task.ti: ffff8807b7d34000
RIP: e030:[<ffffffff810013ac>] [<ffffffff810013ac>] xen_hypercall_sched_op+0xc/0x20
RSP: e02b:ffff8807b7d37c10 EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810013ac
RDX: 0000000000000000 RSI: ffff8807b7d37c30 RDI: 0000000000000003
RBP: 0000000000000071 R08: 0000000000000000 R09: ffff880191804908
R10: ffff880191804ab8 R11: 0000000000000206 R12: ffffffff8237c178
R13: 0000000000440000 R14: 0000000000000100 R15: 0000000000000000
FS: 00007ff9142bd700(0000) GS:ffff881dae200000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffedcb82f56 CR3: 0000001a1d860000 CR4: 0000000000040660
Stack:
0000000000000000 00000000fffffffa ffffffff8142bd40 0000007400000003
ffff8807b7d37c2c ffffffff00000001 0000000000000000 ffff881dae2120d0
ffffffff81015b07 00000003810d34e4 ffffffff8237c178 ffff881dae21afc0
Call Trace:
Inexact backtrace:
[<ffffffff8142bd40>] ? xen_poll_irq_timeout+0x40/0x50
[<ffffffff81015b07>] ? xen_qlock_wait+0x77/0x80
[<ffffffff810d3637>] ? __pv_queued_spin_lock_slowpath+0x227/0x260
[<ffffffff8119edb4>] ? queued_spin_lock_slowpath+0x7/0xa
[<ffffffff811df626>] ? __purge_vmap_area_lazy+0x116/0x390
[<ffffffff810ac942>] ? ___might_sleep+0xe2/0x120
[<ffffffff811df9c3>] ? vm_unmap_aliases+0x123/0x140
[<ffffffff8106f127>] ? change_page_attr_set_clr+0xc7/0x420
[<ffffffff8107000d>] ? set_memory_ro+0x2d/0x40
[<ffffffff811836c1>] ? bpf_prog_select_runtime+0x21/0xa0
[<ffffffff81568e5b>] ? bpf_prepare_filter+0x58b/0x5d0
[<ffffffff81150080>] ? proc_watchdog_cpumask+0xd0/0xd0
[<ffffffff8156900e>] ? bpf_prog_create_from_user+0xce/0x110
[<ffffffff811504a2>] ? do_seccomp+0x112/0x670
[<ffffffff812bfb12>] ? security_task_prctl+0x52/0x90
[<ffffffff8109ca39>] ? SyS_prctl+0x539/0x5e0
[<ffffffff81081309>] ? syscall_slow_exit_work+0x39/0xcc
[<ffffffff81652d25>] ? entry_SYSCALL_64_fastpath+0x24/0xed
Code: 41 53 48 c7 c0 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc 51 41 53 48 c7 c0 1d 00 00 00 0f 05 <41>
5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51
After about 30 seconds or so, I note that there is a slight shift, in
that this line:
CPU: 16 PID: 1303 Comm: sshd Not tainted 4.4.180-102-default #1
changes to something like:
CPU: 15 PID: 1357 Comm: sshd Tainted: G L 4.4.180-102-default #1
The above log group continues to log, every few seconds, forever,
until I kill the guest.
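The repeating soft-lockup line above carries the CPU number, the stuck duration, and the process name and PID, which is enough to track how the lockup moves between CPUs and processes over time. A minimal Python sketch of pulling those fields out of a console log line follows. The function name and the regular expression are illustrative assumptions, not part of any kernel or Xen tooling, and the pattern is only matched against the message format shown in this post.

```python
import re

# Matches lines such as:
#   NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s! [sshd:1303]
# (hypothetical helper pattern, written against the format quoted in this post)
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return a dict with cpu, seconds, comm and pid, or None if the line is not a soft-lockup report."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("secs")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
    }
```

Feeding each line of the captured guest console through `parse_soft_lockup` and keeping the non-`None` results would give a timeline of which CPU and which sshd PID was reported stuck at each point.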
The physical host is not impacted. It remains up, alive, connected to
its networks, and functioning properly. The only output I get on the
physical host is a one-time report:
vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state
Steps I have taken:
1. I initially thought this might be a problem in openssh. There are
reports on the net about a vulnerability in openssh versions prior to
7.3 (42.3 is at 7.2p2) in which a long string can be sent to sshd from
the outside world and cause it to spin (and lock) out of control. I
disabled that version of sshd on the guest, and installed the (then)
latest version of openssh: 8.1p1. The problem persisted.
2. I have tried ifdown/ifup from within the guest to try to make the
network reconnect, to no avail.
3. I have tried to unplug and replug the guest network from the host,
to make the network reconnect, also to no avail.
4. Thinking that this might be related to recent reports of issues
with grant tables in the blkfront driver, I checked usage on the DomU
when it was spinning:
/usr/sbin/xen-diag gnttab_query_size 6
domid=6: nr_frames=15, max_nr_frames=32
So it doesn't seem to be related to that issue. (DomID was 6 because
there had been four crashes since the last physical host reboot, ugh.)
I have raised the grant table limit on the physical host to 256
frames, as a number of people online recommended, but I only did that
this morning. I now see:
/usr/sbin/xen-diag gnttab_query_size 2
domid=2: nr_frames=14, max_nr_frames=256
but again the exhaustion issue doesn't *seem* to have happened here...
although I could be wrong.
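To make the grant-table check in step 4 easier to repeat, here is a minimal Python sketch that parses one line of `xen-diag gnttab_query_size` output, in the format quoted above, and reports how much of the frame limit is in use. The function name is a hypothetical helper, and the sketch assumes the output format is exactly as shown in this post.

```python
import re

# Format as quoted in this post: "domid=6: nr_frames=15, max_nr_frames=32"
GNTTAB_RE = re.compile(r"domid=(\d+): nr_frames=(\d+), max_nr_frames=(\d+)")


def gnttab_usage(line):
    """Return (domid, nr_frames, max_nr_frames, fraction_used) for one output line."""
    match = GNTTAB_RE.search(line)
    if match is None:
        raise ValueError("unrecognized gnttab_query_size output: %r" % line)
    domid, used, limit = (int(value) for value in match.groups())
    return domid, used, limit, used / limit
```

With the two outputs above, this gives 15 of 32 frames (about 47%) before the change and 14 of 256 (about 5%) after it, which is consistent with the frames not having been exhausted at the moment of the query.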
Because of the nature of the problem, the Xen oncrash action isn't
triggered. The host can't tell that the guest has crashed, and it
really hasn't crashed, it's just spinning, eating up CPU. The only
thing I can do is destroy the guest, and recreate it. So where I am
now is I'm remotely polling the machine from distant lands, every 60
seconds, and having myself paged out every time there is a crash in
the hope I can try something else... but I am now out of something
elses to try. The guest in question is a high-profile, high-usage
guest for a client that expects 24/7 uptime... so this is, to me,
rather a serious problem.
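The 60-second remote poll described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it treats a failed TCP connection to a chosen port (for example the guest's sshd port) as "down", and the function names, the `alert` callback, and the `max_checks` parameter are all hypothetical. A real setup would replace `alert` with whatever paging mechanism is in use.

```python
import socket
import time


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port, interval=60.0, alert=print, max_checks=None):
    """Poll host:port every interval seconds and call alert() on each failed check.

    Runs forever when max_checks is None; otherwise stops after that many checks.
    """
    done = 0
    while max_checks is None or done < max_checks:
        if not port_is_open(host, port):
            alert("%s:%d unreachable" % (host, port))
        done += 1
        if max_checks is None or done < max_checks:
            time.sleep(interval)
```

A plain connect test only shows that the port accepts connections. It would catch a guest whose network has stalled, but not necessarily one whose network is up while a service behind the port is hung.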
I realize that the solution here may be "just upgrade the guest to
15.1"; however, I have two problems:
1. I cannot upgrade the guest until I have support from my customer's
staff who can address their software compatibility issues pertaining
to the differences in Python, PHP, etc., between 42.3 and 15.1... so
I'm stuck here for a while.
2. In the process of running a new 15.1 guest on yet a third,
different 15.1 host, I experienced a lockup on the guest there - which
had no log entries at all and may be unrelated; however, it, too, was
only running network/disk-intensive rsyncs at the time. I may need to
post a separate thread about that later; I'm not done taking debugging
steps there yet.
In short, I'm out of options. It seems to me that running a 42.3
guest on a 15.1 host should work, yet I am having these crashes.
Thank you in advance for any help/guidance/pointers/cluebats.
Glen
--
To unsubscribe, e-mail: opensuse-virtual+unsubscribe(a)opensuse.org
To contact the owner, e-mail: opensuse-virtual+owner(a)opensuse.org