[Bug 1202541] New: KVM/libvirt bug on download.o.o VM host?
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541

            Bug ID: 1202541
           Summary: KVM/libvirt bug on download.o.o VM host?
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: All
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: KVM
          Assignee: kvm-bugs@suse.de
          Reporter: bwiedemann@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: Development
           Blocker: ---

Created attachment 860917
  --> http://bugzilla.opensuse.org/attachment.cgi?id=860917&action=edit
pontifex messages

Tonight we had some hours of outage of download.opensuse.org.
While the last messages were about OOM, a closer look made me think that some bug in our live-migration or KVM layer might be to blame.

download.o.o runs as a VM "pontifex2" on a KVM cluster with shared FC block storage.
For Thursday maintenance we have scripts to live-migrate all VMs to an empty host before we upgrade+reboot the host.

Yesterday, atreju6 was rebooted at 09:21 UTC.
Then it took some minutes for the next host to be evacuated...

/var/log/libvirt/qemu/pontifex2.log shows
2022-08-18 09:38:40.092+0000: starting up libvirt version: 5.1.0, qemu version: 3.1.1SUSE Linux Enterprise 12, kernel: 4.12.14-122.130-default
2022-08-18 09:38:40.092+0000: Domain id=27 is tainted: host-cpu

The pontifex2 log has
2022-08-18T09:40:59 general protection fault, probably for non-canonical address 0x17ffffc0000010: 0000 [#1] PREEMPT SMP NOPTI
and from there it kept throwing 292 backtraces until it panicked tonight.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
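Editorial aside on the "non-canonical address" wording in the fault message above: on x86-64, a virtual address is canonical when its unused upper bits are all copies of the highest implemented bit. The sketch below assumes 48-bit virtual addresses (4-level paging); the report itself does not state the guest's paging mode, and the function name is made up for illustration.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits-1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging); with 5-level paging
    it would be 57.
    """
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# The address from the fault message, next to a typical kernel pointer.
for addr in (0x17ffffc0000010, 0xffff8bc91c95dd50):
    print(hex(addr), "canonical" if is_canonical(addr) else "non-canonical")
```

A pointer that fails this check cannot be dereferenced at all, which is why the CPU raises a general protection fault rather than a page fault.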
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c1

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|KVM                         |KVM
            Version|Current                     |Leap 15.4
            Product|openSUSE Tumbleweed         |openSUSE Distribution

--- Comment #1 from Bernhard Wiedemann <bwiedemann@suse.com> ---
Note: the KVM host OS is 12-SP5.
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c3

Claudio Fontana <claudio.fontana@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bwiedemann@suse.com,
                   |                            |claudio.fontana@suse.com
              Flags|                            |needinfo?(bwiedemann@suse.c
                   |                            |om)

--- Comment #3 from Claudio Fontana <claudio.fontana@suse.com> ---
Hello,

at present this bug is not actionable. Please fill in the hardware architecture field, provide the specific action performed in terms of libvirt/qemu commands, and provide the libvirt/qemu logs as per:

https://doc.opensuse.org/documentation/leap/virtualization/html/book-virtual...
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c4

--- Comment #4 from Claudio Fontana <claudio.fontana@suse.com> ---
"2022-08-18 09:38:40.092+0000: starting up libvirt version: 5.1.0"

This bug is filed under Leap 15.4, are you sure this is correct and intended?
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c5

James Fehlig <jfehlig@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jfehlig@suse.com

--- Comment #5 from James Fehlig <jfehlig@suse.com> ---
(In reply to Claudio Fontana from comment #4)
> "2022-08-18 09:38:40.092+0000: starting up libvirt version: 5.1.0"
> This bug is filed under Leap 15.4, are you sure this is correct and intended?

See comment #1. The host OS is SLES12 SP5.
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c7

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(bwiedemann@suse.c |
                   |om)                         |

--- Comment #7 from Bernhard Wiedemann <bwiedemann@suse.com> ---
journalctl --unit libvirtd
is rather boring

-- Logs begin at Thu 2022-08-18 09:21:33 UTC, end at Fri 2022-08-19 14:12:02 UTC. --
Aug 18 09:24:24 atreju6-suse systemd[1]: Starting Virtualization daemon...
Aug 18 09:24:24 atreju6-suse systemd[1]: Started Virtualization daemon.
Aug 18 09:26:00 atreju6-suse libvirtd[29544]: libvirt version: 5.1.0
Aug 18 09:26:00 atreju6-suse libvirtd[29544]: hostname: atreju6-suse
Aug 18 09:38:40 atreju6-suse libvirtd[29544]: Domain id=27 name='pontifex2' uuid=6afc4ca3-0351-4dc3-9757-070f539de32b is tainted: host-cpu
Aug 18 09:57:39 atreju6-suse libvirtd[29544]: Domain id=28 name='progress' uuid=822517bf-33d3-4871-9d7d-ac7f2d0ccf1f is tainted: host-cpu
Aug 18 09:58:04 atreju6-suse libvirtd[29544]: Domain id=29 name='pagure01' uuid=9d1290ab-db8e-45c7-ab06-418ca07bd584 is tainted: host-cpu

and one such taint entry for most of the VMs on the host.
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c8

Claudio Fontana <claudio.fontana@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(bwiedemann@suse.c
                   |                            |om)

--- Comment #8 from Claudio Fontana <claudio.fontana@suse.com> ---
I suppose the following is the part of the logs that you are interested in (I have to guess since you don't mention it).

Since you suspect a problem in the host with migration, we'd need to have the corresponding host logs, including the libvirt logs, and if this is a production server, my suggestion would be to refile this as a SLES 12 SP5 issue.

But from what I see here, the guest encounters a general protection fault in "make_kuid" during "filename_lookup", in a low memory condition with OOMed processes. For now I don't see an indication that something went wrong with the migration.

2022-08-18T10:31:31.125351+00:00 pontifex2 kernel: [ 5236.588301][T25754] general protection fault, probably for non-canonical address 0xb110bc3ba3d49fbc: 0000 [#12] PREEMPT SMP NOPTI
2022-08-18T10:31:31.125398+00:00 pontifex2 kernel: [ 5236.591945][T25754] CPU: 5 PID: 25754 Comm: rsyncd Tainted: G D 5.14.21-150400.24.18-default #1 SLE15-SP4 695ab7a8fc20f5ddb345280570966cd1eb06d469
2022-08-18T10:31:31.125409+00:00 pontifex2 kernel: [ 5236.596511][T25754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
2022-08-18T10:31:31.125412+00:00 pontifex2 kernel: [ 5236.600209][T25754] RIP: 0010:__d_lookup_rcu+0x67/0x190
2022-08-18T10:31:31.125415+00:00 pontifex2 kernel: [ 5236.601560][T25754] Code: e8 49 c1 e9 20 48 8d 04 c2 45 89 ce 41 83 e6 07 48 8b 18 48 83 e3 fe 75 11 e9 92 00 00 00 48 8b 1b 48 85 db 0f 84 86 00 00 00 <8b> 6b fc 4c 3b 63 10 75 eb 48 83 7b 08 00 74 e4 83 e5 fe 41 f6 04
2022-08-18T10:31:31.125417+00:00 pontifex2 kernel: [ 5236.607438][T25754] RSP: 0018:ffffb6dc8c657ae0 EFLAGS: 00010286
2022-08-18T10:31:31.125418+00:00 pontifex2 kernel: [ 5236.609482][T25754] RAX: ffff8bc91c95dd50 RBX: b110bc3ba3d49fc0 RCX: 0000000000000009
2022-08-18T10:31:31.125420+00:00 pontifex2 kernel: [ 5236.612927][T25754] RDX: ffff8bc91b680000 RSI: ffffb6dc8c657c20 RDI: ffff8bb59caf7d40
2022-08-18T10:31:31.125422+00:00 pontifex2 kernel: [ 5236.615312][T25754] RBP: 00000000d410fd7a R08: ffffb6dc8c657c20 R09: 0000000000000038
2022-08-18T10:31:31.125423+00:00 pontifex2 kernel: [ 5236.617599][T25754] R10: ffff8bb24603a050 R11: 0000003800000000 R12: ffff8bb59caf7d40
2022-08-18T10:31:31.125427+00:00 pontifex2 kernel: [ 5236.624006][T25754] R13: 000000384b7755c2 R14: 0000000000000000 R15: ffffb6dc8c657d50
2022-08-18T10:31:31.125428+00:00 pontifex2 kernel: [ 5236.627676][T25754] FS: 00007fa500e97740(0000) GS:ffff8bc91fb40000(0000) knlGS:0000000000000000
2022-08-18T10:31:31.125430+00:00 pontifex2 kernel: [ 5236.629963][T25754] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-18T10:31:31.125431+00:00 pontifex2 kernel: [ 5236.631707][T25754] CR2: 0000556e9bb18fd0 CR3: 00000001cee98000 CR4: 00000000000406e0
2022-08-18T10:31:31.125433+00:00 pontifex2 kernel: [ 5236.633849][T25754] Call Trace:
2022-08-18T10:31:31.125435+00:00 pontifex2 kernel: [ 5236.634763][T25754] <TASK>
2022-08-18T10:31:31.125436+00:00 pontifex2 kernel: [ 5236.635767][T25754] ? make_kuid+0xf/0x20
2022-08-18T10:31:31.125439+00:00 pontifex2 kernel: [ 5236.636949][T25754] ? generic_permission+0x92/0x210
2022-08-18T10:31:31.125440+00:00 pontifex2 kernel: [ 5236.638620][T25754] lookup_fast+0x45/0x150
2022-08-18T10:31:31.125442+00:00 pontifex2 kernel: [ 5236.639729][T25754] walk_component+0x40/0x1a0
2022-08-18T10:31:31.125443+00:00 pontifex2 kernel: [ 5236.641185][T25754] ? path_init+0x5a/0x370
2022-08-18T10:31:31.125445+00:00 pontifex2 kernel: [ 5236.642218][T25754] path_lookupat+0x69/0x140
2022-08-18T10:31:31.125447+00:00 pontifex2 kernel: [ 5236.643875][T25754] ? make_kuid+0xf/0x20
2022-08-18T10:31:31.125448+00:00 pontifex2 kernel: [ 5236.646268][T25754] filename_lookup+0xe0/0x1c0
2022-08-18T10:31:31.125450+00:00 pontifex2 kernel: [ 5236.647686][T25754] ? try_to_unlazy+0x47/0x80
2022-08-18T10:31:31.125451+00:00 pontifex2 kernel: [ 5236.649171][T25754] ? terminate_walk+0x64/0xf0
2022-08-18T10:31:31.125453+00:00 pontifex2 kernel: [ 5236.650455][T25754] ? kmem_cache_alloc+0x4d/0x4c0
2022-08-18T10:31:31.125455+00:00 pontifex2 kernel: [ 5236.651903][T25754] ? path_lookupat+0x98/0x140
2022-08-18T10:31:31.125457+00:00 pontifex2 kernel: [ 5236.653071][T25754] ? vfs_statx+0x72/0x120
2022-08-18T10:31:31.125459+00:00 pontifex2 kernel: [ 5236.654256][T25754] vfs_statx+0x72/0x120
2022-08-18T10:31:31.125461+00:00 pontifex2 kernel: [ 5236.655240][T25754] __do_sys_newlstat+0x39/0x70
2022-08-18T10:31:31.125463+00:00 pontifex2 kernel: [ 5236.656504][T25754] ? _copy_to_user+0x1c/0x30
2022-08-18T10:31:31.125465+00:00 pontifex2 kernel: [ 5236.657584][T25754] ? cp_new_stat+0x150/0x190
2022-08-18T10:31:31.125466+00:00 pontifex2 kernel: [ 5236.658798][T25754] do_syscall_64+0x5b/0x80
2022-08-18T10:31:31.125468+00:00 pontifex2 kernel: [ 5236.659978][T25754] ? __do_sys_newlstat+0x48/0x70
2022-08-18T10:31:31.125469+00:00 pontifex2 kernel: [ 5236.661531][T25754] ? syscall_exit_to_user_mode+0x18/0x40
2022-08-18T10:31:31.125470+00:00 pontifex2 kernel: [ 5236.662952][T25754] ? do_syscall_64+0x67/0x80
2022-08-18T10:31:31.125472+00:00 pontifex2 kernel: [ 5236.664062][T25754] ? exc_page_fault+0x67/0x150
2022-08-18T10:31:31.125473+00:00 pontifex2 kernel: [ 5236.665414][T25754] entry_SYSCALL_64_after_hwframe+0x61/0xcb
2022-08-18T10:31:31.125476+00:00 pontifex2 kernel: [ 5236.666839][T25754] RIP: 0033:0x7fa4ff695f35
2022-08-18T10:31:31.125478+00:00 pontifex2 kernel: [ 5236.668055][T25754] Code: c3 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 29 2f 2e 00 f7 d8 64 89
2022-08-18T10:31:31.125481+00:00 pontifex2 kernel: [ 5236.673275][T25754] RSP: 002b:00007ffc18328098 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
2022-08-18T10:31:31.125482+00:00 pontifex2 kernel: [ 5236.675135][T25754] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa4ff695f35
2022-08-18T10:31:31.125485+00:00 pontifex2 kernel: [ 5236.677428][T25754] RDX: 00007ffc183281c0 RSI: 00007ffc183281c0 RDI: 00007ffc18328250
2022-08-18T10:31:31.125487+00:00 pontifex2 kernel: [ 5236.679413][T25754] RBP: 0000000000000000 R08: 00007ffc183282b8 R09: 0000000000000000
2022-08-18T10:31:31.125489+00:00 pontifex2 kernel: [ 5236.681350][T25754] R10: 2e3030333035312d R11: 0000000000000246 R12: 00007ffc18328250
2022-08-18T10:31:31.125491+00:00 pontifex2 kernel: [ 5236.683128][T25754] R13: 00007ffc183281c0 R14: 0000000000000002 R15: 0000000000000004
2022-08-18T10:31:31.125492+00:00 pontifex2 kernel: [ 5236.685106][T25754] </TASK>
2022-08-18T10:31:31.125494+00:00 pontifex2 kernel: [ 5236.685809][T25754] Modules linked in: tcp_diag inet_diag xt_comment iptable_raw ip6table_raw xt_CT nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs af_packet iscsi_ibft iscsi_boot_sysfs rfkill xt_pkttype xt_tcpudp ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 iptable_filter bpfilter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables cirrus kvm_amd drm_kms_helper ccp xfs cec libcrc32c rc_core joydev virtio_balloon virtio_net net_failover failover kvm syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_piix4 pcspkr button irqbypass nfsd auth_rpcgss nfs_acl lockd grace drm fuse sunrpc configfs ip_tables x_tables ext4 crc16 mbcache jbd2 crc32_pclmul crc32c_intel ata_generic ghash_clmulni_intel ata_piix ahci libahci aesni_intel uhci_hcd ehci_hcd crypto_simd cryptd libata usbcore serio_raw virtio_blk floppy qemu_fw_cfg sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc
2022-08-18T10:31:31.125497+00:00 pontifex2 kernel: [ 5236.685914][T25754] scsi_dh_alua scsi_mod
2022-08-18T10:31:31.125498+00:00 pontifex2 kernel: [ 5236.710801][T25754] Supported: Yes
2022-08-18T10:31:31.125500+00:00 pontifex2 kernel: [ 5236.711917][T25754] ---[ end trace dcd867a75c208976 ]---
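Editorial aside: the register dump in comment #8 can be tied back to the reported fault address. In the first "Code:" line the byte in angle brackets marks the instruction at RIP, here `8b 6b fc`. The sketch below decodes just that one instruction form (opcode 0x8b with a ModRM byte and an 8-bit displacement); it is a hand-rolled illustration with made-up helper names, not a general disassembler, and it assumes no prefix bytes apply, which matches the bytes shown.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load_disp8(code: bytes):
    """Decode 'mov r32, [base + disp8]' (opcode 0x8b, ModRM mod=01).

    Only this single encoding is handled; rm=100 (SIB byte) is rejected.
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 0b01 or rm == 0b100:
        raise ValueError("encoding not covered by this sketch")
    disp8 = disp - 256 if disp >= 128 else disp  # sign-extend the displacement
    return REG32[reg], REG64[rm], disp8


dest, base, disp8 = decode_mov_load_disp8(bytes.fromhex("8b6bfc"))
rbx = 0xB110BC3BA3D49FC0                    # RBX from the register dump
fault_addr = (rbx + disp8) & (2**64 - 1)    # wrap to 64 bits
print(f"mov {dest}, [{base}{disp8:+d}] -> access at {fault_addr:#x}")
```

The computed address, RBX minus 4, equals the 0xb110bc3ba3d49fbc named in the fault message, so the faulting load in `__d_lookup_rcu` went through a pointer that was already garbage when it was read into RBX.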
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c11

--- Comment #11 from Claudio Fontana <claudio.fontana@suse.com> ---
According to your attached host supportconfig, plugin-libvirt.txt does not show any issue at the time the errors appeared in the guest.
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c12

--- Comment #12 from Bernhard Wiedemann <bwiedemann@suse.com> ---
In the guest logs, the first OOM only appears at 2022-08-18T23:04:58, so I think this is only a side effect of the other breakage that started 2 minutes after migration.

The guest has 83 GB of RAM and only 9 GB are used atm (not counting caches+buffers).
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c16

--- Comment #16 from Bernhard Wiedemann <bwiedemann@suse.com> ---
Another interesting find:

atreju6-suse:~ # cat /proc/cpuinfo |grep lwp|wc -l
16
atreju6-suse:~ # cat /proc/cpuinfo |grep nodeid_msr|wc -l
32

Only half of the cores have the lwp bit set, and restarting libvirtd can make it appear or disappear in the virsh capabilities feature flags, probably if the process gets scheduled on the right 50%.

http://developer.amd.com/wordpress/media/2012/10/43724.pdf p23 says
> LWP is supported on a processor if CPUID Fn8000_0001_ECX[LWP] (bit 15) is set. This bit is identical to the value of CPUID Fn0000_000D_EDX_x0[bit 30], which is bit 62 of the XFeatureSupportedMask and indicates XSAVE support for LWP. A system can check either of those bits to determine if LWP is supported.

With the cpu host-passthrough mode we currently use, this random lwp availability could be the cause of trouble... unless guests never use it. And so far I did not find a single VM that listed the bit.
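Editorial aside: the two `grep | wc -l` counts in comment #16 show that lwp is present on 16 of 32 logical CPUs, but not which ones. A minimal sketch for listing every flag that differs between CPUs is shown below. It only parses text in /proc/cpuinfo format; the function names and the two-CPU sample are hypothetical and not taken from the affected host.

```python
def flags_by_cpu(cpuinfo_text: str) -> dict:
    """Map each 'processor' number to the set of names on its 'flags' line."""
    per_cpu = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "flags" and cpu is not None:
            per_cpu[cpu] = set(value.split())
    return per_cpu


def inconsistent_flags(per_cpu: dict) -> dict:
    """Return {flag: [cpus that have it]} for flags missing on some CPUs."""
    if not per_cpu:
        return {}
    union = set().union(*per_cpu.values())
    common = set.intersection(*per_cpu.values())
    return {
        flag: sorted(cpu for cpu, flags in per_cpu.items() if flag in flags)
        for flag in sorted(union - common)
    }


# Hypothetical two-CPU sample; real input would come from /proc/cpuinfo.
SAMPLE = """\
processor\t: 0
flags\t\t: fpu sse2 lwp nodeid_msr

processor\t: 1
flags\t\t: fpu sse2 nodeid_msr
"""
print(inconsistent_flags(flags_by_cpu(SAMPLE)))
```

On a real host the same functions could be fed `open("/proc/cpuinfo").read()`; an empty result would mean all CPUs advertise the same flags.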
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c17

--- Comment #17 from Claudio Fontana <claudio.fontana@suse.com> ---
Interesting, I _think_ there was a fix somewhere for the feature detection, let me dig a bit...
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541
http://bugzilla.opensuse.org/show_bug.cgi?id=1202541#c18

--- Comment #18 from Claudio Fontana <claudio.fontana@suse.com> ---
But if this is about the host CPU features being different, libvirt should not be the issue. I'd ensure that only the set of compatible CPUs is in a valid cpuset for running the VMs, or make sure that the feature does not appear in any VM, so you can run the VM on any CPU.
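Editorial aside on the cpuset suggestion in comment #18: once the CPUs that agree on their feature flags are known, they can be written as a compact range list of the kind cpuset settings commonly accept (for example "0-3,8-11"). The helper below is a small sketch with a hypothetical name and hypothetical CPU numbers; which CPUs on atreju6 actually carry the lwp flag is not stated in this thread.

```python
def to_cpuset(cpus) -> str:
    """Collapse CPU numbers into a range list such as '0-3,8-11,15'."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


# Hypothetical example: the CPUs found to share one consistent flag set.
print(to_cpuset([0, 1, 2, 3, 8, 9, 10, 11, 15]))
```

The resulting string would still need to be checked against the host topology before pinning any VM to it.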
participants (1)
-
bugzilla_noreply@suse.com