[Bug 1229019] New: KVM:Starting KVM error "ensure all devices within the iommu_group are bound to their vfio bus driver" after kernel update
https://bugzilla.suse.com/show_bug.cgi?id=1229019

Bug ID: 1229019
Summary: KVM:Starting KVM error "ensure all devices within the iommu_group are bound to their vfio bus driver" after kernel update
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.6
Hardware: x86-64
OS: All
Status: NEW
Severity: Major
Priority: P5 - None
Component: KVM
Assignee: kvm-bugs@suse.de
Reporter: jniko19aban@gmail.com
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
Build Identifier:

After starting a KVM VM I get this error:

"Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm='win10'): qxl_send_events: spice-server bug: guest stopped, ignoring
2024-08-08T01:45:51.824474Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:0b:00.0","id":"hostdev0","bus":"pci.2","addr":"0x0"}: vfio 0000:0b:00.0: group 83 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver."

The Intel I350-T2 NIC is properly configured for VFIO passthrough and worked without a single issue up to kernel 6.4.0-150600.23.14-default. After I upgraded to the latest stable kernel, 6.4.0-150600.23.17-default, the error described above appeared. I have 2 servers that behave in exactly the same way after the upgrade to the latest stable kernel. I have also tried to pass through the built-in Intel I-217, with the same result. If I revert to the old kernel, 6.4.0-150600.23.14-default, everything works as expected.

Reproducible: Always

Steps to Reproduce:
1. Enable the IOMMU for Intel in /etc/default/grub
2. Assign the NIC to the VM for passthrough via Virtual Machine Manager (on the VM: Add Hardware / PCI Host Device)
3. Start the VM guest.
Actual Results:
"Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm='win10'): qxl_send_events: spice-server bug: guest stopped, ignoring
2024-08-08T01:45:51.824474Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:0b:00.0","id":"hostdev0","bus":"pci.2","addr":"0x0"}: vfio 0000:0b:00.0: group 83 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver."

Expected Results:
Start the VM guest with the NIC passthrough properly and successfully.

--
You are receiving this mail because:
You are on the CC list for the bug.
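Step 1 above does not say which option was added to /etc/default/grub. A minimal sketch for checking the running kernel, assuming the option in question is the usual `intel_iommu=on` kernel parameter (an assumption, the report does not name it). The function name is hypothetical. Note that some kernels enable the Intel IOMMU by default, so a missing parameter does not prove the IOMMU is off.

```python
from pathlib import Path


def intel_iommu_enabled(cmdline: str) -> bool:
    """Return True if the kernel command line carries intel_iommu=...on...

    Assumption: the IOMMU was enabled with the intel_iommu= parameter,
    whose value may be a comma-separated list such as "on,igfx_off".
    """
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            options = token.split("=", 1)[1].split(",")
            if "on" in options:
                return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("intel_iommu=on present:", intel_iommu_enabled(proc_cmdline.read_text()))
```

Running this on both the working and the failing kernel would at least rule out a difference in boot parameters between the two.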
https://bugzilla.suse.com/show_bug.cgi?id=1229019

Jim Niko <jniko19aban@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
   Target Milestone|---                         |Leap 15.6
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c1

--- Comment #1 from Jim Niko <jniko19aban@gmail.com> ---
Also worth mentioning: on the 6.4.0-150600.23.14-default kernel, vfio-pci works correctly.

0b:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
        Subsystem: Intel Corporation Ethernet Server Adapter I350-T2
        Kernel driver in use: vfio-pci
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c2

--- Comment #2 from Jim Niko <jniko19aban@gmail.com> ---
Created attachment 876572
  --> https://bugzilla.suse.com/attachment.cgi?id=876572&action=edit
Log files with bound / unbound on kernel 6.4.0-150600.23.14-default
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c3

--- Comment #3 from Jim Niko <jniko19aban@gmail.com> ---
Created attachment 876573
  --> https://bugzilla.suse.com/attachment.cgi?id=876573&action=edit
Log with the error on 6.4.0-150600.23.17-default kernel
https://bugzilla.suse.com/show_bug.cgi?id=1229019

Malcolm Lewis <malcolmlewis@linuxmail.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |malcolmlewis@linuxmail.org
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c5

--- Comment #5 from Jim Niko <jniko19aban@gmail.com> ---
(In reply to James Fehlig from comment #4)
> (In reply to Jim Niko from comment #0)
> > The Intel I350-T2 NIC card is properly configured for VFIO passthrough, it was working correct without any single issue up to kernel 6.4.0-150600.23.14-default.
> > After I did upgrade to latest stable 6.4.0-150600.23.17-default the error described above happened.
>
> Thanks for the report. Switching component to kernel since updating it introduces the regression.

Thank you for your response. When can we get a fix for this?

Regards.
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c6

--- Comment #6 from Jim Niko <jniko19aban@gmail.com> ---
I have to add something.

After some searching, it seems that on the new kernel the two ports of the Intel I350-T2 NIC are no longer split into separate IOMMU groups, and this causes the regression. On the previous kernel each port was assigned to a different group and worked as expected.

This is the IOMMU group on the 6.4.0-150600.23.17-default kernel:

0b:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
        Subsystem: Intel Corporation Ethernet Server Adapter I350-T2
        Kernel driver in use: igb
        Kernel modules: igb
0b:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
        Subsystem: Intel Corporation Ethernet Server Adapter I350-T2
        Kernel driver in use: igb
        Kernel modules: igb

If I pass both ports through to the same VM, it works. If I try to pass through only 1 port, the error occurs.

I hope this will be fixed soon, because we have 11 servers with the same problem that cannot be updated.
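The grouping described in this comment can be read directly from sysfs. A minimal sketch, assuming the standard layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function names are hypothetical and the `root` parameter exists only so the code can be pointed at a copy of the tree.

```python
import os


def read_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number (as a string) to its sorted PCI addresses."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled, or sysfs not available here
    for group in os.listdir(root):
        devices_dir = os.path.join(root, group, "devices")
        if os.path.isdir(devices_dir):
            groups[group] = sorted(os.listdir(devices_dir))
    return groups


def group_of(device, groups):
    """Return the group that contains `device`, or None if it is not listed."""
    for group, devices in groups.items():
        if device in devices:
            return group
    return None


if __name__ == "__main__":
    groups = read_iommu_groups()
    # 0000:0b:00.0 and 0000:0b:00.1 are the two I350-T2 ports from this report.
    for port in ("0000:0b:00.0", "0000:0b:00.1"):
        print(port, "-> group", group_of(port, groups))
```

If both ports report the same group, passing through only one of them leaves the other port in the group bound to its host driver, which matches the behaviour described above.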
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c7

--- Comment #7 from Jim Niko <jniko19aban@gmail.com> ---
(In reply to Jim Niko from comment #6)
> I have to add something.
>
> After some search it seems that the Intel I350 T2 NIC card does not use split IOMMU for each port and create this regression. On previous kernel each port was assigned in different groups and worked as supposed to be.
>
> This is the IOMMU group on the 6.4.0-150600.23.17-default kernel:
>
> 0b:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
>         Subsystem: Intel Corporation Ethernet Server Adapter I350-T2
>         Kernel driver in use: igb
>         Kernel modules: igb
> 0b:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
>         Subsystem: Intel Corporation Ethernet Server Adapter I350-T2
>         Kernel driver in use: igb
>         Kernel modules: igb
>
> Now if I pass through both ports on the same VM it works ok.
> If I try to pass through 1 port it creates the regression.
>
> I hope to be fixed soon this because we have 11 servers which have the same problem and can not be updated........

It seems that SR-IOV is not working at all. I have now tried a quad-port Intel I350, with the same result.
https://bugzilla.suse.com/show_bug.cgi?id=1229019

Jim Niko <jniko19aban@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2 - High                   |P1 - Urgent
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c9

Samuel Mardel <samuelmardel@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |samuelmardel@gmail.com

--- Comment #9 from Samuel Mardel <samuelmardel@gmail.com> ---
Hi

I am having the same problem. I have a PCIe network card that I pass through to a KVM virtual machine, which is a VoIP server. This worked fine on the 6.4.0-150600.23.14-default kernel, but the virtual machine will not boot with the 6.4.0-150600.23.17-default or 6.4.0-150600.23.22-default kernel. I still have to boot back into 6.4.0-150600.23.14-default to get my virtual machine to boot. This card is a single-port card.

The error message when trying to boot the virtual machine manually is as follows:

"Error starting domain: internal error: QEMU unexpectedly closed the monitor (vm='Debian'):
2024-09-23T23:16:54.444742Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:04:00.0","id":"hostdev0","bus":"pci.4","addr":"0x0"}: vfio 0000:04:00.0: group 32 not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver."

I can confirm that everything in the group is bound (it's the only card in the group). It obviously worked before the last 2 kernel updates, and it still works when I manually boot into the 6.4.0-150600.23.14-default kernel.

If you need any logs, let me know where to get them and I'll upload them here.
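The error message asks for every device in the group to be bound to its vfio bus driver. A rough sketch of how that could be checked by hand, assuming the standard sysfs `driver` symlink under `/sys/bus/pci/devices/<PCI address>/`. The function names are hypothetical, and the rule is deliberately simplified: the kernel's real viability check has further cases that this sketch does not model, so treat the result as a hint only.

```python
import os


def bound_driver(device, pci_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(pci_root, device, "driver")
    if not os.path.islink(link):
        return None
    # The symlink points at the driver directory; its last component is the name.
    return os.path.basename(os.readlink(link))


def blocking_devices(devices, pci_root="/sys/bus/pci/devices", allowed=("vfio-pci",)):
    """List (device, driver) pairs whose bound driver is not in `allowed`.

    Simplification: unbound devices are treated as harmless and every other
    driver as blocking. This is an assumption, not the kernel's exact rule.
    """
    blockers = []
    for device in devices:
        driver = bound_driver(device, pci_root)
        if driver is not None and driver not in allowed:
            blockers.append((device, driver))
    return blockers
```

Called with the full list of addresses in the affected group, `blocking_devices` would show which group members are still attached to a host driver on the newer kernels.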
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c10

--- Comment #10 from Samuel Mardel <samuelmardel@gmail.com> ---
I've just had a look through the journal while running both versions of the kernel. 0000:04:00.0, which is the network card in question, is in its own group (group 40) with kernel 6.4.0-150600.23.14-default. On the newer kernels, the journal shows 11 devices under group 32, the group concerned with this problem, including 0000:04:00.0. Most of them have nothing attached. I have attached a few screenshots showing the difference between the 2 kernels. Don't know if this helps.
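The journal comparison described here could be automated along these lines. This is a sketch under a stated assumption: that the kernel logs each assignment as `<address>: Adding to iommu group <N>`. The sample lines are illustrative; only 0000:04:00.0 and the group numbers 40 and 32 come from this report, and 0000:03:00.0 is a hypothetical neighbour.

```python
import re
from collections import defaultdict

# Assumed kernel log format: "... <domain:bus:dev.fn>: Adding to iommu group <N>".
GROUP_LINE = re.compile(
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): Adding to iommu group (\d+)"
)


def groups_from_log(lines):
    """Build {group number: [PCI addresses]} from kernel log lines."""
    groups = defaultdict(list)
    for line in lines:
        match = GROUP_LINE.search(line)
        if match:
            address, group = match.groups()
            groups[int(group)].append(address)
    return dict(groups)


def peers_of(device, groups):
    """Return the other devices sharing a group with `device`, or None if unseen."""
    for members in groups.values():
        if device in members:
            return [m for m in members if m != device]
    return None


# Illustrative journal excerpts (hypothetical apart from 0000:04:00.0, 40 and 32).
old_kernel = ["kernel: pci 0000:04:00.0: Adding to iommu group 40"]
new_kernel = [
    "kernel: pci 0000:03:00.0: Adding to iommu group 32",
    "kernel: pci 0000:04:00.0: Adding to iommu group 32",
]

print("old kernel peers:", peers_of("0000:04:00.0", groups_from_log(old_kernel)))
print("new kernel peers:", peers_of("0000:04:00.0", groups_from_log(new_kernel)))
```

An empty peer list on the old kernel and a non-empty one on the new kernel would correspond to the change in grouping reported in this comment.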
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c11

--- Comment #11 from Samuel Mardel <samuelmardel@gmail.com> ---
(In reply to Samuel Mardel from comment #10)
> I've just had a look through the journal while running both versions of the kernel. 0000:04:00.0 which is the network card in question is in its own group with kernel 6.4.0-150600.23.14-default - group 40. On the newer kernels, it is showing 11 devices under group 32, including 0000:04:00.0, the group concerned with this problem - most of them have nothing attached. I have attached a few screenshots showing between the 2 kernels. Don't know if this helps.

I meant group 15, not group 40 as in the last comment
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c12

--- Comment #12 from Samuel Mardel <samuelmardel@gmail.com> ---
(In reply to Samuel Mardel from comment #11)
> (In reply to Samuel Mardel from comment #10)
> > I've just had a look through the journal while running both versions of the kernel. 0000:04:00.0 which is the network card in question is in its own group with kernel 6.4.0-150600.23.14-default - group 40. On the newer kernels, it is showing 11 devices under group 32, including 0000:04:00.0, the group concerned with this problem - most of them have nothing attached. I have attached a few screenshots showing between the 2 kernels. Don't know if this helps.
>
> I meant group 15, not group 40 as in the last comment

Ignore this, I was right the first time - group 40. It is confusing keeping track of what's where!
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c13

--- Comment #13 from Samuel Mardel <samuelmardel@gmail.com> ---
Created attachment 877516
  --> https://bugzilla.suse.com/attachment.cgi?id=877516&action=edit
6.4.0-150600.23.14-default journal
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c14

--- Comment #14 from Samuel Mardel <samuelmardel@gmail.com> ---
Created attachment 877517
  --> https://bugzilla.suse.com/attachment.cgi?id=877517&action=edit
6.4.0-150600.23.22-default journal
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c27

--- Comment #27 from Samuel Mardel <samuelmardel@gmail.com> ---
Hi

I have just tried your kernel - the virtual machine is now booting with your patch.

Sam
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c37

--- Comment #37 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> ---
This is an autogenerated message for OBS integration:
This bug (1229019) was mentioned in
https://build.opensuse.org/request/show/1203745 Factory / kernel-source
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c42

--- Comment #42 from Samuel Mardel <samuelmardel@gmail.com> ---
I have tested kernel-default-6.4.0-150600.1.1.g75ad405 and I can also confirm this works for me - my devices are isolating correctly and my virtual machine is booting fine with this.
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c57

Samuel Mardel <samuelmardel@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(jniko19aban@gmail
                   |                            |.com)

--- Comment #57 from Samuel Mardel <samuelmardel@gmail.com> ---
Hi

I have installed the latest 6.4.0-150600.23.25-default today with good effect - PCI devices are grouping correctly now and the virtual machine is running OK. Consider this resolved for me. @Jim Niko, don't know if you want to just make sure the latest kernel works at your end?

Thank you for all your help sorting this!

Kind regards
Sam
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c58

--- Comment #58 from Jim Niko <jniko19aban@gmail.com> ---
(In reply to Samuel Mardel from comment #57)
> Hi
>
> I have installed the latest 6.4.0-150600.23.25-default today with good effect - PCI devices are grouping correctly now and virtual machine is running OK. Consider this resolved for me. @Jim Niko, don't know if you want to just make sure the latest kernel works your end?
>
> Thank you for all your help sorting this! Kind regards Sam

I can confirm that everything is working correctly on my side. I installed the latest kernel and the VM starts normally.

I think this issue has been resolved.

Well done to everyone involved, this thread can be closed now.

Best regards,
Jim
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c59

Samuel Mardel <samuelmardel@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|IN_PROGRESS                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #59 from Samuel Mardel <samuelmardel@gmail.com> ---
(In reply to Jim Niko from comment #58)
> (In reply to Samuel Mardel from comment #57)
> > Hi
> >
> > I have installed the latest 6.4.0-150600.23.25-default today with good effect - PCI devices are grouping correctly now and virtual machine is running OK. Consider this resolved for me. @Jim Niko, don't know if you want to just make sure the latest kernel works your end?
> >
> > Thank you for all your help sorting this! Kind regards Sam
>
> I can confirm that everything is working correct on my side. Installed the latest kernel and the VM start normally.
>
> I think that this issue has resolved.
>
> Well done to everyone involved, this thread can be closed now.
>
> Best regards, Jim

@Jim Niko, thanks. Marking resolved.
https://bugzilla.suse.com/show_bug.cgi?id=1229019#c65

--- Comment #65 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> ---
This is an autogenerated message for OBS integration:
This bug (1229019) was mentioned in
https://build.opensuse.org/request/show/1217138 Factory / kernel-source