https://bugzilla.suse.com/show_bug.cgi?id=1200564
Bug ID: 1200564
Summary: io_uring instability on ppc64
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: PowerPC-64
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: dmueller@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---
When using aio=io_uring in QEMU for KVM virtual machine guests on a SLE15SP4 or an openSUSE Tumbleweed host kernel on a POWER8 or POWER9 machine, we see very rapid I/O corruption (within milliseconds to seconds) in the guest. On all other architectures things work perfectly fine.
--- Comment #1 from Dirk Mueller dmueller@suse.com --- We validated the issue with the SLE15SP4 update kernel as well as 5.17.x (IIRC x == 2) from Tumbleweed/openSUSE backports.
--- Comment #2 from Dirk Mueller dmueller@suse.com --- Actually 5.18.1 also fails the same way.
Takashi Iwai tiwai@suse.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |ddiss@suse.com, msuchanek@suse.com, tiwai@suse.com
David Disseldorp ddiss@suse.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |rgoldwyn@suse.com
Assignee|kernel-bugs@opensuse.org    |ddiss@suse.com
--- Comment #3 from David Disseldorp ddiss@suse.com --- Thanks for the report. Just to confirm, both host and VM are ppc64le? I'll try to find some orthos hw to reproduce this.
--- Comment #4 from Michal Suchanek msuchanek@suse.com --- I think it's worth filing an upstream bug report.
--- Comment #5 from Michal Suchanek msuchanek@suse.com --- There isn't really Orthos HW for this. The Orthos KVM hosts like shiraz or zinfandel can run arbitrary VMs but it's probably not advisable to bring them down to boot a different host kernel.
--- Comment #6 from David Disseldorp ddiss@suse.com --- (In reply to Michal Suchanek from comment #4)
> I think it's worth filing an upstream bug report.
Agreed.
> There isn't really Orthos HW for this. The Orthos KVM hosts like shiraz or zinfandel can run arbitrary VMs but it's probably not advisable to bring them down to boot a different host kernel.
Hmm, I might be able to give nested virtualization a shot(?). @Dirk: do you know of any hosts I might be able to use for this?
--- Comment #7 from Dirk Mueller dmueller@suse.com --- (In reply to Michal Suchanek from comment #4)
> I think it's worth filing an upstream bug report.
+1, I just need a bit of help drafting a proper upstream report. As you can see, the current information is probably not informative enough.
--- Comment #8 from Michal Suchanek msuchanek@suse.com --- You probably want
- qemu commandline and version
- kernel version of host and guest (the newer the better)
- the way to observe the corruption
For example, if you create a zero-filled disk, overwrite it with /dev/urandom, shut down the machine gracefully, and look at the disk, do you still see zeroes?
Or if you create a disk filled with /dev/urandom, write zeroes to it, and shut down the machine, do you see non-zero blocks?
Or do you write something in the VM and read back something else?
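The "write something, read back something else" check could be sketched roughly as below. This is only an illustration, not the reproducer used in this report: the block size, the seed, the helper names and the target path are all assumptions. It writes a deterministic per-block pattern, syncs it, and later reports which block indexes no longer match. Note that a read-back inside the same running guest may be served from the page cache, so for a meaningful check the verify pass would presumably run after a guest reboot or from the host.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size, not taken from the report


def pattern_block(index: int, seed: bytes = b"io_uring-check") -> bytes:
    """Return a deterministic BLOCK_SIZE-byte block derived from its index."""
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        out += hashlib.sha256(
            seed + index.to_bytes(8, "little") + counter.to_bytes(4, "little")
        ).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def write_pattern(path: str, blocks: int) -> None:
    """Write `blocks` pattern blocks to `path` and force them to storage."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())


def find_mismatches(path: str, blocks: int) -> list:
    """Return the indexes of blocks whose content differs from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            # A short read (truncated file) also counts as a mismatch.
            if f.read(BLOCK_SIZE) != pattern_block(i):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Hypothetical target: a scratch file; in a VM this would live on the
    # disk that is backed by aio=io_uring.
    target = os.path.join(tempfile.mkdtemp(), "pattern.img")
    write_pattern(target, 1024)
    print("mismatched blocks:", find_mismatches(target, 1024))
```

Because the pattern depends only on the block index and the seed, the write and verify passes can run in separate boots, and the list of bad indexes shows where on the disk the corruption landed.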
--- Comment #9 from Dirk Mueller dmueller@suse.com --- actually already the liburing embedded testsuite is failing on ppc64le.
Gabriel Krisman Bertazi gabriel.bertazi@suse.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
CC      |                            |gabriel.bertazi@suse.com
--- Comment #10 from Gabriel Krisman Bertazi gabriel.bertazi@suse.com --- Hi.
I got a machine to work on this. Based on comment 1, I understand this is ppc64le, not ppc64, correct?
> actually already the liburing embedded testsuite is failing on ppc64le.
Dirk, I know this was quite a while ago, but was the testsuite running on the host or the VM? We've got quite a few fixes to io_uring on SP4 since last year.
I'll try to reproduce both the testsuite error and the fio corruption and report back. This might already be fixed.
David Disseldorp ddiss@suse.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Assignee|ddiss@suse.com              |gabriel.bertazi@suse.com
--- Comment #11 from David Disseldorp ddiss@suse.com --- Handing over to Gabriel who's asked to take this...
(In reply to Gabriel Krisman Bertazi from comment #10)
> Hi.
> I got a machine to work on this. Based on comment 1, I understand this is ppc64le, not ppc64, correct?
> > actually already the liburing embedded testsuite is failing on ppc64le.
> Dirk, I know this was quite a while ago, but was the testsuite running on the host or the VM? We've got quite a few fixes to io_uring on SP4 since last year.
> I'll try to reproduce both the testsuite error and the fio corruption and report back. This might already be fixed.
IIRC the continuing testsuite failures on ppc64le appeared to be related to spurious EINTR syscall errors, which Dirk patched (in some places) via fa67f6aedcfdaffc14cbf0b631253477b2565ef0.
--- Comment #12 from Dirk Mueller dmueller@suse.com --- (In reply to Gabriel Krisman Bertazi from comment #10)
> I got a machine to work on this. Based on comment 1, I understand this is ppc64le, not ppc64, correct?
ppc64le indeed.
> > actually already the liburing embedded testsuite is failing on ppc64le.
> Dirk, I know this was quite a while ago, but was the testsuite running on the host or the VM? We've got quite a few fixes to io_uring on SP4 since last year.
We ran it on both, and the issue happened on both.
> I'll try to reproduce both the testsuite error and the fio corruption and report back. This might already be fixed.
That'd be nice. We can certainly do a new experiment with the current SP4 kernel and see how far we get.
--- Comment #13 from Dirk Mueller dmueller@suse.com --- (In reply to David Disseldorp from comment #11)
> IIRC the continuing testsuite failures on ppc64le appeared to be related to spurious EINTR syscall errors, which Dirk patched (in some places) via fa67f6aedcfdaffc14cbf0b631253477b2565ef0.
No, that's unrelated to this bug report. The fs corruption happens in real builds during VM-based OBS build jobs, while the patch above only fixed some testsuite issues.