[Bug 1200564] New: io_uring instability on ppc64
https://bugzilla.suse.com/show_bug.cgi?id=1200564

Bug ID: 1200564
Summary: io_uring instability on ppc64
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: PowerPC-64
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: dmueller@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

When using aio=io_uring in qemu for KVM virtual machine guests on a SLE15SP4 or an openSUSE Tumbleweed host kernel on a POWER8 or POWER9 machine, we see very rapid I/O corruption (within milliseconds to seconds) in the guest. On all other architectures things work perfectly fine.

--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c1

--- Comment #1 from Dirk Mueller <dmueller@suse.com> ---
We validated the issue with the SLE15SP4 update kernel as well as 5.17.x (iirc x==2) from Tumbleweed/openSUSE backports.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c2

--- Comment #2 from Dirk Mueller <dmueller@suse.com> ---
Actually 5.18.1 also fails the same way.
Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ddiss@suse.com,
                   |                            |msuchanek@suse.com,
                   |                            |tiwai@suse.com
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c3

David Disseldorp <ddiss@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rgoldwyn@suse.com
           Assignee|kernel-bugs@opensuse.org    |ddiss@suse.com

--- Comment #3 from David Disseldorp <ddiss@suse.com> ---
Thanks for the report. Just to confirm, both host and VM are ppc64le? I'll try to find some orthos hw to reproduce this.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c4

--- Comment #4 from Michal Suchanek <msuchanek@suse.com> ---
I think it's worth filing an upstream bug report.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c5

--- Comment #5 from Michal Suchanek <msuchanek@suse.com> ---
There isn't really Orthos HW for this. The Orthos KVM hosts like shiraz or zinfandel can run arbitrary VMs but it's probably not advisable to bring them down to boot a different host kernel.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c6

--- Comment #6 from David Disseldorp <ddiss@suse.com> ---
(In reply to Michal Suchanek from comment #4)
> I think it's worth filing an upstream bug report.

Agreed.

> There isn't really Orthos HW for this. The Orthos KVM hosts like shiraz or
> zinfandel can run arbitrary VMs but it's probably not advisable to bring
> them down to boot a different host kernel.

Hmm, I might be able to give nested virtualization a shot(?). @Dirk: do you know of any hosts I might be able to use for this?
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c7

--- Comment #7 from Dirk Mueller <dmueller@suse.com> ---
(In reply to Michal Suchanek from comment #4)
> I think it's worth filing an upstream bug report.

+1, I just need a bit of help in drafting a proper upstream report. As you can see, the current information is probably not informative enough.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c8

--- Comment #8 from Michal Suchanek <msuchanek@suse.com> ---
You probably want
- qemu commandline and version
- kernel version of host and guest (the newer the better)
- the way to observe the corruption, for example:
  - if you create a zero-filled disk, overwrite it with /dev/urandom, shut down the machine gracefully, and look at the disk, do you still see zeroes?
  - or if you create a disk filled with /dev/urandom, write zeroes to it, and shut down the machine, do you see non-zero blocks?
  - or do you write something in the VM and read back something else?
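The disk checks suggested in comment #8 could be scripted on the host roughly as follows. This is only a minimal sketch, assuming the guest disk is a raw image file that can be read after the VM has shut down; the function name `find_mismatched_blocks` and the 4 KiB block size are illustrative choices, not anything taken from this report.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size; adjust to match the guest's setup


def find_mismatched_blocks(path, expect_zero=True, block_size=BLOCK_SIZE):
    """Return the indices of blocks that contradict the expectation.

    expect_zero=True  -> report blocks containing any non-zero byte
                         (disk was zeroed, so non-zero data is suspicious)
    expect_zero=False -> report blocks that are still entirely zero
                         (disk was overwritten with random data)
    """
    zero_block = bytes(block_size)
    mismatched = []
    with open(path, "rb") as image:
        index = 0
        while True:
            block = image.read(block_size)
            if not block:
                break
            # The final block may be short, so compare against a matching slice.
            is_zero = block == zero_block[:len(block)]
            if is_zero != expect_zero:
                mismatched.append(index)
            index += 1
    return mismatched


if __name__ == "__main__":
    # Demo on a temporary file standing in for a disk image (hypothetical data).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(BLOCK_SIZE * 4))  # four zeroed blocks
        tmp.seek(BLOCK_SIZE * 2)
        tmp.write(b"\x01")                # dirty the third block
        demo_path = tmp.name
    print(find_mismatched_blocks(demo_path))  # prints [2]
    os.unlink(demo_path)
```

With `expect_zero=False`, a block of random data that happens to be all zeroes would be reported as a mismatch, but for 4 KiB blocks that is vanishingly unlikely. The list of block indices, together with the qemu command line and kernel versions, may be the kind of detail an upstream report needs.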
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c9

--- Comment #9 from Dirk Mueller <dmueller@suse.com> ---
Actually, even the liburing embedded testsuite is already failing on ppc64le.
Gabriel Krisman Bertazi <gabriel.bertazi@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gabriel.bertazi@suse.com
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c10

--- Comment #10 from Gabriel Krisman Bertazi <gabriel.bertazi@suse.com> ---
Hi.

I got a machine to work on this. Based on comment 1, I understand this is ppc64le, not ppc64, correct?

> actually already the liburing embedded testsuite is failing on ppc64le.

Dirk, I know this was quite a while ago, but was the testsuite running on the host or the VM? We've got quite a few fixes to io_uring on SP4 since last year.

I'll try to reproduce both the testsuite error and the fio corruption and report back. This might be already fixed.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c11

David Disseldorp <ddiss@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|ddiss@suse.com              |gabriel.bertazi@suse.com

--- Comment #11 from David Disseldorp <ddiss@suse.com> ---
Handing over to Gabriel who's asked to take this...

(In reply to Gabriel Krisman Bertazi from comment #10)
> Hi.
>
> I got a machine to work on this. Based on comment 1, I understand this is
> ppc64le, not ppc64, correct?
>
> > actually already the liburing embedded testsuite is failing on ppc64le.
>
> Dirk, I know this was quite a while, but was the testsuite running on the
> host or the VM? We've got quite a few fixes to io_uring on SP4 since last
> year.
>
> I'll try reproduce both the testsuite error and the fio corruption and
> report back. this might be already fixed.

IIRC the continuing testsuite failures on ppc64le appeared related to spurious EINTR syscall errors, which Dirk patched (in some places) via fa67f6aedcfdaffc14cbf0b631253477b2565ef0 .
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c12

--- Comment #12 from Dirk Mueller <dmueller@suse.com> ---
(In reply to Gabriel Krisman Bertazi from comment #10)
> I got a machine to work on this. Based on comment 1, I understand this is
> ppc64le, not ppc64, correct?

ppc64le indeed.

> > actually already the liburing embedded testsuite is failing on ppc64le.
>
> Dirk, I know this was quite a while, but was the testsuite running on the
> host or the VM? We've got quite a few fixes to io_uring on SP4 since last
> year.

We ran it on both; the issue happened on both.

> I'll try reproduce both the testsuite error and the fio corruption and
> report back. this might be already fixed.

That'd be nice. We can certainly do a new experiment with the current SP4 kernel and see how far we get.
https://bugzilla.suse.com/show_bug.cgi?id=1200564#c13

--- Comment #13 from Dirk Mueller <dmueller@suse.com> ---
(In reply to David Disseldorp from comment #11)
> IIRC the continuing testsuite failures on ppc64le appeared related to
> spurious EINTR syscall errors, which Dirk patched (in some places) via
> fa67f6aedcfdaffc14cbf0b631253477b2565ef0 .

No, that's unrelated to this bug report. The fs corruption happens in real builds during VM-based OBS build jobs, while the patch above only fixed some testsuite issues.