[Bug 997575] New: [20160901] filesystem turns RO after calling "zypper in && rm" in a loop
http://bugzilla.suse.com/show_bug.cgi?id=997575

Bug ID: 997575
Summary: [20160901] filesystem turns RO after calling "zypper in && rm" in a loop
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: okurz@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 691135
--> http://bugzilla.suse.com/attachment.cgi?id=691135&action=edit
y2log after filesystem turned read-only

## observation

Calling zypper install and remove of a package in a loop for many times fails at some time with an error message that the media point is bad or something. It turns out the btrfs filesystem was turned read-only. Testing on a physical notebook machine.

## steps to reproduce

* On a Tumbleweed 20160901 installation with btrfs filesystem call
* `for i in {1..1000000} ; do echo $i ; zypper -n in nfs-kernel-server ; zypper -n rm nfs-kernel-server ; done`
* see it fail after some time

## problem

Might be related to bug #990384 and I also intended to do a proper crosscheck after I had some similar symptoms yesterday with only qemu-x86_64 machines. I will try to reproduce again after a reboot (or reinstall).

Logs attached, please take a look.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
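The reproduction loop in the report keeps running after the filesystem has gone read-only. The following is a minimal sketch, not part of the original report, of the same loop extended to stop at the first cycle after which the root filesystem is mounted read-only. The helper names `is_readonly` and `run_cycles` are hypothetical, and the check assumes that `/proc/mounts` lists `ro` among the mount options once btrfs has forced the filesystem read-only.

```shell
# Sketch only: is_readonly and run_cycles are hypothetical helper names,
# not part of zypper, btrfs-progs, or the original report.

# Succeed if the mount options of the given mount point include "ro".
# $1: mount point, $2: mounts table in /proc/mounts format (default: /proc/mounts).
is_readonly() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }          # keep the last entry for this mount point
        END {
            n = split(opts, a, ",")
            for (i = 1; i <= n; i++)
                if (a[i] == "ro") exit 0
            exit 1                      # not mounted, or no "ro" option
        }' "${2:-/proc/mounts}"
}

# Repeat the install/remove cycle from the report until the root
# filesystem turns read-only or the cycle limit is reached.
# $1: package (default: nfs-kernel-server), $2: maximum number of cycles.
run_cycles() {
    rc_pkg=${1:-nfs-kernel-server}
    rc_max=${2:-1000000}
    rc_i=1
    while [ "$rc_i" -le "$rc_max" ]; do
        echo "cycle $rc_i"
        zypper -n in "$rc_pkg"
        zypper -n rm "$rc_pkg"
        if is_readonly /; then
            echo "root filesystem is read-only after $rc_i cycles" >&2
            dmesg | tail -n 50 >&2      # recent kernel messages for the bug report
            return 1
        fi
        rc_i=$((rc_i + 1))
    done
}
```

Run as root, for example `run_cycles nfs-kernel-server 100`. The kernel messages go to stderr rather than to a file, because a file on the affected filesystem could no longer be written at that point.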
http://bugzilla.suse.com/show_bug.cgi?id=997575#c1

--- Comment #1 from Oliver Kurz <okurz@suse.com> ---
Crosschecked on SLES 12 SP2 RC2: Doing the same test takes way longer, only reached 78 cycles so far, maybe because nfs-kernel-server is installed by default already, but no failures so far.
http://bugzilla.suse.com/show_bug.cgi?id=997575#c2

--- Comment #2 from Oliver Kurz <okurz@suse.com> ---
Created attachment 691189
--> http://bugzilla.suse.com/attachment.cgi?id=691189&action=edit
logs from leap

I could reproduce the same on Leap 42.2 Beta1, see attached logs. The error occurred after 15 cycles of "zypper in && rm".
http://bugzilla.suse.com/show_bug.cgi?id=997575

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Normal                      |Major
http://bugzilla.suse.com/show_bug.cgi?id=997575

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fdmanana@suse.com,
                   |                            |jeffm@suse.com,
                   |                            |tiwai@suse.com
http://bugzilla.suse.com/show_bug.cgi?id=997575#c3

--- Comment #3 from Filipe Manana <fdmanana@suse.com> ---
(In reply to Oliver Kurz from comment #0)
> Created attachment 691135 [details]
> y2log after filesystem turned read-only
>
> ## observation
>
> Calling zypper install and remove of a package in a loop for many times fails at some time with an error message that the media point is bad or something. It turns out the btrfs filesystem was turned read-only. Testing on a physical notebook machine.
>
> ## steps to reproduce
>
> * On a Tumbleweed 20160901 installation with btrfs filesystem call
> * `for i in {1..1000000} ; do echo $i ; zypper -n in nfs-kernel-server ; zypper -n rm nfs-kernel-server ; done`
> * see it fail after some time
>
> ## problem
>
> Might be related to bug #990384 and I also intended to do a proper crosscheck after I had some similar symptoms yesterday with only qemu-x86_64 machines. I will try to reproduce again after a reboot (or reinstall).
>
> Logs attached, please take a look.

Yes, it's the same as bug #990384. The same debugging instructions listed there are needed. This depends on specific timings, and at least snapshotting is happening in parallel (or has happened before that loop at least). The same has happened, and been reported, rarely upstream (an extent item is missing in the extent tree for some unknown reason).
http://bugzilla.suse.com/show_bug.cgi?id=997575#c4

--- Comment #4 from Filipe Manana <fdmanana@suse.com> ---
Good news is that I'm also able to reproduce it on a fresh Tumbleweed installation :)
Now to try to figure out the mess and what's happening with snapshotting and balance.
http://bugzilla.suse.com/show_bug.cgi?id=997575#c5

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CONFIRMED

--- Comment #5 from Oliver Kurz <okurz@suse.com> ---
ah, very good to hear. I could not reproduce the issue for the whole day for various reasons so I hope you can "help yourself" for now :-)
http://bugzilla.suse.com/show_bug.cgi?id=997575#c6

--- Comment #6 from Oliver Kurz <okurz@suse.com> ---
Trying to reproduce this on Leap 42.2 Beta1 and also a more recent build: In both cases on multiple tries I could somewhat reproduce issues but always ending up with an unresponsive system so no helpful logs could be gathered. After reboot, when the filesystem tries to replay the journal it takes ages and eventually fails with an OOM exception in plymouthd(!) so also there no luck.

I then installed Leap 42.1 and ran the test overnight. It stopped when the harddisk capacity was depleted because of snapper snapshots but did not fail with any kernel problems so I can assume neither my hardware setup is flawed nor that the issue has been in Leap 42.1, of course, testing also on btrfs+snapshots.

Now I will try an upgrade to Leap 42.2 and try again.
http://bugzilla.suse.com/show_bug.cgi?id=997575#c7

--- Comment #7 from Filipe Manana <fdmanana@suse.com> ---
(In reply to Oliver Kurz from comment #6)
> Trying to reproduce this on Leap 42.2 Beta1 and also a more recent build: In both cases on multiple tries I could somewhat reproduce issues but always ending up with an unresponsive system so no helpful logs could be gathered. After reboot, when the filesystem tries to replay the journal it takes ages and eventually fails with an OOM exception in plymouthd(!) so also there no luck.
>
> I then installed Leap 42.1 and ran the test overnight. It stopped when the harddisk capacity was depleted because of snapper snapshots but did not fail with any kernel problems so I can assume neither my hardware setup is flawed nor that the issue has been in Leap 42.1, of course, testing also on btrfs+snapshots.
>
> Now I will try an upgrade to Leap 42.2 and try again.

Thanks for your attempts to reproduce, Oliver. But I don't think you need to do it. I was able to reproduce it too, even with an upstream vanilla kernel. It's much easier to reproduce immediately after installing Tumbleweed (on the first boot), but not so easy after building a new kernel with extra logging/tracing/etc. or doing a few balances before.
http://bugzilla.suse.com/show_bug.cgi?id=997575

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                                    |Added
----------------------------------------------------------------------------
           Assignee|kernel-maintainers@forge.provo.novell.com  |fdmanana@suse.com
http://bugzilla.suse.com/show_bug.cgi?id=997575#c8

--- Comment #8 from Oliver Kurz <okurz@suse.com> ---
I just stumbled over this. Haven't heard about this for a long time. Isn't it solved by now? Maybe we had a corresponding SLES bug on top?
http://bugzilla.suse.com/show_bug.cgi?id=997575#c9

--- Comment #9 from Filipe Manana <fdmanana@suse.com> ---
(In reply to Oliver Kurz from comment #8)
> I just stumbled over this. Haven't heard about this for a long time. Isn't it solved by now? Maybe we had a corresponding SLES bug on top?

It is, for well over a year.
http://bugzilla.suse.com/show_bug.cgi?id=997575#c10

--- Comment #10 from Oliver Kurz <okurz@suse.com> ---
Well, you are the assignee so … :) set the bug to resolved?
http://bugzilla.suse.com/show_bug.cgi?id=997575#c11

Filipe Manana <fdmanana@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CONFIRMED                   |RESOLVED
         Resolution|---                         |FIXED

--- Comment #11 from Filipe Manana <fdmanana@suse.com> ---
I always thought that it was up to the reporter to confirm and close the bug report. So be it, done.
http://bugzilla.suse.com/show_bug.cgi?id=997575#c12

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |VERIFIED

--- Comment #12 from Oliver Kurz <okurz@suse.com> ---
Not according to https://wiki.microfocus.net/index.php/RD-OPS_QA/HowTos/Bugzilla_screening ;) But I can VERIFY :D