[Bug 1099494] New: zypper calls snapper, snapshot creation failed while snapper already running, still stalling system
http://bugzilla.suse.com/show_bug.cgi?id=1099494 Bug ID: 1099494 Summary: zypper calls snapper, snapshot creation failed while snapper already running, still stalling system Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: Other OS: Other Status: NEW Severity: Major Priority: P5 - None Component: Other Assignee: bnc-team-screening@forge.provo.novell.com Reporter: okurz@suse.com QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- Created attachment 775554 --> http://bugzilla.suse.com/attachment.cgi?id=775554&action=edit y2logs, snapper.log, system journal, snapper cleanup journal, zypper.log ## Observation On an openSUSE Tumbleweed installation on a HDD disk (btrfs, encrypted, i.e. slow performance) calling zypper dup proposed to install around 3000 packages - as expected for openSUSE Tumbleweed upgrades. During that process it seems snapper cleanup was triggered in parallel but zypper also tried to create a snapshot but failed to do so effectively stalling the system for a very long time. Logfiles show quite some errors. E.g. from zypper.log: ``` 018-06-28 10:54:26 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! INFO:root:creating post snapshot 2018-06-28 10:54:26 <5> linux-9akz(17693) [zypp] Exception.cc(log):166 PluginExecutor.cc(doSend):133 CAUGHT: Not ready to read within timeout. 2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):134 Not ready to read within timeout. 2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):140 Bad plugin response from PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py: PluginF rame[](0){0} 2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):141 (Expected ACK or _ENOMETHOD) 2018-06-28 10:54:26 <1> linux-9akz(17693) [zypp::plugin++] PluginScript.cc(close):229 Close:PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py 2018-06-28 10:54:26 <1> linux-9akz(17693) [zypp::plugin++] PluginScript.cc(send):272 PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py ->send PluginFrame[_DISCONNECT](0){0} 2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:creating snapshot failed: 2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. ``` whereas snapper-cleanup reports ``` 2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):91 - Snapper constructor 2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):92 - libsnapper version 0.5.4 2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):93 - config_name:root disable_filters:false 2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(reload):114 - loading file /etc/snapper/configs/root 2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:SUBVOLUME value:/ 2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:FSTYPE value:btrfs 2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:QGROUP value:1/0 2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:SYNC_ACL value:no 2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):125 - subvolume:/ filesystem:btrfs 2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(loadIgnorePatterns):174 - number of ignore patterns:8 2018-06-28 10:54:03 MIL libsnapper(4747) Snapshot.cc(read):245 - found 16 snapshots 2018-06-28 10:54:03 MIL libsnapper(4747) SystemCmd.cc(SystemCmd):46 - constructor SystemCmd:"/usr/lib/snapper/plugins/grub --refresh" 2018-06-28 10:54:03 MIL libsnapper(4747) SystemCmd.cc(execute):82 - SystemCmd Executing:"/usr/lib/snapper/plugins/grub --refresh" 2018-06-28 10:54:05 MIL libsnapper(4747) SystemCmd.cc(doExecute):267 - stopwatch 1.439245s for "/usr/lib/snapper/plugins/grub --refresh" 2018-06-28 10:54:05 MIL libsnapper(4747) SystemCmd.cc(doExecute):287 - system() Returns:0 2018-06-28 10:54:05 WAR libsnapper(4747) FileUtils.cc(SDir):88 - THROW: open failed path://.snapshots errno:2 (No such file or directory) 2018-06-28 10:55:01 MIL libsnapper(4747) SystemCmd.cc(SystemCmd):46 - constructor SystemCmd:"/usr/lib/snapper/plugins/grub --refresh" 2018-06-28 10:55:01 MIL libsnapper(4747) SystemCmd.cc(execute):82 - SystemCmd Executing:"/usr/lib/snapper/plugins/grub --refresh" 2018-06-28 10:55:03 MIL libsnapper(4747) SystemCmd.cc(doExecute):267 - stopwatch 1.765667s for "/usr/lib/snapper/plugins/grub --refresh" ``` so it looks like snapper was starting up just in before, coincidence? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c1 --- Comment #1 from Stefan Hundhammer <shundhammer@suse.com> --- I wouldn't expect anything else than the system hanging in this situation. - Creating a snapshot for a lot of changes -> triggering a snapshot cleanup - Requesting yet another snapshot while all this is running What exactly do you expect to happen? Creating the first snapshot needs more disk space, so that disk space needs to be cleaned up first. That cleanup takes time, and only after that the first snapshot can really be started. If then already yet another snapshot is requested, what should happen? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c2 Michael Andres <ma@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|ma@suse.com |aschnell@suse.com --- Comment #2 from Michael Andres <ma@suse.com> --- (In reply to Oliver Kurz from comment #0)
it seems snapper cleanup was triggered in parallel but zypper also tried to create a snapshot...
Well, ZYPP does execute installed commit plugins and but we must rely on them behaving well. If the sapper plugins misbehave, it's snapper issue. Assigning to Arvin. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c3 Arvin Schnell <aschnell@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|aschnell@suse.com |yast2-maintainers@suse.de --- Comment #3 from Arvin Schnell <aschnell@suse.com> --- In general snapper can handle to run a cleanup (which can require the very slow operation of comparing snapshots) and create new snapshots at the same time. Normally creating a snapshot only takes a fraction of a second. If the system has such high load that this operation takes more than 30 seconds there is not much snapper can do about that. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c4 Arvin Schnell <aschnell@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |kukuk@suse.com Flags| |needinfo?(kukuk@suse.com) --- Comment #4 from Arvin Schnell <aschnell@suse.com> --- Thorsten, Lukas wants you to decide here as the architect how to proceed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c5 Thorsten Kukuk <kukuk@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags|needinfo?(kukuk@suse.com) | --- Comment #5 from Thorsten Kukuk <kukuk@suse.com> --- (In reply to Stefan Hundhammer from comment #1)
Creating the first snapshot needs more disk space,
Creating a snapshot does not need disk space with btrfs. Updating the RPMs later needs disk space. (In reply to Arvin Schnell from comment #4)
Thorsten, Lukas wants you to decide here as the architect how to proceed.
For me, the problem isn't really clear. If I understand this correct, then the problem is, that this is already a slow disk, and the timeouts happen, because too much is accessing the harddisk? In this case I agree, there is not much we can do. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c6 --- Comment #6 from Oliver Kurz <okurz@suse.com> --- Well, the original issue is not about the system being slow due to snapshot handling but that the snapshot creation actually *failed*. AFAIK my system is fulfilling the system requirements so I expect that the snapshot handling might slow down my system but not fail. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 Knut Alejandro Anderssen González <knut.anderssen@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Priority|P5 - None |P3 - Medium Status|NEW |CONFIRMED URL| |https://trello.com/c/4Hslnl | |yf CC| |knut.anderssen@suse.com Assignee|yast2-maintainers@suse.de |yast-internal@suse.de -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c7 --- Comment #7 from Knut Alejandro Anderssen González <knut.anderssen@suse.com> --- @Arvin, taking in account last comments, I will add this bug to our Incoming Trello Board in order to prioritize it during next sprints -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c8 Lukas Ocilka <locilka@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags| |needinfo?(aschnell@suse.com | |) --- Comment #8 from Lukas Ocilka <locilka@suse.com> --- (In reply to Oliver Kurz from comment #6)
Well, the original issue is not about the system being slow due to snapshot handling but that the snapshot creation actually *failed*.
And that is the question: Did snapper really fail or did the caller (of snapper) give up and decided to time-out? As I can read in the desc: PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:creating snapshot failed: 2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. A.) If snapper failed, then we need to fix snapper B.) But if it just did not send a reply soon enough, we might need to extend timeouts or periodically check whether snapper is still doing something. Arvin? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c9 Arvin Schnell <aschnell@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags|needinfo?(aschnell@suse.com | |) | --- Comment #9 from Arvin Schnell <aschnell@suse.com> --- AFAIS snapper does not fail and libzypp simple runs into the timeout. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c10 Lukas Ocilka <locilka@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Component|Other |libzypp Assignee|yast-internal@suse.de |zypp-maintainers@forge.prov | |o.novell.com --- Comment #10 from Lukas Ocilka <locilka@suse.com> --- Reassigning according to the previous comment. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c11 Michael Andres <ma@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags| |needinfo?(aschnell@suse.com | |) --- Comment #11 from Michael Andres <ma@suse.com> --- Messages '[PLUGIN] PluginScript.cc(~PluginDumpStderr):' simply log the lines a plugin writes to it's stderr. So the message
! ERROR:root: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
was written by snapper.py not by libzypp. zypp timeout: According to the log:
snapper.py ->send PluginFrame[COMMITEND](0){370884} THROW: Not ready to read within timeout.
It's true that snapper.py does not acknowledge the receipt of the COMMITEND message (within 30 sec.). Due to this zypp stops serving the plugin.
snapper.py ->send PluginFrame[_DISCONNECT](0){0} snapper.py <-PluginFrame[ACK](0){0} snapper.py -> [0]
Zypp sends a _DISCONNECT to inform the plugin and the plugin acknowledges the receipt. If _DISCONNECT is acknowledged, zypp does not kill the script, but leaves it running in the background [1]. AFAICS the plugin is not killed by libzypp. OTOH it is not able to delay the transaction until it has completed (but that's actually intended). So what to change? [1] https://doc.opensuse.org/projects/libzypp/HEAD/zypp-plugins.html -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c12 --- Comment #12 from Arvin Schnell <aschnell@suse.com> --- Maybe python dbus adds another timeout. I will check that. But with the issue only being P3 for openSUSE it might take months until I have time. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 Michael Andres <ma@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Component|libzypp |Other Assignee|zypp-maintainers@forge.prov |aschnell@suse.com |o.novell.com | Flags|needinfo?(aschnell@suse.com | |) | -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c13 Arvin Schnell <aschnell@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|aschnell@suse.com |yast-internal@suse.de --- Comment #13 from Arvin Schnell <aschnell@suse.com> --- A trello card to prioritised the issue with the other tasks already exists. Changing assignee to yast-internal to follow the process. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c14 Lukas Ocilka <locilka@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |okurz@suse.com Flags| |needinfo?(okurz@suse.com) --- Comment #14 from Lukas Ocilka <locilka@suse.com> --- Oliver, how often does this change? Do we see that also is some customers' scenarios? Is this openQA-seen-only issue? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c15 Oliver Kurz <okurz@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- See Also| |https://bugzilla.suse.com/s | |how_bug.cgi?id=1063638 Flags|needinfo?(okurz@suse.com) | --- Comment #15 from Oliver Kurz <okurz@suse.com> --- Hi Lukas, (In reply to Lukas Ocilka from comment #14)
Oliver, how often does this change?
I am sorry, I do not know what you refer to with "change"? Do you mean how often this would "happen"? Then I can say that *explicitly* I have investigated this situation only once in detail to find out this cause as the investigation effort is pretty high. However the domain btrfs snapshots in general is a problematic one, e.g. see https://bugzilla.suse.com/show_bug.cgi?id=1063638 with all the non-SUSE comments there.
Do we see that also is some customers' scenarios?
I have not directly seen confirmations by customers but yes, I assume this is very likely to also show up in customer's scenarios given that I have not conducted anything "synthetic", just normal "zypper dup".
Is this openQA-seen-only issue?
This issue has *not* been seen by openQA but in manual operation by me. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 http://bugzilla.suse.com/show_bug.cgi?id=1099494#c16 --- Comment #16 from Oliver Kurz <okurz@suse.com> --- I created https://github.com/openSUSE/snapper/pull/437 which might help a bit in at least similar situations even though not the same as it concerns the "timeline" service and this bug involves the "cleanup" service. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1099494 Oliver Kurz <okurz@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- See Also| |http://bugzilla.suse.com/sh | |ow_bug.cgi?id=1028286 -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1099494 Lukas Ocilka <locilka@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|yast-internal@suse.de |aschnell@suse.com -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1099494 https://bugzilla.suse.com/show_bug.cgi?id=1099494#c17 Arvin Schnell <aschnell@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags| |needinfo?(okurz@suse.com) --- Comment #17 from Arvin Schnell <aschnell@suse.com> --- The snapper plugin for zypper has been completely rewritten in C++ last year. Does the problem still happen? -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1099494 https://bugzilla.suse.com/show_bug.cgi?id=1099494#c18 Oliver Kurz <okurz@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|CONFIRMED |RESOLVED Resolution|--- |WORKSFORME Flags|needinfo?(okurz@suse.com) | --- Comment #18 from Oliver Kurz <okurz@suse.com> --- At least I have not observed that myself since then so let's assume it's at least behaving different after the rewrite :) -- You are receiving this mail because: You are on the CC list for the bug.
participants (2)
-
bugzilla_noreply@novell.com
-
bugzilla_noreply@suse.com