[Bug 1049574] New: snapper is blocked for >= 7 days -> not cleaning up snapshots, manual clean blocked
http://bugzilla.suse.com/show_bug.cgi?id=1049574 Bug ID: 1049574 Summary: snapper is blocked for >= 7 days -> not cleaning up snapshots, manual clean blocked Classification: openSUSE Product: openSUSE Distribution Version: Leap 42.2 Hardware: Other OS: Other Status: NEW Severity: Normal Priority: P5 - None Component: Basesystem Assignee: bnc-team-screening@forge.provo.novell.com Reporter: okurz@suse.com QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- ## observation snapper is running for 7 days now on my machine. It looks like its hanging trying to communicate with dbus ``` Thread 1 (Thread 0x7fb70a09d880 (LWP 29289)): #0 0x00007fb708f251f0 in __poll_nocancel () from /lib64/libc.so.6 #1 0x00007fb7097ac268 in ?? () from /lib64/libdbus-1.so.3 #2 0x00007fb7097ab1cf in ?? () from /lib64/libdbus-1.so.3 #3 0x00007fb7097955fc in ?? () from /lib64/libdbus-1.so.3 #4 0x00007fb709795fa9 in ?? () from /lib64/libdbus-1.so.3 #5 0x00007fb70979656d in dbus_connection_send_with_reply_and_block () from /lib64/libdbus-1.so.3 #6 0x000000000043093d in ?? () #7 0x00000000004201c8 in ?? () #8 0x0000000000423a7e in ?? () #9 0x00000000004225df in ?? () #10 0x0000000000422868 in ?? () #11 0x0000000000422b65 in ?? () #12 0x0000000000414827 in ?? () #13 0x00000000004104d7 in ?? () #14 0x00007fb708e646e5 in __libc_start_main () from /lib64/libc.so.6 #15 0x00000000004120d9 in ?? () A debugging session is active. ``` processes: ``` root 29284 0.0 0.0 11868 2748 ? SN Jul13 0:00 \_ /bin/sh /etc/cron.daily/suse.de-snapper root 29289 0.0 0.0 56096 5008 ? SN Jul13 0:00 \_ snapper --config=root --quiet cleanup empty-pre-post okurz 25683 0.0 0.0 10536 1656 pts/8 S+ 08:15 0:00 | \_ grep --color=auto snap root 32676 0.0 0.0 292212 13208 ? Sl Jul13 0:20 /usr/sbin/snapperd ``` Version installed: snapper-0.3.3-2.2 ## reproducible I observed this multiple times already on my server instance. ## expected result snapper cleanup should stop itself after some time. ## Workaround Terminate snapper forcefully. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c1
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c2
Steffen Winterfeldt
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c3
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c4
--- Comment #4 from Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c5
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c6
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c7
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c8
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c9
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c10
--- Comment #10 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c11
--- Comment #11 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c12
--- Comment #12 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c13
--- Comment #13 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c14
--- Comment #14 from Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c15
--- Comment #15 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c16
--- Comment #16 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c17
--- Comment #17 from Nikolay Borisov
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c18
Oliver Kurz
Can you provide the output of echo w > /proc/sysrq-trigger when the system is in this state.
Thanks for the hint. That is a good idea. In the meantime I upgraded to openSUSE Leap 42.3 and did not observe that problem after 2 weeks so I assume we will never reproduce the problem. Closing ticket. Of course you can reopen it.
Also how many snapshots do you have
As I cleaned them manually whenever this problem hit me there were less than 10 snapshots in all occassions.
and are you using btrfs' qgroups?
Yes, I think so because it was a default openSUSE Leap installation (originally 42.1, then upgraded to 42.2).
Also is this btrfs a local filesystem
yes
or exported via nfs and you are just the client ?
no
Can you also upload the stacktrace of the snapperd threads which are waiting on the ioctl via cat /proc/PID/stack, I'm interested in the kernel-level stack traces.
Sorry, nothing more available anymore as the problem did not reappear as stated above.
Also do you see any btrfs related process taking hogging the CPU ?
No btrfs related process hogging the CPU. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c19
--- Comment #19 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c20
--- Comment #20 from Nikolay Borisov
Created attachment 791669 [details] snapper log, blocked tasks, backtrace of all snapper and snapperd threads
I managed to observe what looks like exactly the same problem on a SLES12SP3 host which got upgraded from SLE12SP1->SLE12SP2->SLE12SP3.
Please find attached the logs from snapper, the backtrace of both snapper and snapperd threads as well as the output from "sysrq w", all blocked tasks. The output from /proc/$(pidof snapperd)/stack:
``` [<ffffffff81222a45>] poll_schedule_timeout+0x45/0x60 [<ffffffff81223eac>] do_sys_poll+0x38c/0x4f0 [<ffffffff812240cd>] SyS_poll+0x5d/0xe0 [<ffffffff8161de61>] entry_SYSCALL_64_fastpath+0x20/0xe9 [<ffffffffffffffff>] 0xffffffffffffffff ```
Both snapper and snapperd seem to be running since Dec 01 on that host, that is since the last reboot.
I don't see anything related to btrfs in those logs. The blocked_tasks log is essentially empty, it just shows the scheduler's statistics and no kernel-level blocked tasks. The stacktrace you've shown here means that snapperd's main thread is waiting to be woken up and this is confirmed from the bt log of the main thread: Thread 1 (Thread 0x7fe23ec43880 (LWP 4330)): #0 0x00007fe23d8a630d in poll () from /lib64/libc.so.6 The snapperd process is comparing dirs according to the log in tmp/snapperd_bt_all.log and has submitted some ioctl to btrfs from the context of thread 18169. If the machine is still in this state could you provide the output of : /proc/18169/stack ? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c21
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c22
--- Comment #22 from Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c23
--- Comment #23 from Nikolay Borisov
# cat /proc/18169/stack [<ffffffff812178ac>] pipe_wait+0x5c/0x90 [<ffffffff812179b7>] pipe_write+0x87/0x430 [<ffffffff8120f0e0>] __vfs_write+0xd0/0x140 [<ffffffff8120fd3d>] vfs_write+0x9d/0x190 [<ffffffffa02bfc6c>] write_buf+0x5c/0x90 [btrfs] [<ffffffffa02c873b>] btrfs_ioctl_send+0x9bb/0x10c0 [btrfs] [<ffffffffa028db37>] btrfs_ioctl+0x3e7/0x23a0 [btrfs] [<ffffffff81222063>] do_vfs_ioctl+0x2e3/0x4c0 [<ffffffff812222b4>] SyS_ioctl+0x74/0x80 [<ffffffff8161de61>] entry_SYSCALL_64_fastpath+0x20/0xe9 [<ffffffffffffffff>] 0xffffffffffffffff
that now looks like btrfs related, right? ;)
This is a start! write_buf+0x5c/0x90 is actually the writebuf call in send_header so right at the beginning of sending the subvolume. Looking at send_header all it does is send BTRFS_SEND_STREAM_MAGI (which has the value: "btrfs-stream") and the stream version (1) to the fd pointed by : sctx->send_filp . From the stack trace it's clear that this fd actually points to a pipe and this fd is in turn derived from:
sctx->send_filp = fget(arg->send_fd);
So it's filled in by the caller of the ioctl (in this case snapperd). I assume it's some sort of a FIFO/named pipes. According to man: If a process attempts to read from an empty pipe, then read(2) will block until data is available. If a process attempts to write to a full pipe (see below), then write(2) blocks until sufficient data has been read from the pipe to allow the write to complete. So what I believe is happening is that the pipe passed by snapperd to btrfs is full (dunno if it's been re-used for other purposes, but not filled by btrfs since it hangs when sending the first couple of bytes defining a send-stream), naturally the users of the pipe (in this case btrfs send) blocks waiting for space in the pipe to be freed - and space can be freed by someone consuming it. This someone must be snapperd. So here I don't see anything related to btrfs per-se apart from the pipe passed from user space is not being read. In any case I think someone from snapper team needs to come here and explain how exactly they are managing the pipe and why could they be missing/ignoring a wake up. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c24
Oliver Kurz
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Jiri Slaby
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c25
Lukas Ocilka
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c26
--- Comment #26 from Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
Arvin Schnell
http://bugzilla.suse.com/show_bug.cgi?id=1049574
http://bugzilla.suse.com/show_bug.cgi?id=1049574#c27
Oliver Kurz
Oliver, is Leap 42.2 the right product? Is that an "evergreen" release?
I don't say we should not take a look, I just want to target the right product and I want to give it the right priority.
Sure, that makes sense. Well, Leap 42.2 is just the "version of the software the bug was found in." according to the help text in bugzilla. I am not sure if I should update this field if I see the problem as well in a more recent product. Leap 42.2 *is* EOL and to my knowledge there are no more "evergreen" releases. However I managed to see what looks like exactly the same problem on a SLE12SP3 installation on a different machine with https://bugzilla.suse.com/show_bug.cgi?id=1049574#c19 I gave access to this machine to aschnell to be able to investigate further. -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com