[Bug 1080485] New: dracut: device mapper cannot be disassembled on shutdown
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485

Bug ID: 1080485
Summary: dracut: device mapper cannot be disassembled on shutdown
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: Stromeko@NexGo.DE
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

For a few months now, dracut has been spewing about two pages of errors/warnings on shutdown. Yesterday I halted the system instead of powering it off so that I could read them. It seems that the unmount of /oldroot fails and then the disassembly of the device-mapper devices errors out (there may have been earlier errors that had already scrolled off the screen). Interestingly, I sometimes get far fewer of these errors when a kernel update has just been installed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c1 Neil Rickert <nwr10cst-oslnx@yahoo.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |nwr10cst-oslnx@yahoo.com --- Comment #1 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- I am probably seeing the same errors. I haven't reported, because they go past too quickly to grab a copy. I've tried to reproduce in a KVM virtual machine, but without success. I don't worry much about it, because the file systems check out as clean. So nothing is being corrupted. Maybe it is trying to disassemble something that does not need to be disassembled (i.e. could just be abandoned). I get this on both Tumbleweed and Leap 15.0. And occasionally I see a clean shutdown after updates. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c7 --- Comment #7 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- I'll note that comment 5 of bug 1083392 seems relevant. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c9 Ronnie Bailey <purevw@wtxs.net> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |purevw@wtxs.net --- Comment #9 from Ronnie Bailey <purevw@wtxs.net> ---
There are 2 primary errors on my system. I had to shoot a video to be able to catch and read them. There may be 60 or more repeats over a half-second period. I should note that this happens 80% of the time, so there are times when shutdown is without errors.

Error 1: device-mapper: remove ioctl on system-root failed: Device or resource busy
(system-root is my encrypted LVM LUKS root partition)

Error 2: device-mapper: remove ioctl on cr_scsi_hard_drive_partition_containing_root

My /home is also a partition within the same LVM, but never produces errors. SLES 11 SP4 had a problem that produced the same error in 2016, but it's possible that the cause is not identical. The SUSE document is 7017390. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c10 --- Comment #10 from Ronnie Bailey <purevw@wtxs.net> --- I had a typo in my previous post. Error 2 should be: Error 2: device-mapper: remove ioctl on cr_scsi_hard_drive_partition_containing_root failed: Device or Resource busy -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c11 --- Comment #11 from Martin Wilck <martin.wilck@suse.com> --- Could you try this (https://mirrors.edge.kernel.org/pub/linux/utils/boot/dracut/dracut.html#debu...):

    # mkdir -p /run/initramfs/etc/cmdline.d
    # echo "rd.debug rd.break=pre-shutdown rd.break=shutdown" > /run/initramfs/etc/cmdline.d/debug.conf
    # touch /run/initramfs/.need_shutdown

You should now be able to watch better what's going on.
remove ioctl on system-root failed
Once we're back in the initramfs, shutting down the root LV should work - unless it's still in use by some process. This is only a real problem if the situation persists (e.g. because a file system hasn't been unmounted). bug 1083392 is quite interesting in this respect. It seems to be quite common that such "remove ioctl" errors occur transiently, and succeed after a few retries. The error messages are printed nonetheless. If it's just that, we may want to simply silence the error message and only print a message after the last retry has failed. -- You are receiving this mail because: You are on the CC list for the bug.
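The idea of staying silent during transient failures and printing a message only after the last retry has failed could be sketched roughly as follows. This is a hypothetical illustration, not dracut's or LVM's actual code: the helper name `retry_quiet`, the attempt count, and the message text are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not part of dracut): run a command up to $1 times,
# discarding its error output on every attempt, and report a failure
# only if all attempts have failed.
retry_quiet() {
    _max=$1
    shift
    _try=1
    while [ "$_try" -le "$_max" ]; do
        # Silence stderr so that transient "busy" failures stay quiet.
        if "$@" 2>/dev/null; then
            return 0
        fi
        _try=$((_try + 1))
        # A real implementation might sleep briefly between attempts.
    done
    echo "giving up after $_max attempts: $*" >&2
    return 1
}
```

It might be called as `retry_quiet 40 dmsetup remove system-root`, where the device name is taken from the error messages reported above.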
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c12 --- Comment #12 from Achim Gratz <Stromeko@NexGo.DE> --- Well, I don't know how to make screenshots on a system that has no connection to the outside world anymore. Besides, one problem is (as said in the initial mailing list thread) that there are so many messages that the trigger has scrolled off the screen and I can't scroll back on the dracut console (or at least not far enough). So if anybody can tell me how to create a netconsole that is active throughout the boot/shutdown process, I can debug things from a different box; otherwise that will have to wait until I can temporarily repurpose the serial port. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c13 --- Comment #13 from Martin Wilck <martin.wilck@suse.com> --- Please try setting STARTMODE=nfsroot for the network connection that is used for netconsole. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c14 --- Comment #14 from Ronnie Bailey <purevw@wtxs.net> --- In my case, the last suggestion in comment 11 may work. I used to get shutdown time-outs of 90 seconds while the system was waiting for TeamViewer (3rd party) to shut down. I no longer get that error (possibly due to updates), but it is possible that TeamViewer is what is causing the delay in releasing root. In another forum, someone suggested allowing the first device-mapper error to be printed, then remaining silent unless there is a total failure in unmounting. Or, if there were a way to add a 2-second delay to allow more time to complete the release of root, that might help. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c15 --- Comment #15 from Martin Wilck <martin.wilck@suse.com> --- Affected people, please check if /var/run and /var/lock on your systems are directories (bind mounts) or symlinks. The former may be the case if the system is an old installation with a long history of being updated. The latter would be the case on new installations. If they are directories, please move them away (this will require an umount -l first) and create symlinks:

    /var/run -> /run
    /var/lock -> /run/lock

I've recently been debugging a similar case where this workaround helped. -- You are receiving this mail because: You are on the CC list for the bug.
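A quick way to run the check requested in comment 15 might look like the sketch below. The helper name `path_kind` is made up for illustration; the script only reports what it finds and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: classify a path as symlink, directory, or other.
# The symlink test comes first because [ -d ] follows symlinks.
path_kind() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "other or missing"
    fi
}

# Report the two paths named in comment 15.
for p in /var/run /var/lock; do
    printf '%s: %s\n' "$p" "$(path_kind "$p")"
done
```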
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c16 Christoph Obexer <cobexer@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |cobexer@gmail.com --- Comment #16 from Christoph Obexer <cobexer@gmail.com> --- (In reply to Martin Wilck from comment #15)
> Affected people, please check if /var/run and /var/lock on your systems are directories (bind mounts) or symlinks.

Both are symlinks on my system but I still experience the disassembly problem. My symlinks are apparently from 28. June 2017.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c17 --- Comment #17 from Martin Wilck <martin.wilck@suse.com> --- (In reply to Christoph Obexer from comment #16)
Both are symlinks on my system but I still experience the disassembly problem. My Symlinks are apparently from 28. June 2017.
OK, so my guess was wrong. Maybe we can make progress if someone could gather logs as described under "Shutdown Completes Eventually" on https://freedesktop.org/wiki/Software/systemd/Debugging/. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c18 Michiel Janssens <michiel@nexigon.net> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |michiel@nexigon.net --- Comment #18 from Michiel Janssens <michiel@nexigon.net> --- Created attachment 769904 --> http://bugzilla.opensuse.org/attachment.cgi?id=769904&action=edit /proc/cmdline -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c19 --- Comment #19 from Michiel Janssens <michiel@nexigon.net> --- Created attachment 769906 --> http://bugzilla.opensuse.org/attachment.cgi?id=769906&action=edit fstab -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c20 --- Comment #20 from Michiel Janssens <michiel@nexigon.net> --- Created attachment 769907 --> http://bugzilla.opensuse.org/attachment.cgi?id=769907&action=edit rdsosreport_preshutdown -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c21 --- Comment #21 from Michiel Janssens <michiel@nexigon.net> --- Created attachment 769908 --> http://bugzilla.opensuse.org/attachment.cgi?id=769908&action=edit rdsosreport_shutdown -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c22 --- Comment #22 from Michiel Janssens <michiel@nexigon.net> --- Created attachment 769909 --> http://bugzilla.opensuse.org/attachment.cgi?id=769909&action=edit systemd_debug_shutdown-log -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c23 --- Comment #23 from Michiel Janssens <michiel@nexigon.net> --- As I was also getting these shutdown messages, I tried to get some logging info as requested. See the attached files. I hope this will be helpful in determining whether the messages should be suppressed or not. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Erwin Lam <erwinl@dds.nl> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |erwinl@dds.nl -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c24 --- Comment #24 from Ronnie Bailey <purevw@wtxs.net> --- I had opened a thread in the openSUSE Tumbleweed forum regarding this problem. A user has found a workaround that may at least give a hint as to what is going on. It was posted that: "Adding "plymouth.enable=0" to the kernel parameters makes the "device-mapper: remove ioctl on XXXXX failed: Device or resource busy" messages disappear." I have no idea if Plymouth is the problem or just a direction to look. I can verify that adding the above kernel parameter has stopped the error in my Tumbleweed machine. My original thread may be found at: https://forums.opensuse.org/showthread.php/530530-device-mapper-remove-ioctl... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c25 José Díaz <jdiaz@felino.cl> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |jdiaz@felino.cl --- Comment #25 from José Díaz <jdiaz@felino.cl> --- Same as Comment 24, I confirm the error messages disappear when disabling plymouth in a laptop running Leap 15. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c26 --- Comment #26 from Martin Wilck <martin.wilck@suse.com> --- My guess would be that disabling plymouth only hides the messages. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c27 --- Comment #27 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- Created attachment 780445 --> http://bugzilla.opensuse.org/attachment.cgi?id=780445&action=edit Screenshot showing the error messages. Something changed with snapshot 20180818. I am now seeing these errors on every shutdown. This is with a KVM virtual machine, where I previously could never reproduce the problem. And I do have "plymouth.enable=0". Responding to c#26:
My guess would be that disabling plymouth only hides the messages.
I disagree with that. My guess is that plymouth is using something from the root file system, perhaps a dynamic ".so" library. But now it is happening even with plymouth disabled. So some other component is now using something from the root file system. NOTE: I did check. I booted to rescue mode (with installer DVD), and an "fsck" on all file systems showed them to be clean. My personal opinion: this is all a mistake. The system should not try to disassemble device mapper setup. It should just reset the system and abandon the current setup. What makes the device mapper setup is all in volatile storage. So it goes away on reset. No need to tear it down. What's important is that everything has been remounted as read-only (or has been unmounted). -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c28 --- Comment #28 from Martin Wilck <martin.wilck@suse.com> --- (In reply to Neil Rickert from comment #27)
I am now seeing these errors on every shutdown. This is with a KVM virtual machine, where I previously could never reproduce the problem. And I do have "plymouth.enable=0".
IIUC you see these messages for a while, then they stop and the reboot succeeds (how long? how many messages?). If some library from the root FS were actually in use by a process running outside of it, then the error should persist and the remove ioctl should consistently fail, causing the shutdown process to time out.
My personal opinion: this is all a mistake. The system should not try to disassemble device mapper setup.
I'm not sure where this happens. I guess it's the stop job of "blk-availability.service". Could you try to mask that service and see if the problem disappears? Wrt this being a "mistake", the blkdeactivate script was introduced to LVM by Red Hat in 2012. I didn't find an explicit rationale, but as they're unlikely to have written that code without a purpose, I guess they were seeing data corruption of some sort if it's not done. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c29 --- Comment #29 from Neil Rickert <nwr10cst-oslnx@yahoo.com> ---
(how long? how many messages)?
They go past so fast, that it is very hard to tell. I would guess that it is around 100 messages. As soon as I could see the messages, I paused the VM so that I could take a screenshot of the paused virtual machine. Otherwise they would go past too fast to capture.
Could you try to mask that service and see if the problem disappears?
How and when should I do that? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c30 --- Comment #30 from Martin Wilck <martin.wilck@suse.com> --- (In reply to Neil Rickert from comment #29)
(how long? how many messages)?
They go past so fast, that it is very hard to tell.
So it's really a transient condition, and the problem is mainly the irritating log messages, no real damage done. Might be solvable by adding a short sleep somewhere (but I can't say where :-/ ).
Could you try to mask that service and see if the problem disappears?
How and when should I do that?
When the system is up, run "systemctl mask blk-availability.service". You probably need to reboot after that. But wait - first check whether that service is enabled at all ("systemctl status blk-availability.service"). If it isn't, we have to find another culprit. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c31 --- Comment #31 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- I agree that no real damage is done. It seems to be a harmless bug.

----
# systemctl status blk-availability.service
● blk-availability.service - Availability of block devices
   Loaded: loaded (/usr/lib/systemd/system/blk-availability.service; disabled;
   Active: inactive (dead)
---

I guess the service is not running. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c32 Martin Wilck <martin.wilck@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |IN_PROGRESS --- Comment #32 from Martin Wilck <martin.wilck@suse.com> ---
I agree that no real damage is done.
OK, good. If you're willing to invest some more work nonetheless: As you're seeing it on a VM, maybe you could provide a serial console log taken with "systemd.log_target=console systemd.journald.forward_to_console=1 console=ttyS0,57600n8"? In libvirt, you can connect e.g. "screen" to the serial port of the VM (which appears as /dev/pts/$X device on the host, you can see it e.g. with "virsh dumpxml"), and log the screen session. qemu has various methods to redirect a virtual serial console to a file. The reason I'm asking is to identify at which point exactly in the shutdown sequence the messages occur. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c33 --- Comment #33 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- Yes, I'm willing to spend more time on this. Note, however, that I am relatively new to using VMs. So I might need a little time to get up to speed in setting up a serial console log. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c34 --- Comment #34 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- Created attachment 780732 --> http://bugzilla.opensuse.org/attachment.cgi?id=780732&action=edit serial console log Well, that was easier than expected. I have attached the serial console log. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c35 --- Comment #35 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- Created attachment 780777 --> http://bugzilla.opensuse.org/attachment.cgi?id=780777&action=edit Serial console log (second try) I see that I missed some of the instruction (the part about journal). So I have repeated this, and now there is a lot more output. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Maximilen Bullett <mlbullett@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mlbullett@gmail.com -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Matthias Gensler <gensler@gmx.de> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |gensler@gmx.de -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c36 --- Comment #36 from Achim Gratz <Stromeko@NexGo.DE> --- The problem has stopped appearing somewhere around the 4.18.5 kernel update on Tumbleweed and has not returned so far. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c37 --- Comment #37 from Neil Rickert <nwr10cst-oslnx@yahoo.com> ---
The problem has stopped appearing somewhere around the 4.18.5 kernel update on Tumbleweed and has not returned so far.
I am still seeing the problem. I have just updated Tumbleweed on three systems. And then I rebooted twice (to make sure that this was with everything fully updated to 20181110). I checked on the second reboot.
(1) A virtual machine under KVM: Still seeing the problem.
(2) A Lenovo ThinkServer: Still seeing the problem.
(3) A Dell laptop: I am no longer seeing the problem on this system.
I guess that means there's a hardware dependency on whether this happens. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c38 --- Comment #38 from Ronnie Bailey <purevw@wtxs.net> --- I also continue to see the messages, although they disappear during shutdown immediately after a kernel update. After the following boot, the problem is back. I might note that my practice is to always remove the existing kernel at the same time I update to a newer kernel and not wait for the system to do it at another time. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c39 Andrei Dziahel <develop7@develop7.info> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |develop7@develop7.info --- Comment #39 from Andrei Dziahel <develop7@develop7.info> --- *** Bug 1116154 has been marked as a duplicate of this bug. *** -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c40 - - <mikccc@tutanota.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mikccc@tutanota.com --- Comment #40 from - - <mikccc@tutanota.com> --- https://openqa.opensuse.org/tests/799295#step/shutdown/6 On my system: With 'rd.break=shutdown' (after the error messages) /oldroot remains mounted with a 'ro' option, but one time it was 'rw' (because something related with btrfs balance or fs sync timed out). It seems that dracut doesn't succeed after many failed retries but gives up. (Adding the 'rd.debug' option results in a complete freeze of my system every time) -- You are receiving this mail because: You are on the CC list for the bug.
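To see at a breakpoint whether /oldroot and anything below it is still mounted, and whether it is 'ro' or 'rw', something along these lines could be used. It is only a sketch: the helper name `mount_modes` is an assumption, and it relies on the usual /proc/mounts layout (device, mount point, type, options).

```shell
#!/bin/sh
# Hypothetical helper: list mounts at or below a prefix together with
# their ro/rw state, reading /proc/mounts-style input.
#   $1 = mount point prefix, $2 = mount table (default: /proc/mounts)
mount_modes() {
    _prefix=$1
    _table=${2:-/proc/mounts}
    awk -v p="$_prefix" '
        # Match the prefix itself or any mount point underneath it.
        $2 == p || index($2, p "/") == 1 {
            mode = "?"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
            print $2, mode
        }' "$_table"
}
```

Running `mount_modes /oldroot` at the shutdown breakpoint would print nothing if everything under /oldroot has been unmounted.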
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Maximilen Bullett <mlbullett@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|mlbullett@gmail.com | -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c41 --- Comment #41 from - - <mikccc@tutanota.com> --- There are 252 (42 * 6) lines of device-mapper error messages in Neil's log. Dracut tries 42 times and gives up:

    _cnt=0
    while [ $_cnt -le 40 ]; do
        _check_shutdown && break
        _cnt=$(($_cnt+1))
    done
    [ $_cnt -ge 40 ] && _check_shutdown final

https://github.com/dracutdevs/dracut/blob/master/modules.d/99shutdown/shutdo... -- You are receiving this mail because: You are on the CC list for the bug.
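The count of 42 attempts can be checked by running the quoted loop against a stub. In this sketch `_check_shutdown` is replaced by a stand-in that always fails and counts its calls (an assumption that models the case where /oldroot never becomes free); the trailing `|| true` is added only so the script also runs under `set -e`.

```shell
#!/bin/sh
# Stub standing in for dracut's _check_shutdown: always fails and
# counts how often it is called.
calls=0
_check_shutdown() {
    calls=$((calls + 1))
    return 1
}

# Loop as quoted in comment 41: _cnt runs from 0 to 40, i.e. 41 passes.
_cnt=0
while [ $_cnt -le 40 ]; do
    _check_shutdown && break
    _cnt=$(($_cnt+1))
done
# One more "final" call after the loop gives 41 + 1 = 42 attempts.
[ $_cnt -ge 40 ] && _check_shutdown final || true

echo "$calls"   # prints 42
```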
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c42 --- Comment #42 from - - <mikccc@tutanota.com> ---

shutdown:/# /oldroot/usr/bin/lsof -R /oldroot
COMMAND  PID PPID USER  FD  TYPE DEVICE SIZE/OFF   NODE NAME
none    1454    2 root mem   REG   0,44  2078080 181864 /oldroot/lib64/libc-2.27.so
none    1454    2 root mem   REG   0,44   181048 181856 /oldroot/lib64/ld-2.27.so
lsof    2891 2889 root txt   REG   0,44   168376  11130 /oldroot/usr/bin/lsof
lsof    2892 2891 root txt   REG   0,44   168376  11130 /oldroot/usr/bin/lsof

-- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c43 --- Comment #43 from - - <mikccc@tutanota.com> --- After running `kill -s TERM 1454` I could unmount /oldroot with no error. I also found a line in the logs identifying the mysterious kthread named [none]:
kernel: bpfilter: Loaded bpfilter_umh pid 1454
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Brian King <brking@linux.vnet.ibm.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |brking@linux.vnet.ibm.com -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Hanns-Joachim Uhl <hannsj_uhl@de.ibm.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |hannsj_uhl@de.ibm.com -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c48 --- Comment #48 from Neil Rickert <nwr10cst-oslnx@yahoo.com> --- I have installed kernel 4.20.1-1.g5978cc8-default. After that, I have rebooted once and then shutdown once. Both reboot and shutdown were clean with this kernel. They used to be very noisy with those "remove ioctl" fails. This was in a KVM virtual machine. I have not tested on real hardware. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c49 --- Comment #49 from - - <mikccc@tutanota.com> --- First of all, I have to correct my last comment - the 'none' process wasn't a kthread, because it lacked the PF_KTHREAD flag. (In reply to Martin Wilck from comment #46)
Here's something of interest:
There is also a patch in systemd 240: https://github.com/systemd/systemd/commit/e45154c770567507cc51eb78b28a1fae1f... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c51 Andrei Dziahel <develop7@develop7.info> changed: What |Removed |Added ---------------------------------------------------------------------------- Flags|needinfo?(mikccc@tutanota.c | |om) | --- Comment #51 from Andrei Dziahel <develop7@develop7.info> ---
Could you please test with 4.20.1 as requested in comment 47?
Worked like a charm for me. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c52 --- Comment #52 from - - <mikccc@tutanota.com> --- (In reply to Martin Wilck from comment #50)
Could you please test with 4.20.1 as requested in comment 47?
I'm sorry, but I would prefer to wait for the release in the official repo. (Won't it appear any day now?) I have recently had even more issues with the rd.break=shutdown option, so I'll try to give you a detailed answer then, but I cannot guarantee it. Could you perhaps consider adding an extra test to openQA, which could show most reliably whether the patches have had the desired effect? It would also be useful for detecting similar issues in the future.
There is also a patch in systemd 240: https://github.com/systemd/systemd/commit/e45154c770567507cc51eb78b28a1fae1fcdf396
That one looks orthogonal with what we are discussing here. It deals with _not killing_ kernel threads, where we must ensure that systemd _does kill_ this user mode helper.
It corrects kthread distinguishing to ensure killing user mode helpers. Checking the PF_KTHREAD flag is the right solution IMO and dracut too should implement it. The old hacky way - checking if /proc/{PID}/cmdline is empty, isn't reliable enough as we all have seen ;) I consider the kernel patches as a workaround of the problem. -- You are receiving this mail because: You are on the CC list for the bug.
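For illustration, a PF_KTHREAD check based on /proc/<pid>/stat could look like the sketch below. It is not the systemd patch or dracut code: the helper name `is_kthread_stat` is made up, and the stat lines shown in the usage note are synthetic examples. It relies on the flags value being the ninth field of /proc/<pid>/stat and on PF_KTHREAD being 0x00200000 in the kernel's include/linux/sched.h.

```shell
#!/bin/sh
# PF_KTHREAD from the kernel's include/linux/sched.h (0x00200000).
PF_KTHREAD=2097152

# Hypothetical helper: decide from one /proc/<pid>/stat line whether the
# task is a kernel thread. The comm field may contain spaces or ")",
# so everything up to the last ") " is cut off before splitting.
is_kthread_stat() {
    _rest=${1##*) }
    # Fields after comm: state ppid pgrp session tty_nr tpgid flags ...
    set -- $_rest
    [ $(( $7 & $PF_KTHREAD )) -ne 0 ]
}
```

A call such as `is_kthread_stat "$(cat /proc/1454/stat)"` would return non-zero for a user mode helper like the one described above, even though its cmdline may look empty. With a made-up line like `2 (kthreadd) S 0 0 0 0 -1 69238880 0 0` it returns zero, because 69238880 (0x4208060) has the PF_KTHREAD bit set.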
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c53 Martin Wilck <martin.wilck@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |coolo@suse.com --- Comment #53 from Martin Wilck <martin.wilck@suse.com> --- (In reply to - - from comment #52)
Maybe could you consider adding an extra test to openqa, which could show you most reliably if the patches have had the desired effect? It would also be useful for detecting similar issues in the future.
I'm not aware of an OpenQA test that reproduces this behavior. Anyway, that's really not my area of expertise. @coolo, do you see a chance to do this? @all, do we have a fool-proof setup procedure for a VM to reproduce this problem?
I consider the kernel patches as a workaround of the problem.
Not sure what you meant to say, but judging from the other testers' feedback, it looks as if these patches were actually the solution to the problem described in this bug. I wouldn't call this a workaround. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c54 --- Comment #54 from - - <mikccc@tutanota.com> --- (In reply to Martin Wilck from comment #53)
I consider the kernel patches as a workaround of the problem.
Not sure what you meant to say, but judging from the other testers' feedback, it looks as if these patches were actually the solution to the problem described in this bug. I wouldn't call this a workaround.
Workarounds are solutions by definition. Poor solutions.

Anyway, I can confirm that the issue has been solved in the Tumbleweed 20190115 snapshot:
    kernel-default-4.20.0-1.5.x86_64
    systemd-239-3.1.x86_64
    dracut-044.1-22.1.x86_64

The error messages have stopped appearing.

/oldroot and /oldroot/var mounts:
- both ro at the pre-shutdown breakpoint;
- not listed at the shutdown breakpoint.

The cmdline entry in /proc/<pid> for the bpfilter umh process contains "bpfilter_umh" followed by a null byte. The dir isn't available at the shutdown breakpoint (or, in other words, the process has been killed earlier, as expected).
@all, do we have a fool-proof setup procedure for a VM to reproduce this problem?
I suspect that the issue affects every system where bpfilter_umh is started, but verbose errors appear only if dracut's shutdown hook is enabled.
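The /proc check described in comment 54 could be scripted roughly as follows. This is only a sketch: the function name `find_umh_pids` and its `proc_root` parameter are made up for illustration, and it assumes the helper's cmdline starts with "bpfilter_umh" followed by a null byte, as reported above.

```python
import os


def find_umh_pids(proc_root="/proc", name=b"bpfilter_umh"):
    """Return the PIDs whose first cmdline field equals `name`.

    /proc/<pid>/cmdline holds the arguments separated by null bytes,
    so the first field is everything before the first b"\\0".
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited in the meantime, or is not readable
        if raw.split(b"\0", 1)[0] == name:
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    # An empty list means no bpfilter_umh process is currently running.
    print(find_umh_pids())
```

An empty result at the shutdown breakpoint would match the observation that the process has already been killed by then.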
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c55 Martin Wilck <martin.wilck@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|IN_PROGRESS |RESOLVED Resolution|--- |FIXED --- Comment #55 from Martin Wilck <martin.wilck@suse.com> --- (In reply to - - from comment #54)
Workarounds are solutions by definition. Poor solutions.
So you consider the addition of the two kernel patches above a poor solution? Sorry to hear that, I thought it was quite a good one.
Anyway, I can confirm that the issue has been solved in the Tumbleweed 20190115 snapshot: kernel-default-4.20.0-1.5.x86_64 systemd-239-3.1.x86_64 dracut-044.1-22.1.x86_64
The error messages have stopped appearing.
OK, great.
@all, do we have a fool-proof setup procedure for a VM to reproduce this problem?
I suspect that the issue affects every system where bpfilter_umh is started, but verbose errors appear only if dracut's shutdown hook is enabled.
I haven't yet figured out which service starts the bpfilter_umh process. Have you? I suspected firewalld, but I've seen bpfilter_umh running on systems with firewalld disabled.
Closing the bug per comments 48, 51, and 54.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c56 --- Comment #56 from Ronnie Bailey <purevw@wtxs.net> --- I upgraded to kernel-default-4-20 this morning. After reboot into the new kernel, and shut-down, the problem still exists.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c57 Martin Wilck <martin.wilck@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED |--- --- Comment #57 from Martin Wilck <martin.wilck@suse.com> --- (In reply to Ronnie Bailey from comment #56)
I upgraded to kernel-default-4-20 this morning. After reboot into the new kernel, and shut-down, the problem still exists.
Sorry to hear that. Your problem must then be different from the one the other people are seeing. Could you please try the debugging steps described in comment 40 and comment 42?
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c58 --- Comment #58 from - - <mikccc@tutanota.com> --- (In reply to Martin Wilck from comment #55)
@all, do we have a fool-proof setup procedure for a VM to reproduce this problem?
I suspect that the issue affects every system where bpfilter_umh is started, but verbose errors appear only if dracut's shutdown hook is enabled.
I haven't yet figured out which service starts the bpfilter_umh process. Have you? I suspected firewalld, but I've seen bpfilter_umh running on systems with firewalld disabled.
I suspect NetworkManager (on the basis of the order of log entries and boot messages).
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c59 --- Comment #59 from Ronnie Bailey <purevw@wtxs.net> ---
I set network management to "Wicked" via Yast. NetworkManager closed. I still got the messages on shutdown. Something curious though. On reboot, NetworkManager was running. I went to Yast and it still showed Wicked as the service to use.
Earlier, I removed the bpfilter module. lsmod had shown it being used by 0. When I initiated shutdown, the first console message I got was "starting bpfilter". But during that shutdown there were no messages.
I am not sure if it would make a difference, but I have never migrated to btrfs. My volumes are formatted as ext4.
Shutdowns on my system typically take 90 seconds or so, which seems quite long to me. One of the final messages I get before the long delay is: "Stopping monitoring of LVM2 partitions" or something to that effect. After a very long delay, I get the error messages which flash by quickly, just before shutdown.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c60 --- Comment #60 from Martin Wilck <martin.wilck@suse.com> --- (In reply to Ronnie Bailey from comment #59)
I set network management to "Wicked" via Yast. NetworkManager closed. I still got the messages on shutdown. Something curious though. On reboot, NetworkManager was running. I went to Yast and it still showed Wicked as the service to use.
Different issue, please open a separate bug.
Earlier, I removed the bpfilter module. lsmod had shown it being used by 0. When I initiated shutdown, the first console message I got was "starting bpfilter". But during that shutdown there were no messages.
Could you try that after activating systemd debugging ("systemd-analyze log-level debug")? It might show us which service loads the module.
Shutdowns on my system typically take 90 seconds or so, which seems quite long to me.
Yes indeed. 90s is the standard systemd timeout for jobs that don't finish.
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c61 --- Comment #61 from - - <mikccc@tutanota.com> --- (In reply to Martin Wilck from comment #55)
I haven't yet figured out which service starts the bpfilter_umh process. Have you? I suspected firewalld, but I've seen bpfilter_umh running on systems with firewalld disabled.
After a little research I found that it can be anything that calls `(set|get)sockopt`. https://github.com/torvalds/linux/commit/97adaddaa6db7a8af81b9b11e30cbe3628c...
However, I agree with you that Ronnie's problem is different (probably completely unrelated to bpfilter_umh IMO) and debugging at the shutdown breakpoint would be helpful.
Ronnie, first steps are described in https://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c11
In my experience, I would also suggest:
- enabling the magic sysrq key (temporarily),
- setting `[Manager] ShutdownWatchdogSec=0` in /run/systemd/system.conf.d/99-shutdown-watchdog.conf,
- not using the 'rd.debug' option,
- trying to run commands, even if typed chars are not displayed.
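For reference, the watchdog drop-in suggested in comment 61 could be created with something like the following. This is a minimal sketch, not a tested procedure: the helper name `write_dropin` is hypothetical, writing under /run requires root, and systemd still has to re-read its configuration afterwards, which is not covered here.

```python
from pathlib import Path

# Path and setting as given in comment 61; /run is a tmpfs, so the
# drop-in disappears on the next boot.
DROPIN = Path("/run/systemd/system.conf.d/99-shutdown-watchdog.conf")
CONTENT = "[Manager]\nShutdownWatchdogSec=0\n"


def write_dropin(path=DROPIN, content=CONTENT):
    """Write the drop-in file, creating the directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return path


if __name__ == "__main__":
    # Needs root privileges when writing to the default location.
    print(write_dropin())
```

Because the file lives under /run, it fits the "temporarily" spirit of the other suggestions: nothing has to be cleaned up after the debugging session.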
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c62 --- Comment #62 from Martin Wilck <martin.wilck@suse.com> --- (In reply to - - from comment #61)
After a little research I found that it can be anything that calls `(set|get)sockopt`.
Thanks for digging that up. More precisely, getsockopt() or setsockopt() with level == IPPROTO_IP and optname supported by bpfilter (https://elixir.bootlin.com/linux/v5.0-rc4/source/include/uapi/linux/bpfilter...) automagically loads the bpfilter module and starts the UMH. This happens when iptables initializes itself: it calls getsockopt(IPPROTO_IP, IPT_SO_GET_INFO), and IPT_SO_GET_INFO has the same value as BPFILTER_IPT_SO_GET_INFO (=64).
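The condition described in comment 62 can be modelled in a few lines. This is only an illustration, not kernel code: the function name `reaches_bpfilter` is made up, the value 64 comes from the comment above, and the remaining option ranges are an assumption recalled from the bpfilter uapi header that should be verified against the header linked above. It also only applies to kernels built with bpfilter support.

```python
import socket

# Stated in comment 62: both constants have the value 64.
IPT_SO_GET_INFO = 64
BPFILTER_IPT_SO_GET_INFO = 64

# Assumption (check the uapi header): option numbers claimed by bpfilter.
BPFILTER_GET_OPTS = range(64, 68)  # GET_INFO .. GET_REVISION_TARGET
BPFILTER_SET_OPTS = range(64, 66)  # SET_REPLACE, SET_ADD_COUNTERS


def reaches_bpfilter(level, optname, is_set=False):
    """Model whether a (get|set)sockopt call would be routed to bpfilter."""
    if level != socket.IPPROTO_IP:
        return False
    opts = BPFILTER_SET_OPTS if is_set else BPFILTER_GET_OPTS
    return optname in opts


if __name__ == "__main__":
    # The call iptables makes on initialization, per comment 62.
    print(reaches_bpfilter(socket.IPPROTO_IP, IPT_SO_GET_INFO))
```

The point of the model is that any program issuing such a call, not just a firewall service, would be enough to start the helper, which fits bpfilter_umh showing up with firewalld disabled.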
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 http://bugzilla.opensuse.org/show_bug.cgi?id=1080485#c63 filippos Filippos <filippos@filippides.eu> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |filippos@filippides.eu --- Comment #63 from filippos Filippos <filippos@filippides.eu> --- Same here until today
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 fons jongh <fons.dejongh@microfocus.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Blocks| |1172006
http://bugzilla.opensuse.org/show_bug.cgi?id=1080485 Björn Voigt <bjoernv@arcor.de> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bjoernv@arcor.de