[Bug 1072492] New: LVM2 metadata and Device-mapper event daemon do not stop properly on shutdown; SystemD killing them causes Data Integrity issues
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492 Bug ID: 1072492 Summary: LVM2 metadata and Device-mapper event daemon do not stop properly on shutdown; SystemD killing them causes Data Integrity issues Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: x86-64 OS: Other Status: NEW Severity: Major Priority: P5 - None Component: Basesystem Assignee: bnc-team-screening@forge.provo.novell.com Reporter: bebl@mageta.org QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0 Build Identifier: I am running openSuse Tumbleweed (VERSION="20171203"). Every time I shutdown or reboot my system it hangs on the LVM2 metadata and Device-mapper event daemon. It takes the default systemd 1:30m timeout and then systemd kills them with signal 9. I can reproduce this on 2 independent systems, where I use the same hard-disk layout (my guess is, it is because of the partition/hard-disk layout). This is the schematics of the layout I use on both systems (I created this during installation purely with yast!): /dev/sdd │596.17 GiB│ WDC WD64-00AAKS-00A7B2│ │ │ │ /dev/sdd1 │164.00 MiB│ EFI boot │FAT │EFI │/boot/efi │ /dev/sdd2 │ 2.00 GiB│ Linux native │Ext2 │boot │/boot │ /dev/sdd3 │594.01 GiB│ Linux LVM │ │ │ │ /dev/openSuse │594.01 GiB│ LVM2 openSuse │ │ │ │ <== using /dev/sdd3 as sole PV /dev/openSuse/pool│593.00 GiB│ LV │ │ │ │ /dev/openSuse/root│128.00 GiB│ LV │XFS │root │/ │ <== using the pool /dev/openSuse/pool /dev/openSuse/swap│ 24.00 GiB│ LV │Swap │swap │swap │ <== using the pool /dev/openSuse/pool On the system I am writing this on, this has not yet caused big troubles, other than that the shutdown takes long; on the other system I experienced data-loss - both the swap volume and the XFS root got corrupted, while the EFI and Boot partition stayed in good shape. So this seems to be a Data Integrity issue here, hence why the report and the severity. I captured systemd's debug output during shutdown and will attach that. Otherwise I am not sure what you need. Reproducible: Always -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c1
--- Comment #1 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c2
--- Comment #2 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c3
--- Comment #3 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c4
--- Comment #4 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c10
--- Comment #10 from Benjamin Block
If it's not hard for you, can you try to enable/start lvm2-monitor.service to see if it helps? I notice we disabled the service by default, could be irrelevant:
``` # systemctl status lvm2-monitor ● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; disabled; vendor preset: disabled) ```
That did in fact amend the hang on boot. (In reply to zhen ren from comment #7)
Comment on attachment 752659 [details] Log during shutdown
Thanks for reporting. Little busy with other issues, not inspect the logs carefully yet. I will try to reproduce locally soon.
Asking myself: why these DM device are busy?
``` [ 250.780540] systemd-shutdown[1]: Could not detach DM /dev/dm-2: Device or resource busy [ 250.780971] systemd-shutdown[1]: Detaching DM 254:1. [ 250.781394] systemd-shutdown[1]: Could not detach DM /dev/dm-1: Device or resource busy [ 250.781822] systemd-shutdown[1]: Detaching DM 254:0. [ 250.782249] systemd-shutdown[1]: Could not detach DM /dev/dm-0: Device or resource busy [ 250.782679] systemd-shutdown[1]: Not all DM devices detached, 4 left. [ 250.783094] systemd-shutdown[1]: Cannot finalize remaining DM devices, continuing. ```
That however did not go away. I am not sure whether that is relevant though. Before the last reboot I took a snapshot on what devices where what: ``` bblock@austri:~> lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 232.9G 0 disk ├─sda1 8:1 0 450M 0 part ├─sda2 8:2 0 100M 0 part ├─sda3 8:3 0 16M 0 part ├─sda4 8:4 0 212.1G 0 part └─sda5 8:5 0 855M 0 part sdb 8:16 0 2.7T 0 disk ├─sdb1 8:17 0 128M 0 part └─sdb2 8:18 0 2.7T 0 part sdc 8:32 0 2.7T 0 disk ├─sdc1 8:33 0 128M 0 part └─sdc2 8:34 0 2.7T 0 part sdd 8:48 0 596.2G 0 disk ├─sdd1 8:49 0 164M 0 part /boot/efi ├─sdd2 8:50 0 2G 0 part /boot └─sdd3 8:51 0 594G 0 part ├─openSuse-pool_tmeta 254:0 0 76M 0 lvm │ └─openSuse-pool-tpool 254:2 0 593G 0 lvm │ ├─openSuse-swap 254:3 0 24G 0 lvm [SWAP] │ ├─openSuse-root 254:4 0 128G 0 lvm / │ └─openSuse-pool 254:5 0 593G 0 lvm └─openSuse-pool_tdata 254:1 0 593G 0 lvm └─openSuse-pool-tpool 254:2 0 593G 0 lvm ├─openSuse-swap 254:3 0 24G 0 lvm [SWAP] ├─openSuse-root 254:4 0 128G 0 lvm / └─openSuse-pool 254:5 0 593G 0 lvm sr0 11:0 1 1024M 0 rom bblock@austri:~> ll /dev/mapper/ total 0 crw------- 1 root root 10, 236 Dec 14 22:00 control lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool -> ../dm-5 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tdata -> ../dm-1 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tmeta -> ../dm-0 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool-tpool -> ../dm-2 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-root -> ../dm-4 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-swap -> ../dm-3 ``` I will attach the shutdown log and journal again for the reboot after I took this data. If you wanna look into that. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c11
--- Comment #11 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c12
--- Comment #12 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c13
Franck Bui
lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tdata -> ../dm-1 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tmeta -> ../dm-0 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool-tpool -> ../dm-2
Don't know what they are for but apparently they're still in used and at this point this is probably not expected. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c14
--- Comment #14 from zhen ren
(In reply to Benjamin Block from comment #10)
lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tdata -> ../dm-1 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tmeta -> ../dm-0 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool-tpool -> ../dm-2
Don't know what they are for but apparently they're still in used and at this point this is probably not expected.
Hi Bui, yes, but I see the error message in comment#9 is more suspicious, could you take a look? thanks. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c15
--- Comment #15 from Franck Bui
@Franck, could you can help interpret this error message?
``` grep -n "Failed to send unit remove signal" shutdown-journal.txt | egrep "dev-mapper-openSuse|dm-event" 5507:Dec 10 17:05:19 austri systemd[1]: dev-mapper-openSuse\x2dpool_tdata.device: Failed to send unit remove signal for dev-mapper-openSuse\x2dpool_tdata.device: Transport endpoint is not connected 5542:Dec 10 17:05:19 austri systemd[1]: dev-mapper-openSuse\x2droot.device: Failed to send unit remove signal for dev-mapper-openSuse\x2droot.device:
Those *debug* messages only indicates that systemd failed to emit signals on D-Bus bus probably because dbus has been stopped. These happen every times the system is shutdown so nothing suspicious here. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c16
--- Comment #16 from Benjamin Block
(In reply to Franck Bui from comment #13)
(In reply to Benjamin Block from comment #10)
lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tdata -> ../dm-1 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tmeta -> ../dm-0 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool-tpool -> ../dm-2
Don't know what they are for but apparently they're still in used and at this point this is probably not expected.
Hi Bui, yes, but I see the error message in comment#9 is more suspicious, could you take a look? thanks.
Any idea how I could find out what is having references on them open? I have a script running collecting the logs I attach here short before shutdown. I could run something and write our information about the use. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c17
--- Comment #17 from zhen ren
Those *debug* messages only indicates that systemd failed to emit signals on D-Bus bus probably because dbus has been stopped.
These happen every times the system is shutdown so nothing suspicious here.
Hi Franck, thanks for your confirmation. So probably, the reason why there's a lot such error message in the log is because the verbose knob for systemd was enabled. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c18
--- Comment #18 from zhen ren
(In reply to zhen ren from comment #14)
(In reply to Franck Bui from comment #13)
(In reply to Benjamin Block from comment #10)
lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tdata -> ../dm-1 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool_tmeta -> ../dm-0 lrwxrwxrwx 1 root root 7 Dec 14 22:00 openSuse-pool-tpool -> ../dm-2
Don't know what they are for but apparently they're still in used and at this point this is probably not expected.
Hi Bui, yes, but I see the error message in comment#9 is more suspicious, could you take a look? thanks.
Any idea how I could find out what is having references on them open? I have a script running collecting the logs I attach here short before shutdown. I could run something and write our information about the use.
Usually, you can use `lsof /dev/dm-1` to list who is using the device, but it fails in my testing: ws:~ # lsof /dev/dm-0 lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. I think you can try to use the mountpoint of dm-* instead like: #lsof / ... -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c19
--- Comment #19 from zhen ren
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c20
--- Comment #20 from zhen ren
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c30
--- Comment #30 from Benjamin Block
Hi Benjamin,
According to my testing, the following commands solve my problem:
systemctl enable lvm2-monitor.service systemctl start lvm2-monitor.service
could you please update to the latest release and have a try?
Sry I don't have access to the system right now. I can retest next week. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c31
--- Comment #31 from zhen ren
(In reply to zhen ren from comment #22)
According to my testing, the following commands solve my problem:
systemctl enable lvm2-monitor.service systemctl start lvm2-monitor.service
could you please update to the latest release and have a try?
Any idea *why* this is so? enabling lvm2-monitor.service by default is nothing more than a workaround until we understand why it changes the shutdown behavior.
Hi Martin, By default, when a thin/snapshot/raid LV is created, dm-event.service starts dmeventd daemon for monitoring, which creates a monitoring thread for the LV: """ lvm2/daemons/dmeventd/dmeventd.c: 1570 static int _handle_request(struct dm_event_daemon_message *msg, 1571 struct message_data *message_data) 1572 { 1573 switch (msg->cmd) { 1574 case DM_EVENT_CMD_REGISTER_FOR_EVENT: 1575 if (!message_data->events_field) 1576 return -EINVAL; 1577 return _register_for_event(message_data); """ When dm-event.service is going be stopped, dmeventd will cleanup the monitoring threads until it can exit: """ lvm2/daemons/dmeventd/dmeventd.c: 2147 int main(int argc, char *argv[]) 2148 { ... 2251 for (;;) { ... 2272 } else if (_exit_now == DM_SIGNALED_EXIT) { 2273 _exit_now = DM_SCHEDULED_EXIT; 2274 /* 2275 * When '_exit_now' is set, signal has been received, 2276 * but can not simply exit unless all 2277 * threads are done processing. 2278 */ 2279 log_info("dmeventd received break, scheduling exit."); 2280 } 2281 _process_request(&fifos); 2282 _cleanup_unused_threads(); 2283 } """ And, lvm2-monitor.service is used to stop the monitoring: """ In /usr/lib/systemd/system/lvm2-monitor.service: ... 13 ExecStart=/sbin/lvm vgchange --monitor y --ignoreskippedcluster 14 ExecStop=/sbin/lvm vgchange --monitor n --ignoreskippedcluster """ So, lvm2-monitor.service should be stopped first before stopping dm-event.service, otherwise dm-event.service stopping will hang. When lvm2-monitor.service is disabled, nobody to stop the monitoring of those snapshot/thin/raid/mirror LV. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c32
--- Comment #32 from Benjamin Block
Hi Benjamin,
According to my testing, the following commands solve my problem:
systemctl enable lvm2-monitor.service systemctl start lvm2-monitor.service
could you please update to the latest release and have a try?
I am at home again and retested stuff after installing updates on the installation. Seems to work fine now. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c33
--- Comment #33 from Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c36
Benjamin Block
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c37
--- Comment #37 from Martin Wilck
When lvm2-monitor.service is disabled, nobody to stop the monitoring of those snapshot/thin/raid/mirror LV.
Allright. But: # systemctl status lvm2-monitor.service lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; disabled; vendor preset: disabled) "vendor preset" should be "enabled" if this service is so vital. It can't hurt much AFAICS. This would be the bug, then. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492
http://bugzilla.opensuse.org/show_bug.cgi?id=1072492#c38
--- Comment #38 from zhen ren
"vendor preset" should be "enabled" if this service is so vital. It can't hurt much AFAICS. This would be the bug, then.
Yes, it's already done in this PR: https://build.opensuse.org/request/show/560488 But, not sure why OBS notification hasn't been received in this bug. -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com