[Bug 960118] New: corosync: service randomly fails to start with a faulty redundant channel
http://bugzilla.suse.com/show_bug.cgi?id=960118

            Bug ID: 960118
           Summary: corosync: service randomly fails to start with a faulty redundant channel
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 42.1
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: High Availability
          Assignee: lmb@suse.com
          Reporter: zzhou@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 660241
  --> http://bugzilla.suse.com/attachment.cgi?id=660241&action=edit
journalctl

The HA system runs in virtual machines and is configured with a redundant channel for corosync. One corosync channel is intentionally blocked from the host system:

OBS131-x220:/work/images # brctl delif virbr0 vnet2

In this case, all NIC instances inside the VM remain active. The corosync service then randomly fails to start, as below.

Leap421-01:~ # systemctl stop pacemaker
Leap421-01:~ # systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xn' for details.
Leap421-01:~ # systemctl start pacemaker
Leap421-01:~ # systemctl stop pacemaker
Leap421-01:~ # systemctl start pacemaker
Leap421-01:~ # systemctl stop pacemaker
Leap421-01:~ # systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xn' for details.
Leap421-01:~ # systemctl start pacemaker
Leap421-01:~ # systemctl stop pacemaker
Leap421-01:~ # systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xn' for details.
Leap421-01:~ # systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xn' for details.
Leap421-01:~ # systemctl start pacemaker
Leap421-01:~ # systemctl stop pacemaker
Leap421-01:~ # systemctl start pacemaker
A dependency job for pacemaker.service failed. See 'journalctl -xn' for details.
Leap421-01:~ # journalctl -xn
-- Logs begin at Thu 2015-11-05 16:53:16 CST, end at Tue 2015-12-22 21:01:39 CST. --
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [QB    ] withdrawing server sockets
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [SERV  ] Service engine unloaded: corosync configuration service
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [QB    ] withdrawing server sockets
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [QB    ] withdrawing server sockets
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [SERV  ] Service engine unloaded: corosync profile loading service
Dec 22 21:01:39 Leap421-01 corosync[8851]:  [MAIN  ] Corosync Cluster Engine exiting normally
Dec 22 21:01:39 Leap421-01 systemd[1]: Failed to start Corosync Cluster Engine.
-- Subject: Unit corosync.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit corosync.service has failed.
--
-- The result is failed.
Dec 22 21:01:39 Leap421-01 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
-- Subject: Unit pacemaker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pacemaker.service has failed.
--
-- The result is dependency.
http://bugzilla.suse.com/show_bug.cgi?id=960118#c1

--- Comment #1 from Roger Zhou <zzhou@suse.com> ---
Created attachment 660242
  --> http://bugzilla.suse.com/attachment.cgi?id=660242&action=edit
corosync.conf

journalctl and corosync.conf are attached.
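[Editor's note] The actual corosync.conf exists only as the attachment and is not inlined in this thread. As a rough, illustrative sketch only (rrp_mode, networks, addresses, and ports below are placeholders, not the reporter's settings), a redundant-channel totem section in corosync 2.x combines two interface blocks with an rrp_mode:

totem {
        version: 2
        rrp_mode: passive                  # enables the redundant ring; "active" is the other option
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.1.0   # placeholder network for ring 0
                mcastaddr: 239.255.1.1     # placeholder multicast group
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.2.0   # placeholder network for ring 1
                mcastaddr: 239.255.2.1     # placeholder multicast group
                mcastport: 5407
        }
}

With two rings configured this way, losing one network should, in principle, leave the cluster running on the remaining ring, which is why the start failures reported above are unexpected.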
http://bugzilla.suse.com/show_bug.cgi?id=960118

Roger Zhou <zzhou@suse.com> changed:

           What      |Removed       |Added
   ----------------------------------------------------------------------------
           Assignee  |lmb@suse.com  |bliu@suse.com
http://bugzilla.suse.com/show_bug.cgi?id=960118#c2

--- Comment #2 from Bin Liu <bliu@suse.com> ---
(In reply to Roger Zhou from comment #1)
> Created attachment 660242 [details] corosync.conf
> journalctl and corosync.conf are attached.
Using the command "brctl delif virbr0 vnet2", the VM still has its IP address, and the link status reported by "ethtool" is still "yes". We need to check how corosync determines whether an interface is available.
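[Editor's note] A purely illustrative transcript of the two checks mentioned above (interface name, address, and output are hypothetical placeholders, not captured from the reporter's machines): inside the guest, both still look healthy after the host-side brctl delif:

Leap421-01:~ # ip addr show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
    inet 192.168.2.101/24 scope global eth1
Leap421-01:~ # ethtool eth1 | grep 'Link detected'
        Link detected: yes

Neither the address nor the link flag reflects that the host has detached the corresponding tap device from its bridge.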
http://bugzilla.suse.com/show_bug.cgi?id=960118#c3

--- Comment #3 from Bin Liu <bliu@suse.com> ---
(In reply to Roger Zhou from comment #1)
> Created attachment 660242 [details] corosync.conf
> journalctl and corosync.conf are attached.
Is "brctl delif virbr0 vnet2" before start pacemaker, or after pacemaker started? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=960118#c4

--- Comment #4 from Roger Zhou <zzhou@suse.com> ---
(In reply to Bin Liu from comment #3)
> Is "brctl delif virbr0 vnet2" executed before pacemaker is started, or after
> pacemaker has started?
It is executed outside the VM, on the host system, before pacemaker is restarted inside the VM.
http://bugzilla.suse.com/show_bug.cgi?id=960118#c5

--- Comment #5 from Roger Zhou <zzhou@suse.com> ---
After more testing: the problem scenario occurs when the redundant link of the DC is down, and the other node then cannot rejoin the cluster. That means there is a chance of putting this two-node cluster at risk, so this bug needs to be fixed.

How often the problem reproduces depends on the operation used:

Reproduce approach 1, 50% chance:
# systemctl stop pacemaker
# systemctl start pacemaker

Reproduce approach 2, <20% chance:
# reboot

Reproduce approach 3, <10% chance:
# systemctl restart pacemaker
http://bugzilla.suse.com/show_bug.cgi?id=960118#c6

--- Comment #6 from Roger Zhou <zzhou@suse.com> ---
Created attachment 660332
  --> http://bugzilla.suse.com/attachment.cgi?id=660332&action=edit
crm report
http://bugzilla.suse.com/show_bug.cgi?id=960118#c7

--- Comment #7 from Bin Liu <bliu@suse.com> ---
In totemudp.c I found the function that builds the sockets:

/*
 * If the interface is up, the sockets for totem are built. If the interface is down
 * this function is requeued in the timer list to retry building the sockets later.
 */
static void timer_function_netif_check_timeout

If you execute "brctl delif virbr0 vnet2", the network state is still up (corosync gets the link state by calling getifaddrs), so the VM kernel still thinks the network is up. I still need to confirm how corosync selects an interface to run on.
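[Editor's note] To illustrate the mechanism described above, here is a minimal standalone sketch, not corosync's actual code, of an interface-up check based on getifaddrs(); the interface name "eth1" is just a placeholder, and corosync's real check in totemudp.c works from the configured ring address rather than an interface name. Run inside the guest, a check like this keeps reporting the interface as up after the host-side brctl delif, because detaching the tap device from the host bridge does not change the guest kernel's link flags.

/* ifcheck.c - sketch: report whether a named interface is up according to
 * getifaddrs(), similar in spirit to corosync's netif check.
 * Build: gcc -o ifcheck ifcheck.c ; run: ./ifcheck eth1
 */
#include <stdio.h>
#include <string.h>
#include <ifaddrs.h>
#include <net/if.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "eth1";  /* placeholder default */
    struct ifaddrs *ifap, *ifa;

    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (strcmp(ifa->ifa_name, name) != 0)
            continue;
        /* IFF_UP / IFF_RUNNING come from the guest kernel's view of the device;
         * removing the tap from the host bridge does not clear them. */
        printf("%s: UP=%d RUNNING=%d\n", name,
               (ifa->ifa_flags & IFF_UP) != 0,
               (ifa->ifa_flags & IFF_RUNNING) != 0);
        break;
    }
    freeifaddrs(ifap);
    return 0;
}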
http://bugzilla.suse.com/show_bug.cgi?id=960118#c8

--- Comment #8 from Bin Liu <bliu@suse.com> ---
Another observation: using the following script I started the corosync binary directly more than 60 times without a single error, while starting it through systemctl can trigger the issue:

#! /usr/bin/bash
while [ 1 -eq 1 ]
do
    corosync -f &
    rtn=$?
    echo "corosync -f & returned $rtn"
    if [ $rtn -ne 0 ]; then
        exit $rtn
    fi
    sleep 5
    corosync-cfgtool -s
    sleep 5
    killall -SIGTERM corosync
    #rm -rf /var/run/corosync.pid
    #killall -9 corosync
    rtn=$?
    echo "killall -SIGTERM corosync returned $rtn"
    if [ $rtn -ne 0 ]; then
        exit $rtn
    fi
    sleep 10
done

And in the log I found the following records:

Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [MAIN  ] main.c:242 Node was shut down by a signal
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29618]: Starting Corosync Cluster Engine (corosync): [FAILED]
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:373 Unloading all Corosync service engines.
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipc_setup.c:452 withdrawing server sockets
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipcs.c:229 qb_ipcs_unref() - destroying
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync vote quorum service v1.0
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipc_setup.c:452 withdrawing server sockets
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipcs.c:229 qb_ipcs_unref() - destroying
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync configuration map access
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipc_setup.c:452 withdrawing server sockets
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipcs.c:229 qb_ipcs_unref() - destroying
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync configuration service
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipc_setup.c:452 withdrawing server sockets
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipcs.c:229 qb_ipcs_unref() - destroying
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync cluster closed process group service v1.01
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipc_setup.c:452 withdrawing server sockets
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [QB    ] ipcs.c:229 qb_ipcs_unref() - destroying
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync cluster quorum service v0.1
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [SERV  ] service.c:240 Service engine unloaded: corosync profile loading service
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [TOTEM ] totemsrp.c:3325 sending join/leave message
Jan 13 13:52:16 pacemaker-cts-c3 corosync[29627]:  [MAIN  ] util.c:131 Corosync Cluster Engine exiting normally
Jan 13 13:52:16 pacemaker-cts-c3 systemd[1]: Dependency failed for Pacemaker High Availability Cluster Manager.
http://bugzilla.suse.com/show_bug.cgi?id=960118#c9

Tomáš Chvátal <tchvatal@suse.com> changed:

           What        |Removed   |Added
   ----------------------------------------------------------------------------
           Status      |NEW       |RESOLVED
           Resolution  |---       |WONTFIX

--- Comment #9 from Tomáš Chvátal <tchvatal@suse.com> ---
This is an automated batch Bugzilla cleanup.

openSUSE 42.1 has reached end-of-life (EOL [1]) status. As such it is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of openSUSE, or you can still observe it under openSUSE Leap 15.0, please feel free to reopen this bug against that version (see the "Version" field of the bug), or alternatively open a new ticket.

Thank you for reporting this bug, and we are sorry it could not be fixed during the lifetime of the release.

[1] https://en.opensuse.org/Lifetime