[Bug 918507] New: System services unable to start or stop, no reboot possible, zombied sshd processes
http://bugzilla.novell.com/show_bug.cgi?id=918507

Bug ID: 918507
Summary: System services unable to start or stop, no reboot possible, zombied sshd processes
Classification: openSUSE
Product: openSUSE 13.1
Version: Final
Hardware: x86-64
OS: openSUSE 13.1
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: paul.pech@gmx.de
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Hi,

we're running 10 openSUSE 13.1 servers, all fully patched. These servers have been running for roughly a year now and no configuration files were changed recently.

After the update from last Monday:

--
The following NEW patch is going to be installed:
openSUSE-2015-149

The following 6 packages are going to be upgraded:
libgudev-1_0-0 libudev1 systemd systemd-32bit systemd-sysvinit udev
--

Two of our servers have been rebooted. These two servers now show a very peculiar behavior. Every 12-16 hours they reach a state in which all services (apache, mysql, ...) keep running normally, but issuing the following commands fails:

<code>
service apache2 status
service apache2 stop
service sshd status
service mysql stop
</code>

Output is: No such service/target!?

Trying to reboot or shut down also fails. Only things like

<code>
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
</code>

work.

After the servers come back, all of the above-mentioned commands work fine again and the start log is clean (as far as I can tell). After about 12-16 hours the problem described above shows up again. All services are running fine, but they can't be stopped and their status can't be queried. The one exception is sshd: a couple of sshd processes are zombies (defunct).

This is in /var/log/messages when the servers are in this state and an ssh login occurs:

<code>
2015-02-18T20:16:56.001419+01:00 fs1 sshd[4091]: Accepted keyboard-interactive/pam for root from 217.251.***.*** port 48954 ssh2
2015-02-18T20:16:56.390724+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c35.scope: Activation of org.freedesktop.systemd1 timed out org.freedesktop.DBus.Error.TimedOut
2015-02-18T20:17:09.928009+01:00 fs1 sshd[25667]: pam_systemd(sshd:session): Failed to release session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T20:17:09.931181+01:00 fs1 systemd-cgroups-agent[4096]: Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
2015-02-18T20:17:21.028887+01:00 fs1 sshd[4091]: pam_systemd(sshd:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T20:17:21.391049+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c36.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:17:46.391342+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c37.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
</code>

Also, we have a couple of cron jobs that run every few minutes.
This shows up in the log files:

<code>
2015-02-18T20:22:26.932169+01:00 fs1 /USR/SBIN/CRON[4231]: (root) CMD (/etc/ha.d/mysql_watcher3.php)
2015-02-18T20:22:26.932662+01:00 fs1 /USR/SBIN/CRON[4232]: (root) CMD (/etc/health/healthd.sh)
2015-02-18T20:22:26.933064+01:00 fs1 /USR/SBIN/CRON[4233]: (root) CMD (/etc/ha.d/watch_messages.php)
2015-02-18T20:22:31.308509+01:00 fs1 systemd-logind[611]: Failed to start session scope session-3387.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:22:56.308636+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c42.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:23:21.309211+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c43.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.971793+01:00 fs1 systemd-logind[611]: Failed to start session scope session-3391.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.972637+01:00 fs1 /usr/sbin/cron[4243]: pam_systemd(crond:session): Failed to create session: Input/output error
</code>

If I log in via ssh (which still works, even while the problems described above are "active") and try to get the list of currently logged-in users like this

<code>
systemd-loginctl list-sessions
</code>

this usually works the first time (showing only my session; no other sessions should be present) but stops working after 5 minutes of being logged in. Then it just hangs until killed with ^C.

Also, the log file is cluttered with messages like this:

<code>
2015-02-18T15:01:01.181226+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:04:01.267349+01:00 fs1 systemd-logind[611]: message repeated 6 times: [ Failed to store session release timer fd]
2015-02-18T15:05:01.520119+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:06:01.283288+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:10:01.479541+01:00 fs1 systemd-logind[611]: message repeated 9 times: [ Failed to store session release timer fd]
2015-02-18T15:10:02.064196+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:11:01.552170+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:14:01.702136+01:00 fs1 systemd-logind[611]: message repeated 6 times: [ Failed to store session release timer fd]
2015-02-18T15:15:01.609761+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:15:01.666773+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
</code>

There is plenty of space left on all hard disks.
Here's the output of

<code>
cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[4](S)
      16779136 blocks super 1.0 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdd2[4](S) sdc2[3] sdb2[1] sda2[0]
      471606080 blocks super 1.0 [3/3] [UUU]
      bitmap: 1/4 pages [4KB], 65536KB chunk

unused devices: <none>
</code>

I really do need help with this, any input is greatly appreciated.

Yours
Paul

--
You are receiving this mail because:
You are on the CC list for the bug.
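To get an overview of how often each of the failures quoted in this report occurs, a small script can tally them from /var/log/messages. The following Python sketch is only an illustration, assuming the syslog line layout seen in the excerpts above (ISO timestamp, host, program[pid], message); the function name `count_failures` and both regular expressions are hypothetical and not part of systemd or any openSUSE tool.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts in this report:
#   <ISO timestamp> <host> <program>[<pid>]: <message>
LINE_RE = re.compile(r"^(\S+) (\S+) ([^\s\[:]+)(?:\[(\d+)\])?: (.*)$")

# syslog collapses duplicates into "message repeated N times: [ <message>]"
REPEAT_RE = re.compile(r"^message repeated (\d+) times: \[\s*(.*)\]$")


def count_failures(lines):
    """Count 'Failed to ...' messages per (program, short message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip wrapped or otherwise unparseable lines
        _timestamp, _host, program, _pid, message = match.groups()

        # Expand collapsed duplicates so they count N times.
        weight = 1
        repeated = REPEAT_RE.match(message)
        if repeated:
            weight = int(repeated.group(1))
            message = repeated.group(2)

        start = message.find("Failed to")
        if start == -1:
            continue  # not a failure message

        # Keep the text up to the next colon and mask session names
        # so that session-c35.scope and session-c36.scope group together.
        short = message[start:].split(":", 1)[0].strip()
        short = re.sub(r"session-\w+\.scope", "session-*.scope", short)
        counts[(program, short)] += weight
    return counts
```

One possible use is `count_failures(open("/var/log/messages", errors="replace"))`, followed by printing `counts.most_common()`; a sudden rise in the systemd-logind entries would then mark roughly when a server entered the broken state.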
http://bugzilla.novell.com/show_bug.cgi?id=918507

Utto Huber <paul.pech@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P1 - Urgent
http://bugzilla.novell.com/show_bug.cgi?id=918507

--- Comment #2 from Utto Huber <paul.pech@gmx.de> ---
As Thomas wrote, this really seems to start with a segfault in systemd. This is our log content:

2015-02-18T08:16:43.892154+01:00 fs1-1 kernel: [404687.140461] systemd[1]: segfault at a8 ip 000000000047912e sp 00007fffd0db7110 error 4 in systemd[400000+ed000]
2015-02-18T08:16:44.352680+01:00 fs1-1 systemd[1]: Caught <SEGV>, dumped core as pid 19512.
2015-02-18T08:16:44.411008+01:00 fs1-1 systemd[1]: Freezing execution.
2015-02-18T08:16:01.955797+01:00 fs1-1 systemd-logind[12034]: message repeated 4 times: [ Failed to store session release timer fd]
2015-02-18T08:18:26.939222+01:00 fs1-1 systemd-logind[12034]: Failed to start unit user-0.slice: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T08:18:26.939762+01:00 fs1-1 systemd-logind[12034]: Failed to start user slice: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T08:18:26.940061+01:00 fs1-1 /usr/sbin/cron[19515]: pam_systemd(crond:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-16T23:31:56.333429+01:00 fs1-1 dbus[592]: message repeated 3 times: [ [system] Reloaded configuration]
2015-02-18T08:18:26.940421+01:00 fs1-1 dbus[592]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

Does anybody know whether reverting to systemd 208-23.3 will help? Next time our servers start misbehaving I'll give it a try and report back here.

Paul
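Since the hang appears to begin when PID 1 segfaults and freezes, it may help to search the log for exactly those lines to find when a server went into the broken state. The Python sketch below is a hedged illustration, assuming the three message texts from the log excerpt in this comment; the names `CRASH_MARKERS` and `find_pid1_crash` are made up for this example.

```python
import re

# Message fragments copied from the excerpt in this comment that
# indicate systemd (PID 1) crashing and then freezing.
CRASH_MARKERS = (
    re.compile(r"systemd\[1\]: segfault at "),
    re.compile(r"systemd\[1\]: Caught <SEGV>"),
    re.compile(r"systemd\[1\]: Freezing execution"),
)


def find_pid1_crash(lines):
    """Return (timestamp, line) pairs for lines matching a crash marker."""
    hits = []
    for line in lines:
        line = line.strip()
        if any(marker.search(line) for marker in CRASH_MARKERS):
            # The timestamp is the first space-separated field.
            timestamp = line.split(" ", 1)[0]
            hits.append((timestamp, line))
    return hits
```

Feeding it the lines of /var/log/messages and looking at the first hit gives the time of the crash, which can then be compared with the first D-Bus timeout reported by systemd-logind.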
http://bugzilla.novell.com/show_bug.cgi?id=918507

Markus Zimmermann <markus.zimmermann@nethead.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |markus.zimmermann@nethead.at