Bug ID 918507
Summary System services unable to start or stop, no reboot possible, zombied sshd processes
Classification openSUSE
Product openSUSE 13.1
Version Final
Hardware x86-64
OS openSUSE 13.1
Status NEW
Severity Critical
Priority P5 - None
Component Basesystem
Assignee bnc-team-screening@forge.provo.novell.com
Reporter paul.pech@gmx.de
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

Hi,

we're running 10 openSUSE 13.1 servers, all fully patched. These servers have
been running for roughly a year now and no configuration files were changed
recently.

After the update from last Monday:

--
The following NEW patch is going to be installed:
  openSUSE-2015-149 

The following 6 packages are going to be upgraded:
  libgudev-1_0-0 libudev1 systemd systemd-32bit systemd-sysvinit udev 
--

Two of our servers have been rebooted since. These two servers now show very
peculiar behavior: about 12-16 hours after a reboot, all services (apache,
mysql, ...) are still running normally, but issuing the following commands fails:

<code>
 service apache2 status
 service apache2 stop
 service sshd status
 service mysql stop
</code>

Output is:

No such service/target!?

Trying to reboot or shutdown also fails. Only things like

<code>
 echo 1 > /proc/sys/kernel/sysrq 
 echo b > /proc/sysrq-trigger
</code>

work. After the servers come back, all of the above-mentioned commands work
fine again, and the startup log is clean (as far as I can tell). About 12-16
hours later, the problem described above shows up again.

All services are running fine, but they can't be stopped and their status can't
be queried. The one exception is sshd: there are a couple of sshd processes
that are zombies (defunct).
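
For anyone trying to confirm the same symptom, the defunct sshd processes can
be listed with a filter along these lines. This is only a sketch: the
`list_zombies` helper name is made up for illustration, and it assumes the
procps `ps` that ships with openSUSE.

```shell
# List PID and PPID of zombie (state Z) processes whose command name is $1.
# Expects "PID PPID STAT COMMAND" lines on stdin, as produced by:
#   ps -e -o pid= -o ppid= -o stat= -o comm=
list_zombies() {
    awk -v name="$1" '$3 ~ /^Z/ && $4 == name { print $1, $2 }'
}

# Usage (run as root on the affected server):
#   ps -e -o pid= -o ppid= -o stat= -o comm= | list_zombies sshd
```

A zombie keeps its process table entry until its parent reaps it, so the PPID
column shows which process is failing to do so.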

This is in /var/log/messages when the servers are in this state and an ssh
login occurs:

<code>
2015-02-18T20:16:56.001419+01:00 fs1 sshd[4091]: Accepted
keyboard-interactive/pam for root from 217.251.***.*** port 48954 ssh2
2015-02-18T20:16:56.390724+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-c35.scope: Activation of org.freedesktop.systemd1
timed out org.freedesktop.DBus.Error.TimedOut
2015-02-18T20:17:09.928009+01:00 fs1 sshd[25667]: pam_systemd(sshd:session):
Failed to release session: Did not receive a reply. Possible causes include:
the remote application did not send a reply, the message bus security policy
blocked the reply, the reply timeout expired, or the network connection was
broken.
2015-02-18T20:17:09.931181+01:00 fs1 systemd-cgroups-agent[4096]: Failed to get
D-Bus connection: Failed to connect to socket /run/systemd/private: Connection
refused
2015-02-18T20:17:21.028887+01:00 fs1 sshd[4091]: pam_systemd(sshd:session):
Failed to create session: Did not receive a reply. Possible causes include: the
remote application did not send a reply, the message bus security policy
blocked the reply, the reply timeout expired, or the network connection was
broken.
2015-02-18T20:17:21.391049+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-c36.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:17:46.391342+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-c37.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
</code>

Also, we have a couple of cron jobs that run every few minutes. This shows up
in the log files:

<code>
2015-02-18T20:22:26.932169+01:00 fs1 /USR/SBIN/CRON[4231]: (root) CMD
(/etc/ha.d/mysql_watcher3.php)
2015-02-18T20:22:26.932662+01:00 fs1 /USR/SBIN/CRON[4232]: (root) CMD
(/etc/health/healthd.sh)
2015-02-18T20:22:26.933064+01:00 fs1 /USR/SBIN/CRON[4233]: (root) CMD
(/etc/ha.d/watch_messages.php)
2015-02-18T20:22:31.308509+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-3387.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:22:56.308636+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-c42.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:23:21.309211+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-c43.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.971793+01:00 fs1 systemd-logind[611]: Failed to start
session scope session-3391.scope: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.972637+01:00 fs1 /usr/sbin/cron[4243]:
pam_systemd(crond:session): Failed to create session: Input/output error
</code>

If I log in via ssh (which still works, even while the problems described above
are "active") and try to get the list of currently logged-in users like this

<code>
 systemd-loginctl list-sessions
</code>

this usually works the first time (showing only my session; no other sessions
should be present) but stops working after 5 minutes of being logged in. Then
it just hangs until killed with ^C.


Also the log file is cluttered with messages like this:

<code>
2015-02-18T15:01:01.181226+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:04:01.267349+01:00 fs1 systemd-logind[611]: message repeated 6
times: [ Failed to store session release timer fd]
2015-02-18T15:05:01.520119+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:06:01.283288+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:10:01.479541+01:00 fs1 systemd-logind[611]: message repeated 9
times: [ Failed to store session release timer fd]
2015-02-18T15:10:02.064196+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:11:01.552170+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:14:01.702136+01:00 fs1 systemd-logind[611]: message repeated 6
times: [ Failed to store session release timer fd]
2015-02-18T15:15:01.609761+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
2015-02-18T15:15:01.666773+01:00 fs1 systemd-logind[611]: Failed to store
session release timer fd
</code>
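
To get a rough idea of how often this message appears, the entries can be
tallied per hour. The following is a sketch, not something that was run as part
of this report: the `count_timer_fd_errors` name is hypothetical, and it
assumes that each entry sits on a single line in /var/log/messages (the
wrapping above comes from the mail) and that repeats are collapsed into
"message repeated N times" lines as shown.

```shell
# Count "Failed to store session release timer fd" events per hour.
# Reads syslog lines on stdin; a "message repeated N times" line counts as N.
count_timer_fd_errors() {
    awk '
    /Failed to store session release timer fd/ {
        hour = substr($1, 1, 13)    # timestamp truncated to YYYY-MM-DDTHH
        n = 1
        if (match($0, /message repeated [0-9]+ times/)) {
            split(substr($0, RSTART, RLENGTH), parts, " ")
            n = parts[3]            # the repeat count N
        }
        count[hour] += n
    }
    END { for (h in count) print h, count[h] }
    ' | sort
}

# Usage: count_timer_fd_errors < /var/log/messages
```

Comparing the hourly totals against the time of the last reboot may help show
whether the messages start before or after the services stop responding.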


There is plenty of space left on all hard disks. Here's the output of

<code>
 cat /proc/mdstat

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[4](S)
      16779136 blocks super 1.0 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdd2[4](S) sdc2[3] sdb2[1] sda2[0]
      471606080 blocks super 1.0 [3/3] [UUU]
      bitmap: 1/4 pages [4KB], 65536KB chunk

unused devices: <none>
</code>


I really do need help with this; any input is greatly appreciated.


Yours


Paul

