[Bug 958346] New: systemd hangs/dies randomly after weeks of runtime
http://bugzilla.opensuse.org/show_bug.cgi?id=958346

Bug ID: 958346
Summary: systemd hangs/dies randomly after weeks of runtime
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 42.1
Hardware: x86-64
OS: openSUSE 42.1
Status: NEW
Severity: Major
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: robin.roth@kit.edu
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Symptom:
Of our roughly 50 machines (identical setup, different hardware) running 42.1, about 10 are affected within 2 weeks of the last reboot. At some point all calls to systemd fail, with messages like:

    dbus[882]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out

There are no earlier log messages hinting at a problem in systemd.
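When watching many machines for this symptom, the dbus message quoted above can be matched mechanically. The following is a minimal Python sketch, assuming the log lines look like that single quoted example; the function name and the regular expression are illustrative and not part of any existing tool.

```python
import re

# Pattern for the dbus activation timeout quoted in the report.
# The exact line layout is an assumption based on that one example.
ACTIVATION_TIMEOUT = re.compile(
    r"dbus\[(?P<pid>\d+)\]: \[system\] Failed to activate service "
    r"'org\.freedesktop\.systemd1': timed out"
)


def find_systemd_timeouts(lines):
    """Return the log lines that report a timed-out activation of systemd1."""
    return [line for line in lines if ACTIVATION_TIMEOUT.search(line)]
```

Feeding it the lines of a system log would give the first point in time at which systemd stopped answering on a given machine.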
This seems to happen independently of how the machine is used. Systemd won't respond to anything after that. All systemctl calls fail, ''kill 1'' and ''kill -9 1'' don't work, and reboot/shutdown won't work either. Rebooting the machine (by other means) fixes the problem temporarily.

Do you have suggestions on how to debug this? So far we haven't found a way to trigger the issue, and waiting weeks with many machines potentially failing isn't a nice option.

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c1

--- Comment #1 from Robin Roth <robin.roth@kit.edu> ---
The behaviour looks similar to the one described in https://bugzilla.opensuse.org/show_bug.cgi?id=918226. daemon-reload and daemon-reexec also don't work (they time out).
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c2

Bernhard Wiedemann <bwiedemann@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bwiedemann@suse.com
           Assignee|bnc-team-screening@forge.pr |systemd-maintainers@suse.de
                   |ovo.novell.com              |

--- Comment #2 from Bernhard Wiedemann <bwiedemann@suse.com> ---
Here are some pointers:
https://en.opensuse.org/SDB:Systemd#Getting_debug_from_systemd
http://freedesktop.org/wiki/Software/systemd/Debugging/

Also interesting: are the 10 of 50 a truly random selection, or are they always from a smaller pool of machines? If the latter, it might be worth checking whether a pattern stands out (e.g. different hardware, setup or workload).
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c3

--- Comment #3 from Robin Roth <robin.roth@kit.edu> ---
Thanks for the hints. Since we started looking into the issue, we have been running log-level=debug and some custom monitoring on all machines, trying to catch one failing "in the act".

There is no pattern to the failing machines: desktops with different hardware and a server. We now also have a machine on 13.2 with the same symptoms.

One thing common to all machines is a custom service with an underscore in its name. CoreOS had a bug related to that, but as far as I understand, that code is not present in openSUSE (https://github.com/coreos/fleet/issues/579).
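The custom monitoring mentioned above is not described in the thread. One way such a check could look is a periodic probe that asks PID 1 a cheap question and records whether it answers in time. This is only a sketch under assumptions: the probe command `systemctl show --property=Version`, the 10-second timeout, and the function name are all chosen for illustration.

```python
import subprocess

# Hypothetical probe command: any cheap request that PID 1 has to answer.
DEFAULT_PROBE = ["systemctl", "show", "--property=Version"]


def probe(command=DEFAULT_PROBE, timeout=10.0):
    """Run a command and classify the result as 'ok', 'failed' or 'timeout'."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish within the allowed time.
        return "timeout"
    except OSError:
        # The command could not be started at all (e.g. not installed).
        return "failed"
    return "ok" if result.returncode == 0 else "failed"
```

Run from cron or a similar scheduler outside systemd, a result other than "ok" would mark the moment a machine entered the hung state, which narrows down the debug log to inspect.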
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c4

--- Comment #4 from Robin Roth <robin.roth@kit.edu> ---
Some machines have now failed with log-level=debug enabled. This also happens after upgrading to systemd 228, so I opened a bug report upstream: https://github.com/systemd/systemd/issues/2200. I put more details there.
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c5

Howard Guo <hguo@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hguo@suse.com

--- Comment #5 from Howard Guo <hguo@suse.com> ---
Hello Robin.

I have got similar failures more consistently on a different hardware platform. I have several servers running on KVM, and the ones with severely capped IO throughput can easily reproduce the issue:

1. Cap the IO throughput to about 5MB/s.
2. Enable a swap file (to increase demand for IO throughput).
3. Create heavy IO congestion by launching an IO- and memory-intensive operation. It must be small enough not to trigger OOM but large enough to evict almost all file cache. The system load climbs to 20 on a single-CPU system.
4. Issue a systemctl command, such as stopping a unit, while the above operation is in progress, and observe a timeout due to heavy system load.
5. Stop the IO congestion, wait several seconds, then reissue the systemctl command. There is a good chance of a timeout, after which all further systemctl commands time out.

I do not know enough about systemd to understand what went wrong, but I could work around it by running the operation in a systemd unit file with very low IO and CPU scheduling priority.

I'm curious to know: what sort of workload do the machines run?
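The comment does not show the unit file used for the workaround. As an illustration only, the Python sketch below renders a oneshot service that requests low CPU and IO priority through the `Nice=`, `CPUSchedulingPolicy=` and `IOSchedulingClass=` directives documented in systemd.exec(5). The template, the function name and the example command path are assumptions, and whether the idle IO class has any effect depends on the kernel's IO scheduler.

```python
# Hypothetical template for a low-priority oneshot service.
UNIT_TEMPLATE = """\
[Unit]
Description={description}

[Service]
Type=oneshot
ExecStart={exec_start}
# Lowest CPU priority: idle scheduling policy plus maximum niceness.
CPUSchedulingPolicy=idle
Nice=19
# Request the idle IO scheduling class for the job.
IOSchedulingClass=idle
"""


def render_low_priority_unit(description, exec_start):
    """Return the text of a unit file that runs exec_start at low priority."""
    return UNIT_TEMPLATE.format(description=description, exec_start=exec_start)
```

For example, `render_low_priority_unit("Heavy job", "/usr/local/bin/heavy-job")` (a made-up path) yields text that could be saved as a `.service` file and started with systemctl instead of launching the operation directly.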
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c6

--- Comment #6 from Robin Roth <robin.roth@kit.edu> ---
(In reply to Howard Guo from comment #5)

Hi Howard,

our issues might be related, but yours sounds a lot more reproducible. Our workload is diverse, and one of the affected machines was idle, so it is not only high IO but something else. I'm still trying to get something reproducible here.

This sounds more like your issue: https://github.com/systemd/systemd/issues/1353

You should probably open an issue with systemd. They are quite responsive, and if you have something reproducible there is a good chance of getting it fixed.
http://bugzilla.opensuse.org/show_bug.cgi?id=958346#c7

Robin Roth <robin.roth@kit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |UPSTREAM

--- Comment #7 from Robin Roth <robin.roth@kit.edu> ---
As discussed in https://github.com/systemd/systemd/issues/2200#issuecomment-168606293, the issue was a rather complex timer setup. While it shouldn't crash systemd, it's easily avoidable. Also, I am still not able to get a reproducible setup that immediately crashes systemd. Therefore this can be closed (in my opinion).