Bug ID | 1219667 |
---|---|
Summary | Race condition preventing Nagios restart |
Classification | openSUSE |
Product | openSUSE Distribution |
Version | Leap 15.5 |
Hardware | x86-64 |
OS | openSUSE Leap 15.5 |
Status | NEW |
Severity | Major |
Priority | P5 - None |
Component | Other |
Assignee | screening-team-bugs@suse.de |
Reporter | technik@expeedo.de |
QA Contact | qa-bugs@suse.de |
Target Milestone | --- |
Found By | --- |
Blocker | --- |
Created attachment 872527
Journal excerpt of nagios restart failure
Our Nagios configuration is updated automatically by our server management
system. A cron job checks for updates and restarts Nagios when needed.
These automatic restarts began to fail occasionally after we migrated our Nagios
setup to Leap 15.5 on new hardware (previously 15.4 on rather old hardware), so
I cannot tell whether this was introduced by 15.5 or merely uncovered by the
much faster machine.
Restart failures occur only occasionally, and in every single case a subsequent
manual service restart worked without issues.
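Since a second restart attempt has always succeeded so far, the cron job could retry automatically as a stopgap. The following is only a minimal sketch: the `retry` helper is hypothetical, and the unit name `nagios` in the usage comment is an assumption. It works around the symptom and does not fix the underlying problem.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, waiting $2 seconds
# between attempts. Returns 0 on the first success, 1 if all attempts fail.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        # Run the remaining arguments as the command to try.
        "$@" && return 0
        # Give up once the attempt limit is reached.
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Possible use in the cron job (unit name is an assumption):
# retry 3 5 systemctl restart nagios
```

A short delay between attempts gives the failed unit time to settle before the next restart is issued.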
In every case of a restart failure, the journal looked the same (latest excerpt
attached). It always failed on the following message (translated here from the
German locale output):
------------------------------------------------------------------------
nagios-exec-start-post … chown: cannot access '/var/lib/nagios/status.dat': No
such file or directory
------------------------------------------------------------------------
…and that file then indeed wasn't present.
The script /usr/lib/nagios/nagios-exec-start-post runs the chown immediately
after checking for the file and touching it if it is missing:
------------------------------------------------------------------------
# set default access rights for files and directories
for file in "$log_file" "$state_retention_file" "$status_file"; do
    if [ ! -e "$file" ]; then
        touch "$file"
    fi
    chown --no-dereference ${nagios_user}:${nagios_cmdgrp} "$file"
done
------------------------------------------------------------------------
So there appears to be a race condition between the script and the already
running Nagios main process, which sometimes seems to delete the status file
between the script's existence check and its chown.
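To illustrate one way the loop could tolerate a file vanishing between the check and the chown, here is a minimal sketch. The function name `set_owner_with_retry`, the retry count, and the one-second pause are all hypothetical and not part of the packaged script. It only narrows the window; it does not explain why the main process removes the file.

```shell
#!/bin/sh
# Hypothetical helper, not part of the packaged script: create the file if
# it is missing and set its owner, retrying when the file vanishes between
# the two steps.
set_owner_with_retry() {
    owner=$1
    file=$2
    attempts=${3:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Recreate the file if it is currently missing; give up if that fails.
        [ -e "$file" ] || touch "$file" || return 1
        # -h is the short form of --no-dereference. chown fails here if the
        # file was removed after the check above, so loop and try again.
        if chown -h "$owner" "$file" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "could not set owner of $file after $attempts attempts" >&2
    return 1
}
```

In the script's loop, the body would then reduce to a single call such as `set_owner_with_retry "${nagios_user}:${nagios_cmdgrp}" "$file"`.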
I've set the severity to "major" because this causes our previously stable
monitoring service to fail and require manual intervention.
We're now testing service reload instead of restart, which does not seem to have
that issue. I'm not sure we can use reload in production though, as we had
issues with reload on complex Nagios configuration changes in the past.
Regards,
Michael Balzer