Created attachment 872527
Journal excerpt of nagios restart failure

Our Nagios configuration gets updated automatically by our server management
system. A cron job checks for updates and restarts Nagios accordingly.

These automatic restarts began having issues after we migrated our Nagios setup
from Leap 15.4 on rather old hardware to Leap 15.5 on new hardware, so I cannot
tell whether the problem was introduced by 15.5 or merely uncovered by the much
faster machine.

Restart failures occur only occasionally, and in every single case a
subsequent manual service restart worked without issues.

In every case of a restart failure, the journal looked the same (latest
excerpt attached); it always failed on:

------------------------------------------------------------------------
nagios-exec-start-post … chown: Zugriff auf '/var/lib/nagios/status.dat' nicht
möglich: Datei oder Verzeichnis nicht gefunden
(English: chown: cannot access '/var/lib/nagios/status.dat': No such file or
directory)
------------------------------------------------------------------------

…and that file then indeed wasn't present.

The script /usr/lib/nagios/nagios-exec-start-post runs the chown immediately
after checking for the file and creating it with touch if it is missing:

------------------------------------------------------------------------
# set default access rights for files and directories
for file in "$log_file" "$state_retention_file" "$status_file"; do
    if [ ! -e "$file" ]; then
        touch "$file"
    fi
    chown --no-dereference ${nagios_user}:${nagios_cmdgrp} "$file"
done
------------------------------------------------------------------------

So there seems to be a race condition between the script and the already
running Nagios main process, which apparently sometimes deletes the status
file just before the script's chown.
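
A minimal sketch of how the touch/chown step could tolerate the file vanishing
in between, assuming the file only disappears briefly. The helper name
fix_owner and the retry count are hypothetical and not part of the packaged
script; this only narrows the window and does not address why the file gets
deleted:

```shell
#!/bin/sh
# Sketch only: fix_owner is a hypothetical helper, not part of the
# packaged nagios-exec-start-post script.
# Usage: fix_owner OWNER:GROUP FILE [ATTEMPTS]
fix_owner() {
    owner="$1"
    file="$2"
    attempts="${3:-3}"    # assumed default of three tries

    while [ "$attempts" -gt 0 ]; do
        # Recreate the file if it is missing at this moment.
        [ -e "$file" ] || touch "$file" 2>/dev/null
        # If the file vanished again before chown ran, chown fails
        # and the loop simply tries once more.
        chown --no-dereference "$owner" "$file" 2>/dev/null && return 0
        attempts=$((attempts - 1))
    done

    echo "fix_owner: could not set owner of '$file'" >&2
    return 1
}
```

Inside the loop quoted above, the touch/chown pair would then become a single
call such as fix_owner "${nagios_user}:${nagios_cmdgrp}" "$file".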

I've tagged this with severity "major", because it causes our previously stable
monitoring service to fail and require manual intervention.

We're now testing service reload instead of restart, which does not seem to
have that issue. I'm not sure we can use reload in production though, as we
had issues with reload after complex Nagios configuration changes in the past.
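
Since a second restart attempt has worked in every case so far, another
stopgap would be to have the cron job retry once. A rough sketch, assuming
the restart is done via systemctl and the unit is named nagios; retry_cmd and
the delay value are hypothetical:

```shell
#!/bin/sh
# Sketch only: retry_cmd is a hypothetical helper for the cron job.
# Runs a command; if it fails, waits DELAY seconds and runs it once more.
# Usage: retry_cmd DELAY COMMAND [ARGS...]
retry_cmd() {
    delay="$1"
    shift
    "$@" && return 0
    sleep "$delay"
    # The exit status of the second attempt is the final result.
    "$@"
}

# Example call (unit name and delay are assumptions):
# retry_cmd 10 systemctl restart nagios
```

This only masks the failure; the race in the post-start script remains.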

Regards,
Michael Balzer

Bug ID	1219667
Summary	Race condition preventing Nagios restart
Classification	openSUSE
Product	openSUSE Distribution
Version	Leap 15.5
Hardware	x86-64
OS	openSUSE Leap 15.5
Status	NEW
Severity	Major
Priority	P5 - None
Component	Other
Assignee	screening-team-bugs@suse.de
Reporter	technik@expeedo.de
QA Contact	qa-bugs@suse.de
Target Milestone	---
Found By	---
Blocker	---