[Bug 1219667] New: Race condition preventing Nagios restart
https://bugzilla.suse.com/show_bug.cgi?id=1219667

Bug ID: 1219667
Summary: Race condition preventing Nagios restart
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.5
Hardware: x86-64
OS: openSUSE Leap 15.5
Status: NEW
Severity: Major
Priority: P5 - None
Component: Other
Assignee: screening-team-bugs@suse.de
Reporter: technik@expeedo.de
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Created attachment 872527
  --> https://bugzilla.suse.com/attachment.cgi?id=872527&action=edit
Journal excerpt of nagios restart failure

Our Nagios configuration gets updated automatically by our server management system. A cron job checks for updates and restarts Nagios accordingly.

These automatic restarts began having issues after we migrated our Nagios setup to Leap 15.5 on new hardware, from 15.4 on rather old hardware, so I cannot tell whether this was introduced by 15.5 or whether the effect was merely uncovered by the much faster machine.

Restart failures occur only occasionally, and in every single case a subsequent manual service restart worked without issues.

In every case of a restart failure, the journal looked the same (last excerpt attached); it always failed on:

------------------------------------------------------------------------
nagios-exec-start-post … chown: Zugriff auf '/var/lib/nagios/status.dat' nicht möglich: Datei oder Verzeichnis nicht gefunden
------------------------------------------------------------------------

(In English: chown: cannot access '/var/lib/nagios/status.dat': No such file or directory)

…and that file then indeed wasn't present.

The script /usr/lib/nagios/nagios-exec-start-post runs the chown immediately after checking for the file and touching it if it is missing:

------------------------------------------------------------------------
# set default access rights for files and directories
for file in "$log_file" "$state_retention_file" "$status_file"; do
    if [ ! -e "$file" ]; then
        touch "$file"
    fi
    chown --no-dereference ${nagios_user}:${nagios_cmdgrp} "$file"
done
------------------------------------------------------------------------

So there seems to be a race condition between the script and the already running nagios main process, which sometimes seems to delete the status file just before the script's chown.

I've tagged this with severity "major" because it causes our previously stable monitoring service to fail and require manual intervention.

We're now testing service reload instead of restart, which does not seem to have that issue. I'm not sure we can use reload in production though, as we have had issues with reload on complex nagios configuration changes in the past.

Regards,
Michael Balzer
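One way the check-then-chown race described above could be tolerated is to retry when the file vanishes between the existence check and the chown. The following is only a minimal sketch under stated assumptions, not the packaged script and not a confirmed fix: the helper name `ensure_owned`, the retry count, and the one-second pause are all hypothetical, and `-h` is used as the short form of `--no-dereference`. It also does not address why the status file disappears in the first place.

```shell
#!/bin/sh
# Hypothetical helper (not part of the nagios package): make sure a file
# exists and is owned by the given owner, retrying if the file disappears
# between the existence check and the chown.
#   $1 = owner in chown syntax (user:group)
#   $2 = file path
#   $3 = maximum number of attempts (assumed default: 5)
ensure_owned() {
    eo_owner=$1
    eo_file=$2
    eo_attempts=${3:-5}
    eo_try=0
    while [ "$eo_try" -lt "$eo_attempts" ]; do
        # Create the file only if it is missing; give up if that fails.
        [ -e "$eo_file" ] || touch "$eo_file" || return 1
        # -h is the short form of --no-dereference.
        # stderr is discarded here, so the caller should log failures.
        if chown -h "$eo_owner" "$eo_file" 2>/dev/null; then
            return 0
        fi
        # chown failed although the file still exists: this is not the
        # race (for example a permission problem), so do not retry.
        if [ -e "$eo_file" ]; then
            return 1
        fi
        # The file vanished in between: pause briefly and try again.
        eo_try=$((eo_try + 1))
        sleep 1
    done
    return 1
}
```

In the loop quoted in the report, the `if`/`touch`/`chown` body could then be replaced by a single call such as `ensure_owned "${nagios_user}:${nagios_cmdgrp}" "$file"`. Whether retrying is acceptable at that point of the service start is a judgment for the package maintainers.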