[openFATE 305860] Improve Apache graceful restart (remove potential race condition)
Feature changed by: Thorsten Kukuk (kukuk)
Feature #305860, revision 8
Title: Improve Apache graceful restart (remove potential race condition)

- openSUSE-11.2: Evaluation
+ openSUSE-11.2: Rejected by Thorsten Kukuk (kukuk)
+ reject date: 2009-09-15 13:14:36
+ reject reason: Run out of time.

Priority
Requester: Important

Requested by: Peter Poeml (poeml)

Description:
The Apache init script can signal Apache with SIGWINCH, which makes it
close its listen sockets and remove its pid file (in order to free all
resources that would prevent the start of another Apache) while it keeps
running and serves ongoing requests for GracefulShutdownTimeout seconds.
This allows the daemon to be replaced with minimal downtime. (Not to be
confused with a graceful restart, where the parent keeps running but
reloads the config and all modules.)

Great feature, and it generally works perfectly well with the following
snippet from the init script:

    stop-graceful)
        echo "Shutting down httpd2 gracefully (SIGWINCH)"
        if ! [ -f $pidfile ]; then
            echo -n "(not running)"
        else
            pid=$(<$pidfile)
            kill -WINCH $pid 2>/dev/null
            case $? in
                1)  echo -n "(not running)";;
                0)  # wait until the pidfile is gone. The parent stays there, but closes the listen ports.
                    echo -n "(waiting for parent to close listen ports and remove pidfile) "
                    for ((wait=0; wait<120; wait++)); do
                        if test -f $pidfile; then
                            usleep 500000
                            continue
                        else
                            break
                        fi
                    done
                    ;;
            esac
        fi

However, in some setups something delays the release of the listen
ports, so that a port is still in use even though the pid file is
already gone. I believe this happens in setups where external processes
are spawned (as mod_wsgi or mod_fastcgi do). That's where I have seen
it. A "sleep 1" before starting effectively works around it (but would
not be a reliable fix).

Maybe it makes sense to check whether the listen ports are still in use,
although those vary per listen.conf.

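The idea of checking the listen ports could be sketched roughly as follows. This is only an illustration under assumptions, not part of the shipped init script: the helper names port_in_use and wait_for_port_free are made up here, the port number would still have to be derived from listen.conf, and the probe relies on bash's /dev/tcp redirection and only tests 127.0.0.1, so it would miss a listener bound to another address.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; names and defaults are assumptions.

# Return 0 if something accepts TCP connections on 127.0.0.1:$1.
# Uses bash's /dev/tcp pseudo-device; the subshell closes fd 3 on exit.
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Poll until the port is free or the tries run out.
# $1 = port, $2 = max tries (default 120), $3 = delay in seconds (default 0.5)
# Returns 0 once the port is free, 1 if it is still in use after all tries.
wait_for_port_free() {
    local port=$1 tries=${2:-120} delay=${3:-0.5} i=0
    while [ "$i" -lt "$tries" ]; do
        port_in_use "$port" || return 0
        sleep "$delay"    # fractional seconds need a sleep that accepts them (e.g. GNU sleep)
        i=$((i + 1))
    done
    return 1
}
```

In the init script such a check could run after the pid file loop, for example as `wait_for_port_free 80 || echo -n "(port still in use)"`, where port 80 is just a placeholder for whatever listen.conf configures.
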
The shutdown order in server/mpm/worker/worker.c:ap_mpm_run() is
correct, but maybe one of the children isn't fast enough and isn't
waited for:

    /* Close our listeners, and then ask our children to do same */
    ap_close_listeners();
    ap_mpm_pod_killpg(pod, ap_daemons_limit, TRUE);
    ap_relieve_child_processes();

    if (!child_fatal) {
        /* cleanup pid file on normal shutdown */
        const char *pidfile = NULL;
        pidfile = ap_server_root_relative (pconf, ap_pid_fname);
        if ( pidfile != NULL && unlink(pidfile) == 0)
            [...]
    }

One of the children is indeed the /usr/sbin/fcgi-pm process in the setup
where I just encountered this again, and it has the listen port open. It
could be an oversight that the listen ports are not simply closed when
this process is forked. It should probably only communicate with the
other children over a unix domain socket.

Indeed, killing the fcgi process just before killing Apache makes the
restart work fine.

Still undecided whether this should be treated as an upstream bug,
because the graceful shutdown feature definitely cannot work under all
circumstances, namely for resources that can't be freed as long as the
old Apache is running.

Anyway, it would be cool to be able to use this extremely valuable
feature in more cases.

Relations:
- Apache graceful restart race condition (novell/bugzilla/id: 475482)
  https://bugzilla.novell.com/show_bug.cgi?id=475482

--
openSUSE Feature:
https://features.opensuse.org/305860