[openFATE 305860] Improve Apache graceful restart (remove potential race condition)

15 Sep 2009

      Feature changed by: Thorsten Kukuk (kukuk)
Feature #305860, revision 8
  Title: Improve Apache graceful restart (remove potential race
  condition)

- openSUSE-11.2: Evaluation
+ openSUSE-11.2: Rejected by Thorsten Kukuk (kukuk)
+ reject date: 2009-09-15 13:14:36
+ reject reason: Run out of time.
  Priority
      Requester: Important

  Requested by: Peter Poeml (poeml)

Description:
  The Apache init script is capable of killing Apache with SIGWINCH which
  makes it close listen sockets and remove its pid file (in order to free
  all resources that would prevent the start of another Apache) but keeps
  running and serves ongoing requests for GracefulShutdownTimeout
  seconds. This allows a replacement of the daemon with minimal downtime.
  (Not to be confused with a graceful restart, where the parent keeps
  running but reloads its configuration and all modules.)
  Great feature, and generally works perfectly well with the following
  snippet from the init script:
   stop-graceful)
          echo "Shutting down httpd2 gracefully (SIGWINCH)"
          if ! [ -f $pidfile ]; then
                  echo -n "(not running)"
          else
                  pid=$(<$pidfile)
                  kill -WINCH $pid 2>/dev/null
                  case $? in
                      1)  echo -n "(not running)";;
                      0)  # wait until the pidfile is gone. The parent stays there, but closes the listen ports.
                          echo -n "(waiting for parent to close listen ports and remove pidfile) "
                          for ((wait=0; wait<120; wait++)); do
                                  if test -f $pidfile; then
                                          usleep 500000
                                          continue
                                  else
                                          break
                                  fi
                          done
                          ;;
                  esac
          fi

  However, in some setups Apache spends time on something that introduces
  a delay around closing the listen ports and removing the pid file, so
  that the port is still in use even though the pid file is already gone.
  I believe this happens in setups where external processes are spawned
  (as mod_wsgi and mod_fastcgi do). That's where I have seen it. A
  sleep 1 before starting effectively works around it (but would not be a
  reliable fix). Maybe it makes sense to check whether the listen ports
  are still in use, although those vary with the configuration in
  listen.conf.
  The shutdown order in server/mpm/worker/worker.c:ap_mpm_run() is
  correct but maybe one of the children isn't fast enough and it isn't
  waited for:
          /* Close our listeners, and then ask our children to do same */
          ap_close_listeners();
          ap_mpm_pod_killpg(pod, ap_daemons_limit, TRUE);
          ap_relieve_child_processes();

          if (!child_fatal) {
              /* cleanup pid file on normal shutdown */
              const char *pidfile = NULL;
              pidfile = ap_server_root_relative (pconf, ap_pid_fname);
              if ( pidfile != NULL && unlink(pidfile) == 0)
                  [...]
          }

  One of the children is indeed the /usr/sbin/fcgi-pm process in the
  setup where I just encountered this again, and it has the listen port
  open.
  It could be an oversight to not simply close the listen ports when
  forking this process. It should probably only communicate over a unix
  domain socket with the other children. Indeed, killing the fcgi process
  just before killing Apache makes it restart fine.
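  To confirm which process still holds the listen socket in such a setup,
  one possible diagnostic is sketched below in bash. It is only an
  assumption about how one might check this on Linux: listen_inodes and
  pids_holding_inode are hypothetical helper names, the code relies on the
  socket inode in field 10 of /proc/net/tcp and on the "socket:[INODE]"
  links under /proc/PID/fd, and it normally needs root to read other
  processes' descriptors.

```shell
#!/bin/bash
# Hypothetical diagnostic helpers, not part of Apache or the init script.

# listen_inodes PORT [TABLE...]
# Prints the socket inode of every TCP socket in LISTEN state on PORT.
listen_inodes() {
    local port_hex f
    port_hex=$(printf '%04X' "$1")
    shift
    [ $# -gt 0 ] || set -- /proc/net/tcp /proc/net/tcp6
    for f in "$@"; do
        [ -r "$f" ] || continue
        # Field 2: local_address, field 4: state (0A = LISTEN),
        # field 10: socket inode.
        awk -v p="$port_hex" '
            $4 == "0A" { n = split($2, a, ":"); if (a[n] == p) print $10 }' "$f"
    done
}

# pids_holding_inode INODE [PROC_ROOT]
# Prints the PIDs that have a descriptor pointing at socket:[INODE].
pids_holding_inode() {
    local inode=$1 root=${2:-/proc} fd pid
    for fd in "$root"/[0-9]*/fd/*; do
        [ "$(readlink "$fd" 2>/dev/null)" = "socket:[$inode]" ] || continue
        pid=${fd#"$root"/}      # strip the proc root, leaving PID/fd/N
        echo "${pid%%/*}"       # keep only the PID component
    done | sort -un
}
```

  Running "listen_inodes 80" and feeding each inode to pids_holding_inode
  after the SIGWINCH should list any leftover child, such as the fcgi
  process manager described above, that inherited the socket.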
  I'm still undecided whether this should be treated as an upstream bug,
  because the graceful shutdown feature definitely cannot work under all
  circumstances, namely for resources that can't be freed as long as the
  old Apache is running. Anyway, it would be cool to use this extremely
  valuable feature in more cases.

  Relations:
  - Apache graceful restart race condition (novell/bugzilla/id: 475482)
  https://bugzilla.novell.com/show_bug.cgi?id=475482

-- 
openSUSE Feature: 
https://features.opensuse.org/305860

fate_noreply＠suse.de
