During deploys/upgrades, we put up an iptables rule to block incoming traffic, but then immediately start the upgrade without waiting for the LB to stop sending us traffic. At least for slb001, our health check config is 5s timeout, 30s interval, 10x unhealthy threshold. Depending on whether or not the timeout is ignored on ERR_CONN_REFUSED, that gives us somewhere between 300s (10 failed checks x 30s interval) and 350s (10 x 35s, if each check also waits out the full 5s timeout) to mark an instance as unhealthy and stop directing traffic to it. During a (failed) deploy to secure, this was the resulting behavior:
Notice that the errors were counted as "Backend Connection Errors" rather than "Sum HTTP 5xxs", and that average latency also spiked, so it's likely we also had to wait for the full 5s timeout to elapse on each healthcheck.
When I've built rolling deploys for apache instances behind ELBs in the past, I've done the following:
- Make the healthcheck settings as aggressive as possible (2s timeout, 5s interval, 2x unhealthy threshold) so an instance gets dropped from the LB in ~10 seconds (see the config sketch after this list).
- Change the healthcheck endpoint to look for the presence of a magic temp file and return a 5xx if it exists. (The endpoint is currently served directly from apache without hitting PHP, which is also a little risky on its own: a box with a broken PHP install will still pass the health check.)
- Change the deploy to: touch <magic_file_path>, sleep(20) (enough time for the LB to fail two checks and for in-flight requests to drain), do_upgrade(), rm <magic_file_path>. Both this and the healthcheck change are sketched after this list.
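If the LB in question is a classic ELB, the aggressive settings from the first bullet would look something like the following boto3 sketch. The LB name and healthcheck target are assumptions, not our actual config:

```python
# Sketch: tighten a classic ELB health check so a dead host is dropped
# in ~10s (2 failed checks x 5s interval). Names/paths are hypothetical.
import boto3

elb = boto3.client("elb")
elb.configure_health_check(
    LoadBalancerName="slb001",  # assumption: slb001 is a classic ELB
    HealthCheck={
        "Target": "HTTP:80/healthcheck",  # hypothetical healthcheck path
        "Interval": 5,
        "Timeout": 2,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)
```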
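And a minimal sketch of the last two bullets, assuming a Python deploy script. MAGIC_FILE's path, do_upgrade(), and the WSGI healthcheck handler are all placeholders; the real endpoint would be served by apache/PHP, but the logic is the same:

```python
# Sketch only: magic-file drain for the deploy, plus the healthcheck logic.
import os
import time

MAGIC_FILE = "/var/tmp/drain_me"  # hypothetical path

def healthcheck_app(environ, start_response):
    """WSGI sketch of the healthcheck: fail with a 5xx while draining."""
    if os.path.exists(MAGIC_FILE):
        start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
        return [b"draining\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

def do_upgrade():
    pass  # stand-in for the existing upgrade steps

def deploy(skip_drain=False):
    if not skip_drain:
        open(MAGIC_FILE, "w").close()  # "touch": healthchecks start failing
        time.sleep(20)                 # ~10s for the LB to drop us, plus margin
    try:
        do_upgrade()
    finally:
        if os.path.exists(MAGIC_FILE):
            os.remove(MAGIC_FILE)      # "rm": healthchecks pass again
```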
This has the advantage of working regardless of what LB we put in front of the web pool, as long as the LB does healthchecks and drops unhealthy hosts.
We would also need to add a --doitlive flag to the deploy script that skips the touch and sleep steps, for when we don't care about dropping requests and just need to push something out ASAP.
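Hooked into the deploy() sketch above, that's only a couple of lines (the argparse wiring is assumed, only the flag name comes from this proposal):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--doitlive", action="store_true",
                    help="skip the drain (touch + sleep); push out ASAP")
args = parser.parse_args()
deploy(skip_drain=args.doitlive)  # deploy() from the sketch above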
Alternatively/additionally, I'm pretty sure there's a reasonable way to ask apache to gracefully stop in a blocking way, so the deploy doesn't proceed while any requests are still in flight. The catch is that this risks blocking indefinitely and slowing down the deploy process (good if we're handling a huge file upload, bad if we're getting slowloris'd), so the wait needs a cap.
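One hedged way to do that: apachectl -k graceful-stop asks the parent process to exit once in-flight requests finish, and we can poll the old pid with a deadline so a slow client can't stall the deploy forever. The pid file path and the 60s cap below are assumptions:

```python
# Sketch: graceful-stop apache, but cap how long we wait for in-flight
# requests so a slowloris can't block the deploy indefinitely.
import os
import subprocess
import time

PID_FILE = "/var/run/apache2/apache2.pid"  # distro-dependent; assumption

def pid_alive(pid):
    try:
        os.kill(pid, 0)  # signal 0 = existence check, nothing is delivered
        return True
    except OSError:
        return False

def graceful_stop(max_wait=60):
    pid = int(open(PID_FILE).read().strip())
    subprocess.check_call(["apachectl", "-k", "graceful-stop"])
    deadline = time.time() + max_wait
    while pid_alive(pid) and time.time() < deadline:
        time.sleep(1)
    # False = we gave up waiting; caller decides whether to kill or abort
    return not pid_alive(pid)
```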