
Decrease pain and suffering caused by deploy/upgrade process
Open, Needs Triage, Public

Description

During deploys/upgrades, we put up an iptables rule to block incoming traffic, but then immediately go to work starting the upgrade. At least for slb001, our health check config is 5s timeout, 30s interval, 10x unhealthy threshold. Depending on whether or not the timeout is ignored on ERR_CONN_REFUSED, that gives us somewhere between 300s and 350s to mark an instance as unhealthy and stop directing traffic to it. During a (failed) deploy to secure, this was the resulting behavior:
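For reference, the arithmetic behind that window, as a quick sketch using the numbers above:

  # Back-of-the-envelope detection time with the current slb001 settings.
  # Which bound applies depends on whether the LB fails a refused connection
  # immediately or burns the full timeout first.
  interval_s = 30           # seconds between health checks
  timeout_s = 5             # per-check timeout
  unhealthy_threshold = 10  # consecutive failures before the host is pulled

  best_case = unhealthy_threshold * interval_s                 # 300s
  worst_case = unhealthy_threshold * (interval_s + timeout_s)  # 350s
  print(best_case, worst_case)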

Screen Shot 2017-04-20 at 1.03.47 PM.png (417×865 px, 61 KB)

Notice that the errors were counted as "Backend Connection Errors" instead of "Sum HTTP 5xxs", and that average latency also spiked, so it's likely we also had to wait for the 5s timeout to elapse on each healthcheck.

When I've built rolling deploys for apache instances behind ELB's in the past, I've done the following:

  1. Make the healthcheck settings as aggressive as possible (timeout 2s, interval 5s, unhealthy threshold 2x) to get an instance dropped from the LB in ~10 seconds.
  2. Change the healthcheck endpoint (currently served directly from apache instead of hitting PHP, which is also a little risky since a box with a bad PHP install will still pass the health check) to look for the presence of a magic temp file and return 5xx if it exists.
  3. Change the deploy to do touch <magic_file_path>, sleep(20), do_upgrade(), rm <magic_file_path>.

This has the advantage of working regardless of what LB we put in front of the web pool, as long as the LB does healthchecks and drops unhealthy hosts.
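That said, if the LB stays a classic ELB, the step-1 settings are one API call away. A sketch with boto3, for illustration; the load balancer name and exact values are placeholders:

  # Sketch: tighten the classic-ELB health check to roughly the step-1
  # settings (~10s to mark a host unhealthy). Names/values are placeholders.
  import boto3

  elb = boto3.client("elb")
  elb.configure_health_check(
      LoadBalancerName="web-pool",        # placeholder
      HealthCheck={
          "Target": "HTTP:80/status/",
          "Interval": 5,
          "Timeout": 2,
          "UnhealthyThreshold": 2,
          "HealthyThreshold": 2,
      },
  )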

We would also need to add a --doitlive flag to the deploy script to skip the touch and sleep steps if we don't care and just need to push something out ASAP.
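To make steps 2 and 3 plus that --doitlive escape hatch concrete, a rough deploy-side sketch (Python for illustration; the drain file path, upgrade command, and sleep duration are all placeholders):

  # Sketch of a deploy wrapper that drains the host out of the LB before
  # upgrading. The healthcheck endpoint (step 2) would return a 5xx whenever
  # DRAIN_FILE exists, so the LB stops sending traffic before we touch anything.
  import argparse
  import pathlib
  import subprocess
  import time

  DRAIN_FILE = pathlib.Path("/var/tmp/deploy-drain")  # placeholder path
  DRAIN_SECONDS = 20                                   # long enough for the LB to notice

  def deploy(doitlive: bool) -> None:
      if not doitlive:
          DRAIN_FILE.touch()              # healthcheck starts returning 5xx
          time.sleep(DRAIN_SECONDS)       # wait for the LB to pull us out
      try:
          subprocess.run(["./do_upgrade"], check=True)   # placeholder upgrade step
      finally:
          DRAIN_FILE.unlink(missing_ok=True)             # healthcheck goes green again

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument("--doitlive", action="store_true",
                          help="skip the drain and push immediately")
      deploy(parser.parse_args().doitlive)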

Alternatively/additionally, I'm pretty sure there's a reasonable way to ask apache to gracefully stop in a blocking way, so we don't proceed with the deploy as long as any requests are still in flight. That risks blocking indefinitely and slowing down the deploy process, though (good if we're handling a huge file upload, bad if we're getting slowloris'd).
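One way to bound that risk: ask for a graceful stop, but with a deadline. A sketch only; it assumes apachectl's graceful-stop and a conventional pid file location, neither of which may match our hosts:

  # Sketch: ask Apache to finish in-flight requests, but don't let the deploy
  # block forever if clients hold connections open.
  import os
  import pathlib
  import subprocess
  import time

  PIDFILE = pathlib.Path("/var/run/apache2/apache2.pid")  # assumed location
  DEADLINE_SECONDS = 60                                    # assumed deadline

  def stop_apache_gracefully() -> None:
      pid = int(PIDFILE.read_text().strip())
      subprocess.run(["apachectl", "-k", "graceful-stop"], check=True)
      deadline = time.time() + DEADLINE_SECONDS
      while time.time() < deadline:
          try:
              os.kill(pid, 0)              # signal 0: just check liveness
          except ProcessLookupError:
              return                       # parent exited; requests drained
          time.sleep(1)
      # Deadline blown (huge upload, slowloris, ...): stop waiting politely.
      subprocess.run(["apachectl", "-k", "stop"], check=True)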

Event Timeline

There's also a discussion to be had about generally increasing the robustness of the healthcheck: that endpoint could attempt a DB connection, look for a full disk, check that ntp status is reasonable, etc. This has tradeoffs in terms of extra CPU load, false negatives, etc., but there's probably a happy medium between Apache answering a bare RedirectMatch 200 "^/status/$" itself and generating a full page.
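As a strawman for that middle ground, a deeper check might look roughly like this (Python for illustration rather than the PHP the real endpoint uses; it assumes ntpstat is installed, and the DB host/port are placeholders):

  # Strawman "deeper" status check: a couple of cheap local checks plus a
  # TCP-level DB ping. Every extra check is another chance for a false
  # negative under load, so this is illustrative, not a recommendation.
  import shutil
  import socket
  import subprocess

  def db_ping(host: str = "127.0.0.1", port: int = 3306) -> bool:
      # Placeholder: only checks that something is listening on the MySQL port;
      # a real check would issue "SELECT 1" through the normal connection path.
      try:
          with socket.create_connection((host, port), timeout=1):
              return True
      except OSError:
          return False

  def healthy() -> bool:
      disk = shutil.disk_usage("/")
      if disk.free / disk.total < 0.05:        # effectively a full disk
          return False
      if subprocess.run(["ntpstat"], capture_output=True).returncode != 0:
          return False                          # clock not synchronized
      return db_ping()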

> I'm pretty sure there's a reasonable way to ask apache to gracefully stop in a blocking way

I don't know if this has changed, but this was the cutting-edge process I came up with that we used at Facebook for years, I believe:

  • Intentionally write an invalid configuration file so Apache cannot restart.
  • Gracefully restart it.
  • If it doesn't exit "soon", start murdering it.
  • Put the configuration file back where it was.
  • Start Apache.

It looks like there may now be a "graceful stop" command, though:

https://httpd.apache.org/docs/2.4/stopping.html#gracefulstop


Another possible question here is: how long does it take to pull a node out of an ELB with the AWS API? If it's very fast, maybe worthwhile to pursue actively managing nodes in the LB. But I'm guessing it's at least similar to the ~10s health check stuff.
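For what it's worth, the classic-ELB version of actively managing a node looks roughly like this (boto3 for illustration; the names and polling deadline are placeholders, and how long the loop actually takes is exactly the open question):

  # Sketch: pull one instance out of a classic ELB, then poll until the LB
  # reports it OutOfService.
  import time
  import boto3

  elb = boto3.client("elb")

  def drain_instance(lb_name: str, instance_id: str, timeout_s: int = 120) -> bool:
      elb.deregister_instances_from_load_balancer(
          LoadBalancerName=lb_name,
          Instances=[{"InstanceId": instance_id}],
      )
      deadline = time.time() + timeout_s
      while time.time() < deadline:
          states = elb.describe_instance_health(
              LoadBalancerName=lb_name,
              Instances=[{"InstanceId": instance_id}],
          )["InstanceStates"]
          if all(s["State"] == "OutOfService" for s in states):
              return True
          time.sleep(2)
      return False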

On health-check, vanilla Phabricator runs this:

https://secure.phabricator.com/source/phabricator/browse/master/src/applications/system/controller/PhabricatorStatusController.php

Phabricator-on-Phacility fakes it here:

https://secure.phabricator.com/source/services/browse/master/src/config/PhacilitySiteSource.php;2a06867ba829461f4fcb369350dc3bb3f75cebac$25-30

I believe we intercept the request early because we won't have a valid "Host" header, and all the "which instance are we supposed to be running?" logic would need to get more complicated if it also had to handle status checks. However, the custom check could just call the same code as the upstream check if we wanted to do more complicated checks.

Offhand, I can't recall a case where a more complicated status check would have improved behavior, though. I don't think we've hit situations where a small fraction of the web pool goes bad for some reason.

> Another possible question here is: how long does it take to pull a node out of an ELB with the AWS API? If it's very fast, maybe worthwhile to pursue actively managing nodes in the LB. But I'm guessing it's at least similar to the ~10s health check stuff.

I'm not a fan of this solution because:

  • It's AWS-specific engineering effort that doesn't port to any other LB
  • There are lots of fun races where some auto-scaling operation creates/destroys servers after you've fetched the list of servers needing to be deployed
  • I've been burned by AWS APIs timing out or enforcing aggressive rate limits and breaking my scripts
  • Long and variable latency between API calls returning successfully and seeing the change in reality

At a previous gig, their crazy system (using haproxy as the LB) was:

  1. Add a special AWS tag to the instance getting deployed
  2. Via cron on the LB hosts, fetch the list of tagged instances and generate a new haproxy config
  3. SIGHUP haproxy to pick up the new config

This was a constant adventure in new and exciting failure modes. I like the "wait for LB to take us out of service" approach because it degrades gracefully: the deploy always takes the same amount of time, we try really hard not to kill in-flight requests or serve 5xx to new requests, but in the worst case we kill a tiny number of in-flight requests while still completing the deploy.

Yeah, I don't like trying to mess with the LBs very much either. I recall the FB F5s (I think?) took like 75 seconds per API call and you needed to make 12 API calls to remove a host. I just figured AWS LBs might be some kind of special magic LBs that can add and remove elements from a list in about a second.

Waiting 10s also isn't too bad, but I think the F5s required you to wait 15 business days before they removed a host.

secure is hitting the magic override code in the second link since we deploy rSERVICES there too.

Oh, hrrm. Maybe that RedirectMatch is actually getting hit in practice by the LB; that's suuuper old. I get this from the host itself:

$ curl http://secure001.phacility.net/status/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>200 OK</title>
</head><body>
<h1>OK</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>

...which is a little confusing but probably the RedirectMatch?

I think we can remove that rule to get PHP-driven /status/ behavior and that it's just vestigial from long ago.

Oh, maybe not. Things are kind of weird on secure. We should probably sort that out and get that hitting "real" /status/ somehow.

I think you might need to send a Host header to trigger the RedirectMatch... but the LB certainly isn't sending one; it just knows IP/port/path. ¯\_(ツ)_/¯

I'll file a task for "unfuck status check".

This is also prrrrobably an awful idea, but in theory we could put, say, HAProxy on each host in front of Apache and have it queue connections while Apache was restarting. That feels like we're building a really flimsy house of cards, but it's something computers could technically do.

(We could also look at moving to nginx + php-fpm, which might let us do this in a less crazy way since php-fpm can be restarted separately and there's a clearer FCGI separation between them.)

Oh, and on this:

> we put up an iptables rule to block incoming traffic

I think we configure iptables rules, but don't block incoming web traffic. The behavior is just "put iptables rules into the expected production state (no traffic on weird ports)", not "block traffic" followed by a later "unblock traffic".

> This is also prrrrobably an awful idea, but in theory we could put, say, HAProxy on each host in front of Apache and have it queue connections while Apache was restarting. That feels like we're building a really flimsy house of cards, but it's something computers could technically do.

turtles-all-the-way-down-sam-hollingsworth.jpg (1×771 px, 132 KB)

Turtles aside, if we do that, we should go all-out and spin up a brand new Apache process listening on a new port, and have HAProxy seamlessly transition from the old process to the new one (and not terminate the old until post-deploy tire-kicking has concluded). And if we're going to go that far, we should remember we're already running in a virtualized environment and just spin up entirely new EC2 instances and kill the old ones when we're done.

A vaguely related question... how is the /status/ endpoint meant to work behind an ELB? AWS ELBs don't pass a Host header to the healthcheck endpoint, so by default it just returns a 500. I worked around this by using some nginx voodoo to rewrite the Host header if the User-Agent header matches ^ELB-HealthChecker/\d+\.\d+$, but this is less than ideal.