Page MenuHomePhabricator

All "web" host rate limited ELB `/status/` requests simultaneously after 11 months without configuration changes
Closed, ResolvedPublic

Description

  • At approximately 1:03PM, all 8 web hosts dropped out of the lb001 pool roughly simultaneously after rate limiting the ELB.
  • I "fixed" this by restarting the hosts at roughly 1:15PM, but there's no particular reason to believe the ELB won't retrigger a rate limit.

Notes:

  • Obviously, we shouldn't be subjecting ELB /status/ requests to rate limiting (but: they also shouldn't be getting anywhere close to the rate limit, which is something like hundreds of times higher than the request rate we expect from them).
  • I think there are no meaningful rate-limiting changes on our side in about ~11 months (since D18705 / T13003) and the hosts haven't been touched since Saturday morning. Currently unclear what triggered this.
  • Support queue reported this at 1:03PM, actual monitoring reported this at 1:09PM.

Plans:

  • I'm going to dig through the logs and see if something weird happened (ELBs issuing an unusual set of requests?).
  • I'll exempt ELB requests from rate limiting, probably by letting requests skip rate limiting if they have no X-Forwarded-For header. This should have the pleasant side effect of letting us drop the goofy hard-coded internal rate limiting IP list. This is a slightly involved change. I'll deploy this off-hours tonight if rate limiting does not re-trigger before then.

Event Timeline

epriestley created this task.

There are four rate limits, and I don't currently have enough information to figure out which one triggered. The rate limits are:

  • Forwarded address total rate (120 requests / minute / host).
  • Forwarded address open connections (16 simultaneous connections / host).
  • "Host" header request rate (120 requests / minute / host).
  • "Host" header total open connections (16 simultaneous connections / host).

These requests should not have a forwarded address or a Host header and should be exempt from both of these checks.

We also use RequestReadTimeout header=4-16,MinRate=4096 body=16,MinRate=16384 at the Apache level, but this should not affect /status/ checks and this returns a 408, not a 429, when it triggers.

The ELBs do seem to have an unusually high overall request rate. Here are requests to one host between 1:01PM and 1:02PM, when the rate limit triggered:

172.30.0.236 - - [10/Sep/2018:20:01:00 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:00 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:01 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:01 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:01 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:03 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:03 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:04 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:04 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:05 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:05 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:06 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:06 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:06 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:08 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:08 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:09 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:09 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:10 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:10 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:11 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:11 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:11 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:13 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:13 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:14 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:14 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:15 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:15 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:16 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:16 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:16 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:18 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:18 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:19 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:19 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:20 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:20 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:21 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:21 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:21 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:23 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:23 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:24 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:24 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:25 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:25 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:26 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:26 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:26 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:28 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:28 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:29 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:29 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:30 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:30 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:31 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:31 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:31 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:33 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:33 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:34 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.90 - - [10/Sep/2018:20:01:34 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.236 - - [10/Sep/2018:20:01:35 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.127 - - [10/Sep/2018:20:01:35 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.25 - - [10/Sep/2018:20:01:36 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.159 - - [10/Sep/2018:20:01:36 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.136 - - [10/Sep/2018:20:01:36 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.171 - - [10/Sep/2018:20:01:38 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.137 - - [10/Sep/2018:20:01:38 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 200 29
172.30.0.83 - - [10/Sep/2018:20:01:39 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.90 - - [10/Sep/2018:20:01:39 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.236 - - [10/Sep/2018:20:01:40 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.127 - - [10/Sep/2018:20:01:40 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.25 - - [10/Sep/2018:20:01:41 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.159 - - [10/Sep/2018:20:01:41 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.136 - - [10/Sep/2018:20:01:41 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.171 - - [10/Sep/2018:20:01:43 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.137 - - [10/Sep/2018:20:01:43 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.83 - - [10/Sep/2018:20:01:44 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.90 - - [10/Sep/2018:20:01:44 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.236 - - [10/Sep/2018:20:01:45 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.127 - - [10/Sep/2018:20:01:45 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.25 - - [10/Sep/2018:20:01:46 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.159 - - [10/Sep/2018:20:01:46 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.136 - - [10/Sep/2018:20:01:46 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.171 - - [10/Sep/2018:20:01:48 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.137 - - [10/Sep/2018:20:01:48 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.83 - - [10/Sep/2018:20:01:49 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.90 - - [10/Sep/2018:20:01:49 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.236 - - [10/Sep/2018:20:01:50 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.127 - - [10/Sep/2018:20:01:50 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.25 - - [10/Sep/2018:20:01:51 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.159 - - [10/Sep/2018:20:01:51 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.136 - - [10/Sep/2018:20:01:51 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.171 - - [10/Sep/2018:20:01:53 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.137 - - [10/Sep/2018:20:01:53 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.83 - - [10/Sep/2018:20:01:54 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.90 - - [10/Sep/2018:20:01:54 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.236 - - [10/Sep/2018:20:01:55 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.127 - - [10/Sep/2018:20:01:55 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.25 - - [10/Sep/2018:20:01:56 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.159 - - [10/Sep/2018:20:01:56 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.136 - - [10/Sep/2018:20:01:56 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.171 - - [10/Sep/2018:20:01:58 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.137 - - [10/Sep/2018:20:01:58 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.83 - - [10/Sep/2018:20:01:59 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81
172.30.0.90 - - [10/Sep/2018:20:01:59 +0000] [172.30.0.14] "GET /status/ HTTP/1.1" 429 81

So the ELBs made 108 requests in the minute leading up to the rate limit being triggered. The configured interval is 5 seconds (i.e., 12 requests / minute), so this is roughly 9x higher than I'd naively expect.

I'll exempt ELB requests from rate limiting, probably by letting requests skip rate limiting if they have no X-Forwarded-For header.

This is already the (intended) behavior so it's also possible something has changed here. I'm going to look at the practical behavior of preamble_get_forwarded_address() and preamble_get_host() in production today as a next step.

This should have the pleasant side effect of letting us drop the goofy hard-coded internal rate limiting IP list.

(This is also not accurate -- the issue with this list is we currently go out and then back in through the internet-facing LB, so we expect to pick up an X-Forwarded-For and have to continue dealing with this nonsense until we get an internal-facing LB in place.)

... in production today as a next step.

I'm putting this into common.preamble.php in production on web008 now via local edit:

if ($path == '/status/') {
  error_log(
    print_r(
      array(
        'f' => $forwarded_address,
        'h' => preamble_get_host(),
      ), true));
}
...
[Mon Sep 10 20:48:43.928021 2018] [:error] [pid 21570] [client 172.30.0.171:16516] Array\n(\n    [f] => \n    [h] => 172.30.0.60\n)\n
...

So the X-Forwarded-For is blank, but the ELB is setting the Host header to the server IP address. This seems like reasonable behavior from the ELB.

I'm going to change the Host header limit to require that X-Forwarded-For also be present.

D19653 (above) changes the per-"Host" rate limit to require "X-Forwarded-For" be present in the request. This should exempt ELB requests from these limits.

Here's the request rate leading up to the rate limiting:

$ tail -n 10000 /var/log/apache2/access_log_vhost.log.1 | grep /status/ | awk '{print $4 " " $10}' | cut -c14-18 | sort | uniq -c
...
     96 19:52
     96 19:53
     96 19:54
     96 19:55
     96 19:56
     96 19:57
    106 19:58
    108 19:59
    108 20:00
    107 20:01 <<< Rate Limit Activated
    107 20:02
    108 20:03
    108 20:04
    107 20:05
     84 20:06
     84 20:07
     84 20:08
     84 20:09
     84 20:10
     84 20:11
     84 20:12
     84 20:13
     43 20:14 <<< Host Restarted

My speculation is:

  • ELBs are implemented as a number of virtual machines.
  • Each ELB machine instance does its own health checks.
  • Until 12:57 PM, the ELB had 8 instances. Each instance health-checked every 5 seconds, i.e. 12 requests/minute. 8 hosts * 12 requests = 96 requests/minute.
  • At 12:58 PM, the ELB automatically scaled up to a 9th instance. This pushed us to an expected request rate of 9 hosts * 12 requests = 108 requests/minute.
  • Not shown above is that each ELB also makes an OPTIONS request once per minute, so the actual expected rate is (9 hosts * 12 health checks/minute) + (9 hosts * 1 options check/minute) = 117 requests/minute.
  • This triggered the 120 requests/minute rate limit (e.g., a few requests ended up in a different window and peaked over 120 in a 60-second window).

And the ELBs have been subject to this limit since we put it in place, we just presumably haven't autoscaled to 9 ELB instances before. Offhand, it looks like we've roughly doubled traffic in the last 11 months so this seems reasonable.

This means we're immediately at risk for the ELBs to get rate limited again, but it's end-of-business in Eastern time now and we're on the down-slope of total request rate, so I'm going to guess that we'll be okay and wait ~4 hours (until after EOB in PST) to deploy D19653 since it will be slightly disruptive.

I've issued all instances a 24-hour service credit for the disruption. This should be reflected on your next invoice.

Deploying the changes to web now.

epriestley claimed this task.

This is now live.