- At approximately 1:03PM, all 8 web hosts dropped out of the lb001 pool roughly simultaneously after they started rate limiting requests from the ELB.
- I "fixed" this by restarting the hosts at roughly 1:15PM, but there's no particular reason to believe the ELB won't retrigger a rate limit.
Notes:
- Obviously, we shouldn't be subjecting ELB /status/ requests to rate limiting (but: they also shouldn't be getting anywhere close to the rate limit, which is something like hundreds of times higher than the request rate we expect from them).
- I think there have been no meaningful rate-limiting changes on our side in about 11 months (since D18705 / T13003), and the hosts haven't been touched since Saturday morning. It's currently unclear what triggered this.
- The support queue reported this at 1:03PM; actual monitoring reported it at 1:09PM.
Plans:
- I'm going to dig through the logs and see if something weird happened (ELBs issuing an unusual set of requests?).
- I'll exempt ELB requests from rate limiting, probably by letting requests skip rate limiting if they have no X-Forwarded-For header. This should have the pleasant side effect of letting us drop the goofy hard-coded internal rate limiting IP list. This is a slightly involved change. I'll deploy this off-hours tonight if rate limiting does not re-trigger before then.
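
For reference, the exemption check described in the last plan item could look roughly like the sketch below. This is Python for illustration only, not the actual change: the function name `is_exempt_from_rate_limit` and the plain header mapping are hypothetical, and it assumes the web hosts are only reachable through the load balancer, so a request without X-Forwarded-For can only have come from the ELB itself.

```python
from typing import Mapping


def is_exempt_from_rate_limit(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries no X-Forwarded-For header.

    Assumption: requests proxied on behalf of a client always carry
    X-Forwarded-For, while requests the load balancer issues itself
    (such as /status/ checks) do not.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name in headers:
        if name.lower() == "x-forwarded-for":
            # A forwarded client request: apply normal rate limiting.
            return False
    # No forwarding header: treat as a load balancer request and skip limits.
    return True
```

A request handler would call this before doing any rate-limit accounting and skip the accounting entirely when it returns True, which is also what would make the hard-coded internal IP list unnecessary.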