Process slot exhaustion in Phacility web tier
Closed, Resolved · Public

Description

An instance is repeatedly attempting to perform ~2,500 simultaneous file uploads. This is exhausting Apache process slots on the Phacility web tier and causing load balancers to fail health checks, leading to sites being inaccessible.

This problem is well understood, but applying an effective rate limit is difficult.

We've taken these steps to mitigate the issue until it can be fixed properly:

  • Added additional web tier capacity.
  • Temporarily disabled the "file.allocate" API endpoint.

We are pursuing a more durable fix based on rate limiting for this class of requests.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

rPHU libphutil
  • D18704
  • D18701
  • D18700
  • D18699

rP Phabricator
  • D18708
  • D18703
  • D18702

Event Timeline

epriestley added a revision: Restricted Differential Revision. Oct 10 2017, 6:58 PM
epriestley added a commit: Restricted Diffusion Commit. Oct 10 2017, 6:59 PM
epriestley added a revision: Restricted Differential Revision. Oct 10 2017, 7:09 PM
epriestley added a commit: Restricted Diffusion Commit. Oct 10 2017, 7:10 PM

There's also a bug with arc here. It's supposed to limit simultaneous uploads to 4:

$iterator = id(new FutureIterator($futures))->limit(4);

However, callMethod() calls isReady() before returning, starting the future immediately. So we actually execute with full parallelism -- and, worse, we only service 4 of the futures at once, which probably means the others take longer to resolve. I'll fix this.

I used ab to verify the rate limiting code here, and got killed after a bit of abuse with ab -c 10 -n 200 ...:

[ec2-user@db001e ~]$ curl -v https://secure.phabricator.com/
...
< HTTP/1.1 429 Too Many Requests
...
TOO MANY REQUESTS
You ("52.53.252.205") are issuing too many requests too quickly.
* Connection #0 to host secure.phabricator.com left intact

I used curl --limit-rate ... to test the connection limit stuff. This one was a little less conclusive: I was able to trigger the TOO MANY CONNECTIONS limit and I didn't kill secure, but the behavior isn't completely clean, and I'm not sure how much of that is nohup, curl, etc., versus our behavior. For example:

for i in `seq 1 256`; do (nohup curl --limit-rate 64 -F "zero=@zero.txt" https://secure.phabricator.com/ 2>/dev/null &); done

Since I was doing this from a micro instance, it also had a tendency to freeze the instance and OOM-kill my shell.

But since I wasn't able to kill secure, I think this at least mostly works in practice.

On the server side, the connection test looks better: secure001 saw 8 HTTP 500s (this might be a bug in the multipart/form-data parser for whatever cURL is doing) and then 135 HTTP 429s, which is more or less exactly what the code is supposed to do. So I think most of the inconclusiveness about what was happening on the client side was curl/nohup stuff.
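
(For reference, tallying responses by status code from the access log can be done with something like the line below. This assumes the vhost log is in the standard combined format, where the status code is the ninth field; the log path is a guess based on the grep later in this task.)

# Count responses by HTTP status code; field 9 is the status in the
# standard combined log format.
awk '{ print $9 }' /var/log/apache2/access_log_vhost.log | sort | uniq -c | sort -rn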

(I also verified that real multipart/form-data works fine from a browser, by changing a project's profile picture here.)

It's currently possible for cluster hosts to hit rate limits against admin while doing normal deployment stuff, since they may make a lot of requests to admin very quickly.

There isn't really a great way to filter this traffic. I think I'm just going to whitelist all our external IPs for now, although this is pretty garbage. A cleaner way to do this would probably be to have an internal load balancer for internal traffic (e.g., admin-internal.phacility.com) and whitelist all traffic through it. (Or, if NAT in T12816 happens first, we could whitelist just the outward-facing NAT IP.)

We could/should also reduce how many calls normal deployment things need to make, but sometimes this is tricky. For example, bin/host destroy calls bin/storage destroy for each instance it's destroying, and that makes a call to get instance data. It's hard to pass the instance data down from the parent command without making this call.

This is working somewhat better than before, but I'm still able to wedge apache with enough requests using a pre-patch version of libphutil. I'm going to take a more detailed look at local behavior for this specific request pattern.

I can't get this to wedge locally: apache completes these requests and closes the connections. I'm going to configure mod_status in production to try to get a clearer picture of what's going on.
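
(Once mod_status is up, pulling the scoreboard looks something like the line below; the /server-status location and the ExtendedStatus requirement are the module's conventional defaults, not necessarily how this deployment is configured.)

# Machine-readable scoreboard summary. Per-worker request detail, like the
# table further down, additionally requires "ExtendedStatus On".
curl -s 'http://localhost/server-status?auto'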

epriestley added a commit: Restricted Diffusion Commit. Oct 16 2017, 2:23 PM

I'm also unable to wedge secure. mod_status is working there, but since I can't wedge it it isn't very useful.

epriestley added a commit: Restricted Diffusion Commit. Oct 16 2017, 2:36 PM

I was able to get web001 to wedge a lot of requests in "R" ("Reading Request") -- note "SS" is "seconds since beginning of most recent request".

Srv	PID	Acc	M	CPU	SS	Req	Conn	Child	Slot	Client	VHost	Request
...
2-0	31331	0/0/60	R	0.00	105	0	0.0	0.00	0.20	172.30.0.28		
3-0	31370	0/0/43	R	0.00	100	0	0.0	0.00	0.16	172.30.0.28		
4-0	31406	0/0/3	R	0.00	100	0	0.0	0.00	0.00	172.30.0.189		
5-0	31412	0/0/11	R	1.19	99	0	0.0	0.00	0.02	172.30.0.28		
6-0	27404	0/4/4	R	0.49	123	1574	0.0	0.01	0.01	172.30.0.189		
7-0	31121	0/2/5	R	0.20	106	226	0.0	0.00	0.01	172.30.0.189		
8-0	31413	0/0/2	R	0.00	99	0	0.0	0.00	0.00	172.30.0.28		
9-0	31123	0/3/4	R	0.45	99	1337	0.0	0.18	0.18	172.30.0.28		
10-0	31437	0/0/2	R	0.00	99	0	0.0	0.00	0.00	172.30.0.189		
11-0	27804	0/9/9	R	1.23	98	14	0.0	0.02	0.02	172.30.0.28		
12-0	27811	0/2/2	R	0.45	105	247	0.0	0.01	0.01	172.30.0.189		
13-0	27812	0/1/1	R	0.09	124	1121	0.0	0.00	0.00	172.30.0.28		
14-0	27882	0/5/5	R	0.69	98	100	0.0	0.01	0.01	172.30.0.28		
15-0	27893	0/7/7	R	1.11	115	217	0.0	0.01	0.01	172.30.0.189		
16-0	31445	0/0/2	R	0.10	97	0	0.0	0.00	0.00	172.30.0.28		
17-0	27948	0/1/1	R	0.22	118	1546	0.0	0.00	0.00	172.30.0.189		
18-0	27949	0/1/1	R	0.20	116	1275	0.0	0.00	0.00	172.30.0.28	
...

Only one request appears to have reached the access log.

This is pretty puzzling since secure and web should have the same setup and same versions of everything, and slb001 and lb001 have essentially the same configuration, but I don't see this behavior on secure.

It looks like we can use mod_reqtimeout, a builtin module, to tailor this behavior.
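
(Roughly, checking that the module is present and what it enforces looks like the lines below. The paths assume an Amazon Linux style httpd layout, and the header=4 value is only illustrative, mirroring the 4-second timeout discussed later rather than the actual production setting.)

# Confirm mod_reqtimeout is loaded, then find the RequestReadTimeout directive
# that bounds how long Apache waits for request headers and body.
httpd -M 2>/dev/null | grep -i reqtimeout
grep -Rni 'RequestReadTimeout' /etc/httpd/
# e.g. RequestReadTimeout header=4 body=10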

epriestley added a commit: Restricted Diffusion Commit. Oct 16 2017, 3:04 PM
epriestley added a commit: Restricted Diffusion Commit. Oct 16 2017, 3:10 PM

The behavior of mod_reqtimeout is weird in production. Some of this is the LB, but some of it I don't have a good explanation for.

Locally, if I telnet localhost 80 from my machine and don't type anything with a 4s timeout configured, I get a timeout after 4s.

If I telnet secure.phabricator.com 80 and don't type anything from my laptop, I get a timeout after about a minute (maybe at the LB).

If I telnet localhost 80 from secure001 and don't type anything, I get a timeout after about 30s. This isn't hitting the LB, so it doesn't make much sense to me. Best guess: the production version of mod_reqtimeout is a little different from my local version, and doesn't work on requests with no bytes.

If I repeat the tests and type a GET / HTTP/1.0\r\n header but no second newline, I get a timeout after about 4 seconds from the time I hit return if I'm starting from secure001. If I'm starting locally, it takes a minute to time out. This is probably also the LB. The LB might also have different HTTP vs HTTPS rules/behaviors.
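
(For concreteness, the manual test looks roughly like this; the transcript is illustrative rather than a capture from these hosts.)

$ telnet secure.phabricator.com 80
Trying ...
Connected to secure.phabricator.com.
Escape character is '^]'.
GET / HTTP/1.0              <- type the request line, press return once, then wait
Connection closed by foreign host.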

If I use openssl s_client -connect secure.phabricator.com:443 and send a header from secure001, I get a timeout after 60s. Locally, my openssl is busted because of some kind of Apple/SecureTransport/Homebrew mess, so who knows.

To the degree that the LB is just holding connections, that's totally fine: they aren't taking apache process slots, and the LB should reasonably be able to handle a very large number of simultaneous connections. So I'm not too worried about this stuff as long as it seems like LB behavior, rather than Apache ignoring the directive. Apache is respecting the directive at least some of the time, so I'm going to send it to production.


Additionally, I'm not totally sure that what I captured in mod_status is actually me. Something is making a lot of requests (~10K requests/day?) that just time out:

$ grep 408 /var/log/apache2/access_log_vhost.log.1 | wc -l
44802

I think mod_reqtimeout will let us deal with this stuff a little more aggressively.
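
(A hypothetical follow-up for figuring out where these are coming from: group the timed-out requests by client address. As above, this assumes the standard combined log format, with the client in field 1 and the status in field 9; behind the load balancer this may just show the LB's internal addresses unless the log records X-Forwarded-For.)

# Top client addresses for requests that ended in HTTP 408.
awk '$9 == 408 { print $1 }' /var/log/apache2/access_log_vhost.log.1 | sort | uniq -c | sort -rn | head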

epriestley added a commit: Restricted Diffusion Commit. Oct 16 2017, 3:39 PM

Something is making a lot of requests

The mod_reqtimeout change seems to have helped with these, at least. They're still happening, but getting kicked off after 4 seconds and not consuming 20+ process slots all the time.


I'm also no longer able to wedge the tier with a pre-patch arc upload <hundreds and hundreds of files>. I don't really have a complete picture of how these requests, general slot exhaustion from the ambient connection attempts, and mod_reqtimeout are interacting, but it seems like we're in better shape now.

epriestley lowered the priority of this task from Unbreak Now! to Normal. Oct 16 2017, 3:55 PM

I'm going to call this tentatively done and leave it open for a week or two to see if there's any fallout.

This made one thing meaningfully worse: the common preamble now has a hard-coded list of all cluster hosts. Not a huge deal, but something we should get rid of eventually.

This has been stable for about a week now.