Process slot exhaustion in Phacility web tier
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	epriestley
	Oct 10 2017, 6:55 PM

Description

An instance is repeatedly attempting to perform ~2,500 simultaneous file uploads. This is exhausting Apache process slots on the Phacility web tier and causing load balancers to fail health checks, leading to sites being inaccessible.

This problem is well understood, but applying an effective rate limit is difficult.

We've taken these steps to mitigate the issue until it can be fixed properly:

Added additional web tier capacity.
Temporarily disabled the "file.allocate" API endpoint.

We are pursuing a more durable fix based on rate limiting for this class of requests.

Revisions and Commits

		Restricted Diffusion Commit
		Restricted Diffusion Commit
		Restricted Diffusion Commit
		Restricted Diffusion Commit
		Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
rPHU libphutil
	D18704	rPHUa6e490ef7ee0 Fix a bug which prevented Conduit futures from having parallelism limited…
	D18701	rPHU260dcbd6b093 Stream "multipart/form-data" file uploads to disk
	D18700	rPHUf46b445bdc5b Add a rough "multipart/form-data" stream parser
	D18699	rPHU68324a95db7e Add a rough HTTP header value parser
rP Phabricator
	D18708	rP819b833607a6 Tweak rate limiting point counts for omnipotent users
	D18703	rP3f53718d107f Modularize rate/connection limits in Phabricator
	D18702	rP9777c66576ff Allow Phabricator to run with "enable_post_data_reading" disabled

Related Objects

Mentioned In: D18962: When exporting more than 1,000 records, export in the background
Mentioned Here: T12816: Setup NAT for the primary Phacility cluster

Event Timeline

epriestley created this task.Oct 10 2017, 6:55 PM

Herald added a subscriber: eadler. · View Herald TranscriptOct 10 2017, 6:55 PM

epriestley added a revision: Restricted Differential Revision.Oct 10 2017, 6:58 PM

epriestley added a commit: Restricted Diffusion Commit.Oct 10 2017, 6:59 PM

epriestley updated the task description. (Show Details)Oct 10 2017, 7:02 PM

epriestley added a project: Ops.Oct 10 2017, 7:04 PM

epriestley added a revision: Restricted Differential Revision.Oct 10 2017, 7:09 PM

epriestley added a commit: Restricted Diffusion Commit.Oct 10 2017, 7:10 PM

epriestley added a revision: D18699: Add a rough HTTP header value parser.Oct 10 2017, 8:23 PM

epriestley added a revision: D18700: Add a rough "multipart/form-data" stream parser.Oct 10 2017, 8:36 PM

epriestley added a revision: D18701: Stream "multipart/form-data" file uploads to disk.Oct 10 2017, 10:17 PM

epriestley added a revision: D18702: Allow Phabricator to run with "enable_post_data_reading" disabled.Oct 10 2017, 10:26 PM

epriestley added a revision: D18703: Modularize rate/connection limits in Phabricator.Oct 11 2017, 11:39 PM

There's also a bug with arc here. It's supposed to limit simultaneous uploads to 4:

$iterator = id(new FutureIterator($futures))->limit(4);

However, callMethod() calls isReady() before returning, starting the future immediately. So we actually execute with full parallelism -- and, worse, we only service 4 of the futures at once, which probably means the others take longer to resolve. I'll fix this.

epriestley added a revision: D18704: Fix a bug which prevented Conduit futures from having parallelism limited effectively.Oct 12 2017, 7:18 PM

epriestley added a commit: rPHU68324a95db7e: Add a rough HTTP header value parser.Oct 12 2017, 7:34 PM

epriestley added a commit: rPHUf46b445bdc5b: Add a rough "multipart/form-data" stream parser.

epriestley added a commit: rPHU260dcbd6b093: Stream "multipart/form-data" file uploads to disk.Oct 12 2017, 7:36 PM

epriestley added a commit: rPHUa6e490ef7ee0: Fix a bug which prevented Conduit futures from having parallelism limited….

epriestley added a commit: rP9777c66576ff: Allow Phabricator to run with "enable_post_data_reading" disabled.Oct 13 2017, 8:11 PM

epriestley added a commit: rP3f53718d107f: Modularize rate/connection limits in Phabricator.

I used ab to verify the rate limiting code here, and got killed after a bit of abuse with ab -c 10 -n 200 ...:

[ec2-user@db001e ~]$ curl -v https://secure.phabricator.com/
...
< HTTP/1.1 429 Too Many Requests
...
TOO MANY REQUESTS
You ("52.53.252.205") are issuing too many requests too quickly.
* Connection #0 to host secure.phabricator.com left intact

I used curl --limit-rate ... to test the connection limit stuff. This one was a little less conclusive -- I was able to trigger the TOO MANY CONNECTIONS limit and I didn't kill secure but the behavior isn't completely clean, but I'm not sure how much of that is nohup, curl, etc., vs our behavior. e.g.

for i in `seq 1 256`; do (nohup curl --limit-rate 64 -F "zero=@zero.txt" https://secure.phabricator.com/ 2>/dev/null &); done

Since I was doing this from a micro instance it also had a tendency to freeze the instance and OOM kill my shell.

But since I wasn't able to kill secure I think this at least mostly works in practice.

On the server side, the connection test looks better: secure001 saw 8 HTTP 500's (this might be a bug in the multipart/form-data parser for whatever cURL is doing) and then 135 HTTP 429s, which is more or less exactly what the code is supposed to do. So I think most of the inconclusiveness on what was happening from the client side was curl / nohup stuff.

(I also verified that real multipart/form-data works fine from a browser, by changing a project's profile picture here.)

It's currently possible for cluster hosts to hit rate limits against admin while doing normal deployment stuff, since they may make a lot of requests to admin very quickly.

There isn't really a great way to filter this traffic. I think I'm just going to whitelist all our external IPs for now, although this is pretty garbage. A cleaner way to do this would probably be to have an internal load balancer for internal traffic (e.g., admin-internal.phacility.com) and whitelist all traffic through it. (Or, if NAT in T12816 happens first, we could whitelist just the outward-facing NAT IP.)

We could/should also reduce how many calls normal deployment things need to make, but sometimes this is tricky. For example, bin/host destroy calls bin/storage destroy for each instance it's destroying, and that makes a call to get instance data. It's hard to pass the instance data down from the parent command without making this call.

epriestley added a revision: D18708: Tweak rate limiting point counts for omnipotent users.Oct 14 2017, 3:00 PM

This is working somewhat better than before, but I'm still able to wedge apache with enough requests using a pre-patch version of libphutil. I'm going to take a more detailed look at local behavior for this specific request pattern.

epriestley added a commit: rP819b833607a6: Tweak rate limiting point counts for omnipotent users.Oct 16 2017, 1:43 PM

I'm going to take a more detailed look at local behavior for this specific request pattern.

I can't get this to wedge locally: apache completes these requests and closes the connections. I'm going to configure mod_status in production to try to get a clearer picture of what's going on.

epriestley added a commit: Restricted Diffusion Commit.Oct 16 2017, 2:23 PM

I'm also unable to wedge secure. mod_status is working there, but since I can't wedge it it isn't very useful.

epriestley added a commit: Restricted Diffusion Commit.Oct 16 2017, 2:36 PM

I was able to get web001 to wedge a lot of requests in "R" ("Reading Request") -- note "SS" is "seconds since beginning of most recent request".

Srv	PID	Acc	M	CPU	SS	Req	Conn	Child	Slot	Client	VHost	Request
...
2-0	31331	0/0/60	R	0.00	105	0	0.0	0.00	0.20	172.30.0.28		
3-0	31370	0/0/43	R	0.00	100	0	0.0	0.00	0.16	172.30.0.28		
4-0	31406	0/0/3	R	0.00	100	0	0.0	0.00	0.00	172.30.0.189		
5-0	31412	0/0/11	R	1.19	99	0	0.0	0.00	0.02	172.30.0.28		
6-0	27404	0/4/4	R	0.49	123	1574	0.0	0.01	0.01	172.30.0.189		
7-0	31121	0/2/5	R	0.20	106	226	0.0	0.00	0.01	172.30.0.189		
8-0	31413	0/0/2	R	0.00	99	0	0.0	0.00	0.00	172.30.0.28		
9-0	31123	0/3/4	R	0.45	99	1337	0.0	0.18	0.18	172.30.0.28		
10-0	31437	0/0/2	R	0.00	99	0	0.0	0.00	0.00	172.30.0.189		
11-0	27804	0/9/9	R	1.23	98	14	0.0	0.02	0.02	172.30.0.28		
12-0	27811	0/2/2	R	0.45	105	247	0.0	0.01	0.01	172.30.0.189		
13-0	27812	0/1/1	R	0.09	124	1121	0.0	0.00	0.00	172.30.0.28		
14-0	27882	0/5/5	R	0.69	98	100	0.0	0.01	0.01	172.30.0.28		
15-0	27893	0/7/7	R	1.11	115	217	0.0	0.01	0.01	172.30.0.189		
16-0	31445	0/0/2	R	0.10	97	0	0.0	0.00	0.00	172.30.0.28		
17-0	27948	0/1/1	R	0.22	118	1546	0.0	0.00	0.00	172.30.0.189		
18-0	27949	0/1/1	R	0.20	116	1275	0.0	0.00	0.00	172.30.0.28	
...

Only one request appears to have reached the access log.

This is pretty puzzling since secure and web should have the same setup and same versions of everything, and slb001 and lb001 have essentially the same configuration, but I don't see this behavior on secure.

It looks like we can use mod_reqtimeout, a builtin module, to tailor this behavior.

epriestley added a commit: Restricted Diffusion Commit.Oct 16 2017, 3:04 PM

epriestley added a commit: Restricted Diffusion Commit.Oct 16 2017, 3:10 PM

The behavior of mod_reqtimeout is weird in production. Some of this is the LB, but some of it I don't have a good explanation for.

Locally, if I telnet localhost 80 from my machine and don't type anything with a 4s timeout configured, I get a timeout after 4s.

If I telnet secure.phabricator.com 80 and don't type anything from my laptop, I get a timeout after about a minute (maybe at the LB).

If I telnet localhost 80from secure001 and don't type anything, I get a timeout after about 30s. This isn't hitting the LB so it doesn't make much sense to me. Best guess: the production version of mod_reqtimeout is a little different from my local version, and doesn't work on requests with no bytes.

If I repeat the tests and type a GET / HTTP/1.0\r\n header but no second newline, I get a timeout after about 4 seconds from the time I hit return if I'm starting from secure001. If I'm starting locally, it takes a minute to timeout. This is probably also the LB. The LB might also have different HTTP vs HTTPS rules/behaviors.

If I use openssl s_client -connect secure.phabricator.com:443 and send a header from secure001, I get a timeout after 60s. Locally, my openssl is busted because of some kind of Apple/SecureTransport/Homebrew mess so who knows.

To the degree that the LB is just holding connections, that's totally fine: they aren't taking apache process slots, and the LB should reasonably be able to handle a very large number of simultaneous connections. So I'm not too worried about this stuff as long as it seems like LB behavior, rather than Apache ignoring the directive. Apache is respecting the directive at least some of the time, so I'm going to send it to production.

Additionally, I'm not totally sure what I captured in mod_status is actually me. Something is making a lot of requests (~10K/requests/day?) that just time out:

$ grep 408 /var/log/apache2/access_log_vhost.log.1 | wc -l
44802

I think mod_reqtimeout will let us deal with this stuff a little more aggressively.

epriestley added a commit: Restricted Diffusion Commit.Oct 16 2017, 3:39 PM

Something is making a lot of requests

The mod_reqtimeout change seems to have helped with these, at least. They're still happening, but getting kicked off after 4 seconds and not consuming 20+ process slots all the time.

I'm also no longer able to wedge the tier with a pre-patch arc upload <hundreds and hundreds of files>. I don't really have a complete picture of how these requests, general slot exhaustion from the ambient connection attempts, and mod_reqtimeout are interacting, but it seems like we're in better shape now.

I'm going to call this tentatively done and leave it open for a week or two to see if there's any fallout.

This made one thing meaningfully worse: the common preamble now has a hard-coded list of all cluster hosts. Not a huge deal, but something we should get rid of eventually.

This has been stable for about a week now.

epriestley mentioned this in D18962: When exporting more than 1,000 records, export in the background.Jan 29 2018, 10:43 PM

Process slot exhaustion in Phacility web tierClosed, ResolvedPublicActions

Description

Revisions and Commits

Related Objects

Event Timeline

Process slot exhaustion in Phacility web tier
Closed, ResolvedPublic
Actions