
Design (and possibly implement) production rate limiting controls
Open, Normal, Public

Description

See T12605. I'm going to sleep on this, but we should have a better technical plan to tackle these:

  • API call rate limiting.
  • Generic HTTP request rate limiting.
  • SSH rate limiting.

Revisions and Commits

  • Restricted Differential Revision
  • Restricted Differential Revision
  • rP Phabricator: D17758

Event Timeline

epriestley created this object in space Restricted Space.
epriestley created this object with edit policy "All Users".

I'm going to kick this out of the ops space since none of this should be sensitive.

epriestley shifted this object from the Restricted Space space to the S1 Core space. Apr 21 2017, 12:49 PM
epriestley lowered the priority of this task from High to Normal. Apr 21 2017, 12:49 PM

Here's what we currently have:


SystemAction: The PhabricatorSystemAction system provides security/abuse-related rate limiting: adding email addresses, triggering outbound HTTP requests, trying MFA. This system has been in production for a long time and I believe it works well.

Currently, all of the things it limits are low-volume -- on the order of 10 actions per hour. The underlying implementation is a table which adds one row for each action, then uses a sliding window to count recent actions.

This design isn't necessarily great for high-volume actions (e.g., a reasonable API rate limit might be 6000/hour) since we'll end up writing and reading one row per action. However, no part of it should fail catastrophically at that volume. Each action is also already allowed to have a point value other than "1", so we could let higher-volume actions do some bucketing like this:

if (the user took an action in the last minute) {
  add more points to that action;
} else {
  add a new action row;
}

That would cap the number of rows written per user action with only a tiny increase in complexity, if we ever run into scale issues with "one row per action".
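As a concrete sketch, assuming a hypothetical action table with (actor, action, score, epoch) columns (illustrative only, not the real SystemAction schema), the bucketed write and the sliding-window read might look like:

<?php
// Hypothetical schema: action_log(actor, action, score, epoch).
// MySQL-flavored SQL; $db is an open PDO connection.
function recordAction(PDO $db, $actor, $action, $points, $bucket = 60) {
  // Try to fold this action into a row written within the last bucket.
  $update = $db->prepare(
    'UPDATE action_log SET score = score + ?
      WHERE actor = ? AND action = ? AND epoch > ?
      ORDER BY epoch DESC LIMIT 1');
  $update->execute(array($points, $actor, $action, time() - $bucket));

  if (!$update->rowCount()) {
    // No recent row: add a new one, as the system does today.
    $insert = $db->prepare(
      'INSERT INTO action_log (actor, action, score, epoch)
        VALUES (?, ?, ?, ?)');
    $insert->execute(array($actor, $action, $points, time()));
  }
}

function scoreInWindow(PDO $db, $actor, $action, $window) {
  // Sliding window: sum the scores of all rows inside the window.
  $select = $db->prepare(
    'SELECT COALESCE(SUM(score), 0) FROM action_log
      WHERE actor = ? AND action = ? AND epoch > ?');
  $select->execute(array($actor, $action, time() - $window));
  return (int)$select->fetchColumn();
}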

I think this system is generally appropriate for limiting authenticated SSH and API calls, which can probably just have a global per-user limit (X/hour for each user).

(I mention SSH because we have one install with a user who seems to pretty much run git fetch in a loop on a big pile of repositories, and routinely has ~10 requests in flight simultaneously.)

A reasonable plan for API/SSH is to use SystemAction to put sensible rate limits in place, then monitor it to see whether we need the bucketing adjustment above.
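As a very rough sketch of what that wiring might look like (the class and method names below are assumptions about the SystemAction API, not verified against the actual codebase):

<?php
// Hypothetical action subclass; names and signatures are assumptions.
final class HypotheticalApiCallAction extends PhabricatorSystemAction {
  const TYPECONST = 'api.call';

  public function getActionConstant() {
    return self::TYPECONST;
  }

  public function getScoreThreshold() {
    // A global per-user limit on the order of 6000 calls per hour.
    return 6000 / phutil_units('1 hour in seconds');
  }
}

// Near the top of API/SSH request handling: charge one point, and
// reject the request if the user is over the limit.
PhabricatorSystemActionEngine::willTakeAction(
  array($viewer_phid),
  new HypotheticalApiCallAction(),
  1);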

Note that we've never (I think?) seen API abuse, and no user would ever have hit a 6000/hour API rate limit. The only case I can remember is a Slack bot/integration that polled Phriction really aggressively, but even then no individual user would have reached a ~6K/hour limit:

https://twitter.com/evanpriestley/status/604998874135068672

We've never seen real SSH abuse, either. That one guy could probably tune his stuff down a bit, but it isn't currently a major issue.


PhabricatorStartup::setMaximumRate(): This is raw HTTP request rate limiting based on remote address. The original intent was to deal with security scanners (e.g., HackerOne researchers running scanning tools against this install) and possibly crawlers.

We built a mechanism for this, but ultimately disabled it in D8736. Previously, see D8713 and T3923. I think we basically ended up IP-banning everyone who was doing this; that solved the scanner problem, and the mechanism was never revisited. From T3923, it looks like we continued to see some bad crawler behavior from Baidu that wasn't traceable to a single IP address.

This mechanism tries to act as early as possible in the stack -- before we load libraries or take other startup actions -- because the cost to the attacker per request is very low, but the cost to us to pass through PhabricatorStartup::loadCoreLibraries() is comparatively high (roughly 6ms), and even higher if we also pass through environment initialization (roughly 20ms in the very best case). Here are the numbers on my local machine:

Reject Request After...                 | ab -c 25 -n 1000, 90th Percentile
preamble.php                            | 1ms
PhabricatorStartup::loadCoreLibraries() | 9ms
startup.done                            | 75ms

  • In preamble.php, we basically just have APC: no libraries, no database connections.
  • After loadCoreLibraries(), we have libraries but no database connections.
  • After startup.done, we have database connections.

So if we can reject this traffic without libraries, we can handle something on the order of 75x more of it than if we need to set up a full environment with configuration and database connections.
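A minimal sketch of that kind of preamble-stage check, using nothing but APCu (the key scheme, window, and limit here are illustrative, not the actual setMaximumRate() implementation):

<?php
// In preamble.php we have APC only: no libraries, no database.
// Count requests per client in one-minute buckets.
$client = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-';
$bucket_key = 'rate:'.$client.':'.((int)(time() / 60));

// Create the bucket if it doesn't exist yet, with a TTL a little
// longer than the bucket itself so stale counters expire on their own.
apcu_add($bucket_key, 0, 120);
$count = apcu_inc($bucket_key);

if ($count !== false && $count > 600) {
  // Reject before loading libraries: ~1ms per request instead of ~75ms.
  header('HTTP/1.0 429 Too Many Requests');
  exit(1);
}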

I believe this system basically works fine, it just didn't really work well against threats we were actually seeing (specific bad actors were easier to deal with using IP bans at the time, and IP-based limiting didn't help with bad crawlers).

Another issue is that because this system only uses APC, the number of requests that users are allowed to make scales with the size of the web tier: each web node only knows about traffic it has seen itself. But I think this is basically fine. It only gets us in trouble if we have 100 attackers each using 1% of the tier's resources (just enough not to trip the rate limiting), but if that traffic pattern is a botnet DoS they can just make more requests, and if it's an ambitious crawler we should be able to absorb it.

We could also combine the APC mechanism with a database mechanism to resist this kind of attack better: when we assign points for a request, we've loaded libraries and done database stuff. So we can add logic like:

if (a whole lot of nodes have seen requests from this client recently) {
  badness_points = badness_points * 100;
}

Then each web node would ban them after a single request.
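A sketch of that adjustment, assuming a small shared table where each node records the clients it has recently charged (schema, names, and thresholds are illustrative):

<?php
// Hypothetical table: client_sightings(client, node, epoch).
function adjustedPoints(PDO $db, $client, $node, $base_points) {
  // Record that this node saw this client.
  $db->prepare(
    'INSERT INTO client_sightings (client, node, epoch)
      VALUES (?, ?, ?)')
    ->execute(array($client, $node, time()));

  // How many distinct nodes saw this client in the last minute?
  $select = $db->prepare(
    'SELECT COUNT(DISTINCT node) FROM client_sightings
      WHERE client = ? AND epoch > ?');
  $select->execute(array($client, time() - 60));

  if ((int)$select->fetchColumn() > 8) {
    // The client is spraying the whole tier: amplify the local charge
    // so each node bans them after a single request.
    return $base_points * 100;
  }
  return $base_points;
}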

An unexplored but potentially reasonable adjustment is to assign points based on request duration (as a cheap proxy for the platonic ideal of "resource cost" / "request weight"). Some of the crawler badness we see is crawlers that ignore robots.txt getting stuck in Diffusion blaming every line of every file. These requests are comparatively expensive.
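A sketch of duration-weighted scoring (the one-point-per-100ms conversion is an arbitrary assumption):

<?php
// At request start:
$request_start = microtime(true);

// ... serve the request ...

// At request end, charge points in proportion to wall-clock time, so
// an expensive Diffusion blame costs far more than a cheap page view.
$ms = (microtime(true) - $request_start) * 1000;
$points = max(1, (int)ceil($ms / 100));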

For Phacility, this system needs to deal with 'X-Forwarded-For'.
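A sketch of that handling, under the assumption that requests only reach the web tier through a trusted load balancer which appends the real client address as the last 'X-Forwarded-For' entry:

<?php
// Only trust X-Forwarded-For when the direct peer is our own load
// balancer; otherwise any client could spoof an arbitrary address.
$trusted_lbs = array('10.0.0.1'); // illustrative LB address
$peer = $_SERVER['REMOTE_ADDR'];

$client = $peer;
if (in_array($peer, $trusted_lbs, true) &&
    isset($_SERVER['HTTP_X_FORWARDED_FOR'])) {
  // Our LB appends the client it saw, so take the last entry; earlier
  // entries are client-controlled and can't be trusted.
  $hops = explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']);
  $client = trim(end($hops));
}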

(We may also run into issues where an entire install is NAT'd? I think these remained theoretical at the time of T3923, but we'll probably run into them eventually, although I think this is less common than it was in the era of everyone in Qatar sharing a single AOL account.)


Path forward here is probably:

  • Deploy PhabricatorStartup::setMaximumRate() for secure.phabricator.com.
  • Keep an eye on it for a bit.
  • Once it feels stable, put it into production in the primary cluster.
  • Deploy PhabricatorSystemAction for API calls.
  • Deploy PhabricatorSystemAction for SSH calls.

Then I think we run into these issues:

  • User-facing config for system action rate limits? (Do we need per-user, etc? For installs that want to limit random engineers but give build agents a higher limit?)
  • Raw SSH calls don't hit a pre-auth/pre-library/pre-db rate limit -- tie them into PhabricatorStartup if we hit issues?
  • In the cluster, per-instance stuff if we have one instance that's NAT'd and needs a higher limit? Just HTTP_HOST them in preamble.php for now?
epriestley added a revision: Restricted Differential Revision. Apr 21 2017, 9:20 PM
epriestley added a revision: Restricted Differential Revision. Apr 21 2017, 9:24 PM
epriestley added a commit: Restricted Diffusion Commit. Apr 22 2017, 4:30 PM
epriestley added a commit: Restricted Diffusion Commit.

In PHI120, we hit an issue where an instance (whose members seem to be located in Asia) performed 1,600 file uploads at roughly 256KB/sec, apparently putting a substantial amount of pressure on workers for ~12 minutes.

Rate limiting would not necessarily prevent this, because:

  • The current default rate limit for logged-in users kicks in when a user exceeds 150 requests per minute, sustained over a period of 5 minutes. However, each server only saw ~33 requests/minute in this case, roughly 5x below the limit.
  • Even if the limit was different, these requests were a problem because of their long duration, not because of a high request rate. Current rate limiting is primarily trying to defend against high request rates.
  • Rate limiting wouldn't activate until POST data had been read; it can't act any earlier until we make the changes described in T4369 to run application code before reading POST data.

We could change the way rate limiting works to also count opened and closed connections, maybe over a longer period of time. If you opened X more connections than you closed in a given period, we'd limit you based on open connections. This would tend to self-heal a little better than an explicit lock/slot approach. However, this is moot without fixing the startup sequence in T4369.
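A sketch of that accounting, per node in APCu (the threshold is illustrative, and per the above this can't actually run early enough until T4369 fixes the startup sequence):

<?php
// Track connections this client currently has open on this node:
// increment on request start, decrement on shutdown. The counter is
// self-healing: a closed connection always gives its slot back.
$open_key = 'open:'.$client; // $client derived from REMOTE_ADDR/XFF

apcu_add($open_key, 0);
$open = apcu_inc($open_key);

register_shutdown_function(function () use ($open_key) {
  // Runs even if the request fails partway through.
  apcu_dec($open_key);
});

if ($open !== false && $open > 25) {
  header('HTTP/1.0 429 Too Many Requests');
  exit(1);
}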