See T12605. I'm going to sleep on this, but we should have a better technical plan to tackle these:
- API call rate limiting.
- Generic HTTP request rate limiting.
- SSH rate limiting.
Here's what we currently have:
**SystemAction**: The PhabricatorSystemAction system provides security/abuse-related rate limiting: adding email addresses, triggering outbound HTTP requests, trying MFA. This system has been in production for a long time and I believe it works well.
Currently, all of the things it limits are low-volume -- on the order of 10 actions per hour. The underlying implementation is a table which adds one row for each action, then uses a sliding window to count recent actions.
This design isn't necessarily great for high-volume actions (e.g., a reasonable API rate limit might be 6000/hour) since we'll end up writing and reading one row per action. However, no part of it should fail catastrophically at that volume. Each action is also already allowed to have a point value other than "1", so we could let higher-volume actions do some bucketing like this:
    if (the user took an action in the last minute) {
      add more points to that action;
    } else {
      add a new action row;
    }
That would let us control how many rows we write per user action, at only a tiny increase in complexity, if we run into scale issues with "one row per action".
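A minimal sketch of that bucketing, assuming a simplified points table (the table name, columns, and helpers here are hypothetical, not the real PhabricatorSystemAction schema or API):

```php
<?php
// Hypothetical sketch of the bucketing adjustment: fold high-volume actions
// into the most recent row instead of writing one row per action.

function record_action(PDO $db, $actor, $action, $points, $bucket = 60) {
  $now = time();

  // If this actor already has a row for this action inside the current
  // bucket, just add points to it...
  $update = $db->prepare(
    'UPDATE system_action_log
        SET score = score + :points
      WHERE actor = :actor AND action = :action AND epoch >= :cutoff
      ORDER BY epoch DESC
      LIMIT 1');
  $update->execute([
    ':points' => $points,
    ':actor'  => $actor,
    ':action' => $action,
    ':cutoff' => $now - $bucket,
  ]);

  // ...otherwise, fall back to the current behavior: one new row per action.
  if ($update->rowCount() === 0) {
    $insert = $db->prepare(
      'INSERT INTO system_action_log (actor, action, score, epoch)
       VALUES (:actor, :action, :points, :epoch)');
    $insert->execute([
      ':actor'  => $actor,
      ':action' => $action,
      ':points' => $points,
      ':epoch'  => $now,
    ]);
  }
}

function is_rate_limited(PDO $db, $actor, $action, $limit, $window = 3600) {
  // Sliding window: sum the points this actor accrued recently and compare
  // against the limit (e.g., 6000/hour for API calls).
  $query = $db->prepare(
    'SELECT COALESCE(SUM(score), 0) FROM system_action_log
      WHERE actor = :actor AND action = :action AND epoch >= :cutoff');
  $query->execute([
    ':actor'  => $actor,
    ':action' => $action,
    ':cutoff' => time() - $window,
  ]);
  return (int)$query->fetchColumn() >= $limit;
}
```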
I think this system is generally appropriate for limiting authenticated SSH and API calls, which I think can just have a global user limit (X/hour for each user).
(I mention SSH because we have one install with a user who seems to pretty much run git fetch in a loop on a big pile of repositories, and routinely has ~10 requests in flight simultaneously.)
I think a reasonable plan for API/SSH is to just use SystemAction to put reasonable rate limits in place, and then monitor it to see if we want to add the bucketing adjustment above.
Note that we've never (I think?) seen API abuse and no user would ever have hit a 6000/hour API rate limit. The only case of this that I can remember is that Slack once wrote a bot/integration which just polled Phriction really aggressively, but no individual user would have reached a ~6K/hour limit:
https://twitter.com/evanpriestley/status/604998874135068672
We've never seen real SSH abuse, either. That one guy could probably tune his stuff down a bit, but it isn't currently a major issue.
**PhabricatorStartup::setMaximumRate()**: This is raw HTTP request rate limiting based on remote address. The original intent was to deal with security scanners (e.g., HackerOne researchers running scanning tools against this install) and possibly crawlers.
We built a mechanism for this, but ultimately disabled it in D8736. Previously, see D8713 and T3923. I think we basically ended up just IP-banning everyone who was doing this; that solved the scanner problem, and the mechanism was never revisited. From T3923, it looks like we continued to see some bad crawler behavior from Baidu that wasn't traceable to a single IP address.
This mechanism tries to act as early as possible in the stack -- before we load libraries or take other startup actions -- because the cost to the attacker per request is very low but the cost to us to pass through PhabricatorStartup::loadCoreLibraries() is comparatively high (roughly ~6ms), and even higher if we also pass through environment initialization (roughly ~20ms in the very best case). Here are the numbers on my local machine:
| Reject Request After... | `ab -c 25 -n 1000` 90th percentile |
|---|---|
| preamble.php | 1ms |
| PhabricatorStartup::loadCoreLibraries() | 9ms |
| startup.done | 75ms |
- In preamble.php, we basically just have APC: no libraries, no database connections.
- After loadCoreLibraries(), we have libraries but no database connections.
- After startup.done, we have database connections.
So if we can reject this traffic without libraries, we can handle something on the order of 75x more of it than if we need to set up a full environment with configuration and database connections.
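To give a flavor of what can run this early, here's a minimal sketch of an APCu-only, per-remote-address limiter suitable for preamble.php; the key names, bucket size, and limit are assumptions, not the actual setMaximumRate() implementation:

```php
<?php
// Minimal sketch of an APCu-only, per-remote-address limiter that could run
// in preamble.php, before libraries load. Key names, bucket size, and the
// limit are illustrative, not the real PhabricatorStartup::setMaximumRate().

function preamble_reject_if_rate_limited($limit_per_minute) {
  if (!function_exists('apcu_inc')) {
    // Without APCu we have no cheap shared state this early; do nothing.
    return;
  }

  $address = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-';

  // Count requests in one-minute buckets keyed on the remote address. The
  // TTL lets stale buckets expire on their own.
  $bucket = (int)(time() / 60);
  $key = "rate:{$address}:{$bucket}";
  apcu_add($key, 0, 120);
  $count = apcu_inc($key);

  if ($count > $limit_per_minute) {
    // Fail as cheaply as possible: no libraries, no database connections.
    http_response_code(429);
    header('Retry-After: 60');
    echo "Too many requests.\n";
    exit(1);
  }
}

preamble_reject_if_rate_limited(600);
```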
I believe this system basically works fine, it just didn't really work well against threats we were actually seeing (specific bad actors were easier to deal with using IP bans at the time, and IP-based limiting didn't help with bad crawlers).
Another issue is that, because this system only uses APC, the number of requests a user is allowed to make scales with the size of the web tier: each web node only knows about traffic that it has seen. But I think this is basically fine. It only gets us in trouble if we have 100 attackers each using 1% of the tier's resources (just enough to not trip the rate limiting), but if that traffic pattern is from a botnet DoS they can just make more requests, and if it's from an ambitious crawler we should be able to absorb it.
We could also combine the APC mechanism with a database mechanism to resist this kind of attack better: when we assign points for a request, we've loaded libraries and done database stuff. So we can add logic like:
    if (a whole lot of nodes have seen requests from this client recently) {
      badness_points = badness_points * 100;
    }
Then each web node would ban them after a single request.
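Sketched very roughly (the table, threshold, and multiplier here are made up for illustration; by this point in the request we already have a database connection):

```php
<?php
// Hypothetical sketch: once we've loaded libraries and have a database,
// scale a client's "badness" points by how many distinct web nodes have
// seen it recently. Table, columns, threshold, and multiplier are made up.

function adjust_badness_points(PDO $db, $client, $node_id, $points) {
  $now = time();
  $cutoff = $now - 300;

  // Record that this node has seen a request from the client.
  $insert = $db->prepare(
    'INSERT INTO client_sightings (client, node, epoch)
     VALUES (:client, :node, :epoch)');
  $insert->execute([
    ':client' => $client,
    ':node'   => $node_id,
    ':epoch'  => $now,
  ]);

  // Count how many distinct nodes saw this client in the last few minutes.
  $query = $db->prepare(
    'SELECT COUNT(DISTINCT node) FROM client_sightings
      WHERE client = :client AND epoch >= :cutoff');
  $query->execute([':client' => $client, ':cutoff' => $cutoff]);
  $nodes = (int)$query->fetchColumn();

  // If a whole lot of nodes have seen this client, inflate its points so
  // each node's local APC limiter trips almost immediately.
  if ($nodes > 16) {
    $points = $points * 100;
  }

  return $points;
}
```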
An unexplored but potentially reasonable adjustment is to assign points based on request duration (as a cheap proxy for the platonic ideal of "resource cost" / "request weight"). Some of the crawler badness we see is crawlers that ignore robots.txt getting stuck in Diffusion blaming every line of every file. These requests are comparatively expensive.
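As a sketch, the charge might look something like this (the conversion rate and the recording helper are placeholders, not an existing mechanism):

```php
<?php
// Hypothetical sketch: charge points in proportion to wall-clock duration,
// as a cheap proxy for resource cost. The conversion rate is arbitrary.

register_shutdown_function(function () {
  $start = $_SERVER['REQUEST_TIME_FLOAT'];
  $duration_ms = (microtime(true) - $start) * 1000;

  // One point per request, plus one point per 100ms, so a multi-second
  // Diffusion blame page costs an order of magnitude more than a cheap page.
  $points = 1 + (int)($duration_ms / 100);

  // "record_points_for_client()" stands in for whatever mechanism actually
  // stores points (APC bucket, SystemAction row, etc.).
  // record_points_for_client($_SERVER['REMOTE_ADDR'], $points);
});
```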
For Phacility, this system needs to deal with `X-Forwarded-For`.
(We may also run into issues where an entire install is NAT'd? I think these remained theoretical at the time of T3923, but we'll probably run into them eventually, although I think this is less common than it was in the era of everyone in Qatar sharing a single AOL account.)
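For example, assuming requests arrive through exactly one trusted load balancer that appends to the header (that trust policy is an assumption for illustration, not necessarily how Phacility is configured):

```php
<?php
// Hypothetical sketch of resolving a client address behind one trusted
// load balancer that appends to X-Forwarded-For.

function resolve_client_address(array $server) {
  if (empty($server['HTTP_X_FORWARDED_FOR'])) {
    return $server['REMOTE_ADDR'];
  }

  // Each proxy appends the address it saw, so the header is a
  // comma-separated list. Only the final entry was added by our own load
  // balancer; everything earlier is client-supplied and can be forged.
  $hops = array_map('trim', explode(',', $server['HTTP_X_FORWARDED_FOR']));
  return end($hops);
}
```

We'd then rate limit on the resolved address rather than REMOTE_ADDR, which behind a load balancer is just the balancer itself.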
Path forward here is probably:
Then I think we run into these issues:
In PHI120, we hit an issue where an instance (whose members seem to be located in Asia) performed 1,600 file uploads at roughly 256KB/sec, apparently putting a substantial amount of pressure on workers for ~12 minutes.
Rate limiting would not necessarily prevent this, because:
We could change the way rate limiting works to also count opened and closed connections, maybe over a longer period of time. If you opened X more connections than you closed in a given period, we'd limit you based on open connections. This would tend to self-heal a little better than an explicit lock/slot approach. However, this is moot without fixing the startup sequence in T4369.
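A rough sketch of that, again with APCu only (key names, TTL, and the outstanding-connection threshold are placeholders):

```php
<?php
// Hypothetical sketch: track opened vs. closed connections per client and
// limit on the outstanding delta, so the limiter self-heals as connections
// complete.

function connection_opened($client, $max_outstanding = 10) {
  $opened_key = "conn:{$client}:opened";
  $closed_key = "conn:{$client}:closed";

  // Give the counters a long-ish TTL so they eventually reset even if we
  // miss some "closed" events (for example, if a worker dies mid-request).
  apcu_add($opened_key, 0, 3600);
  apcu_add($closed_key, 0, 3600);

  $outstanding = apcu_inc($opened_key) - (int)apcu_fetch($closed_key);

  if ($outstanding > $max_outstanding) {
    // Too many connections in flight from this client: reject this one and
    // immediately count it as closed again.
    apcu_inc($closed_key);
    return false;
  }

  return true;
}

function connection_closed($client) {
  apcu_inc("conn:{$client}:closed");
}
```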