See PHI1514 and elsewhere. A pattern seen by some hosted installs (and, presumably, some installs in the wild) is that a Build/CI system (usually Jenkins) is configured to generate a fairly high rate of git pull requests (more than 1 per second). The cost to serve this sort of request load is currently relatively high.
The simplest fix is likely to make ssh-auth faster. Currently, ssh-auth loads the full environment and emits every SSH public key, similar to an authorized_keys file. On a secure.phabricator.com host sitting at a 0.05 load average, this takes ~780ms to emit a 580KB response containing 929 keys. With current tooling it's hard to measure exactly how much load this is responsible for, particularly in production environments which aren't isolated, but it's obviously not helping anything. In clustered environments, this also happens twice (once on the receiving node, once on the repo node).
(This generally feels much slower than it "should" be -- loading 600KB of text from cache and emitting it should not take 800ms. Perhaps worth a closer look.)
Relatively recent versions of sshd support passing the public key to the ssh-auth command with %k (the base64-encoded key; %t gives the key type and %f gives only the fingerprint). See T13123. There are some other supported patterns, too. A reasonable first step is to accept the public key with some optional --sshd-public-key %k sort of argument, then emit only the matching key line if we find a match. This should reduce the wire cost dramatically and the runtime cost somewhat.
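A minimal sketch of the "emit only the matching key line" step, written in Python for illustration rather than as the real ssh-auth code. It assumes sshd supplies the key type and base64 body (the %t and %k tokens of AuthorizedKeysCommand) and that each emitted line follows the usual authorized_keys layout: optional options, key type, base64 body, optional comment. The function names and sample lines are hypothetical.

```python
def split_fields(line):
    """Split a key line on whitespace, keeping double-quoted runs
    (such as command="...") together as part of one field."""
    fields, current, in_quotes, prev = [], [], False, ""
    for ch in line.strip():
        if ch == '"' and prev != "\\":
            in_quotes = not in_quotes
        if ch.isspace() and not in_quotes:
            if current:
                fields.append("".join(current))
                current = []
        else:
            current.append(ch)
        prev = ch
    if current:
        fields.append("".join(current))
    return fields


def matching_key_lines(lines, key_type, key_body):
    """Return only the lines whose key type and base64 body match.

    The key type sits at field 0 (no options) or field 1 (options present),
    so only those two positions are checked.
    """
    matches = []
    for line in lines:
        fields = split_fields(line)
        for i in (0, 1):
            if fields[i:i + 2] == [key_type, key_body]:
                matches.append(line)
                break
    return matches


# Hypothetical sample output, shaped like authorized_keys lines.
sample = [
    'command="/bin/ssh-exec --user alice",no-pty ssh-ed25519 AAAAC3alice alice@laptop',
    'command="/bin/ssh-exec --user bob",no-pty ssh-rsa AAAAB3bob bob@desktop',
]
print("\n".join(matching_key_lines(sample, "ssh-rsa", "AAAAB3bob")))
```

This only filters the output after all keys have been loaded, so it would mostly address the wire cost. Reducing the runtime cost would additionally require looking the key up directly rather than loading everything first.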
To improve beyond that, we may need to turn the thing into a service process, which is a mess with current tooling, but not beyond the realm of possibility. Any request that passes ssh-auth goes on to hit ssh-exec anyway, which must also pay the library cost.
It would be nice to start by coming up with some clearer measure of end-to-end costs, but this is broadly difficult. A reasonable proxy is perhaps git pull cost for phabricator/ against secure.phabricator.com with no changes to fetch, which currently takes between ~3,500ms and ~4,500ms.
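One rough way to collect that proxy measurement repeatably is a small timing helper like the sketch below. The helper names and the run count are assumptions for illustration, and wall-clock time on the client includes network latency, so the numbers are only comparable when taken from the same machine.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=5, cwd=None):
    """Run argv `runs` times and return wall-clock durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            cwd=cwd,
            check=True,  # fail loudly if the command itself fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example (assumes a local checkout in ./phabricator with nothing to fetch):
#   print(summarize(time_command(["git", "pull"], cwd="phabricator")))
```

Taking the median over several runs should make it easier to tell a real improvement apart from the ~1,000ms spread already seen between pulls.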
My expectation is that this is approximately:
- ~800ms ssh-auth on the receiving host.
- ~800ms ssh-auth on the repository host.
- ~500ms (?) ssh-exec/library overhead on the repository host.
- ~1,500-2,500ms actual git stuff? Seems sort of higher than I'd expect, so maybe some other pieces are more expensive than I think.
- Maybe some SSH overhead in the middle we can fix with ControlMaster auto stuff?
Of these, the ControlMaster stuff is easiest to test, so maybe that's actually the better place to start.
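A quick way to try this from the client side, without touching ~/.ssh/config, is to point git at an ssh command with connection sharing enabled through the GIT_SSH_COMMAND environment variable. This is a sketch only: the helper name, the control directory, and the 60 second persist time are assumptions, and the control directory must already exist and contain no spaces.

```python
import os


def control_master_env(control_dir, persist_seconds=60, base_env=None):
    """Return a copy of the environment in which git's ssh connections
    are multiplexed over a shared master connection."""
    env = dict(os.environ if base_env is None else base_env)
    # %r, %h and %p expand to remote user, host and port, so each
    # destination gets its own control socket.
    control_path = f"{control_dir}/cm-%r@%h:%p"
    env["GIT_SSH_COMMAND"] = (
        "ssh -o ControlMaster=auto "
        f"-o ControlPath={control_path} "
        f"-o ControlPersist={persist_seconds}"
    )
    return env


# Example (hypothetical control directory):
#   import subprocess
#   subprocess.run(["git", "pull"], env=control_master_env("/tmp/ssh-cm"))
```

With this in place, the first pull opens the master connection and later pulls within the persist window reuse it instead of negotiating and authenticating a new one, so comparing the first and second pull times gives a rough idea of how much of the total is connection setup.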