Page MenuHomePhabricator

Improve performance of "ssh-auth"
Open, NormalPublic

Description

See PHI1514 and elsewhere. A pattern seen by some hosted installs (and, presumably, some installs in the wild) is that a Build/CI system (usually Jenkins) is configured to generate a fairly high pull request rate (>1 second). The cost to serve this sort of request load is currently relatively high.

The simplest fix is likely to make ssh-auth faster. Currently, ssh-auth loads the full environment and emits every SSH public key, similar to a .authorized_keys file. On a secure.phabricator.com host sitting at a 0.05 load average, this takes ~780ms to emit a 580KB response containing 929 keys. With current tooling it's hard to measure exactly how much load this is responsible for, particularly in production environments which aren't isolated, but it's obviously not helping anything. In clustered environments, this also happens twice (once on the receiving node, once on the repo node).

(This generally feels much slower than it "should" be -- loading 600KB of text from cache and emitting it should not take 800ms. Perhaps worth a closer look.)

Relatively recent versions of sshd support passing the public key to the ssh-auth command with %f. See T13123. There are some other supported patterns, too. A reasonable first step is to accept the public key with some optional --sshd-public-key %f sort of argument, then emit only the matching key line if we find a match. This should reduce the wire cost dramatically and the runtime cost somewhat.

To improve beyond that, we may need to turn the thing into a service process, which is a mess with current tooling, but not beyond the realm of possibility. Passing ssh-auth hits ssh-exec anyway, which must also pay the library cost.

It would be nice to start by coming up with some clearer measure of end-to-end costs, but this is broadly difficult. A reasonable proxy is perhaps git pull cost for phabricator/ against secure.phabricator.com with no changes to fetch, which currently takes between ~3,500ms and ~4,500ms.

My expectation is that this is approximately:

  • ~800ms ssh-auth on the receiving host.
  • ~800ms ssh-auth on the repository host.
  • ~500ms (?) ssh-exec/library overhead on the repository host.
  • ~1,500-2,500ms actual git stuff? Seems sort of higher than I'd expect, so maybe some other pieces are more expensive than I think.
  • Maybe some SSH overhead in the middle we can fix with ControlMaster auto stuff?

Of these, the ControlMaster stuff is easiest to test, so maybe that's actually the better place to start.

Event Timeline

epriestley triaged this task as Normal priority.Oct 28 2019, 11:00 PM
epriestley created this task.
epriestley added a commit: Restricted Diffusion Commit.Oct 28 2019, 11:37 PM

I configured ControlMaster on secure.

For kicks, I configured it locally too. And I swapped to git fetch origin master, which is slightly more reliable as a test case than git pull.

This is now taking less than a second pretty consistently, which is (a) completely ridiculous and (b) should be impossible?

I claim it "should be impossible" because we should need to ssh-auth twice (once on the receiving host, once on the repository host) and the cost of either invocation is approximately 800ms, so the absolute minimum possible cost should be ~1,600ms. But the actual cost is pretty consistently 700-800ms. sshd must be cheating by skipping ssh-auth. That's sort of okay, probably, although not great.

This also seems to have introduced sporadic hangs (???).

This at least suggests that the total cost of whatever ssh is doing plus ssh-auth is pretty high, on the order of 2,500-3,500ms.

Since the hangs are much worse than slow-ish get pull I'm going to back all of this stuff out (and ControlMaster vs non-ControlMaster connections make measuring things harder), but hopefully the impact here isn't because ssh is spending 1-2s doing absolutely nothing for no reason.

Elsewhere, we can bounce through bastion, run a PHP driver script, bounce to secure, run a command, and clean everything up in ~300ms:

$ time ssh -t -o 'LogLevel=ERROR' -o 'StrictHostKeyChecking=no' -o 'UserKnownHostsFile=/dev/null' bastion-external.phacility.net -- 'sudo -E -n -u ubuntu -- /core/bin/bastion ssh --host secure001.phacility.net -- ls'
README	bak  bin  conf	data  lib  log	repo  run  scripts  swap  tmp
Connection to secure001.phacility.net closed.
Connection to bastion-external.phacility.net closed.

real	0m0.304s
user	0m0.017s
sys	0m0.007s

...so I'm hopeful that's on the order of the real SSH overhead.

What a mess.

See also T10416, but I saw things hang during deploy and the OpenSSH bug should have been fixed years ago. Maybe this is just haunting, but ControlMaster feels like it's opening the door to a lot of very unfun problems.

epriestley added a comment.EditedOct 29 2019, 12:20 AM

I saw things hang during deploy and the OpenSSH bug should have been fixed years ago.

Oh, this is probably just our old production Ubuntu:

secure001 $ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.13, OpenSSL 1.0.1f 6 Jan 2014

The real fix was supposedly 7.3p1 (per https://bugzilla.mindrot.org/show_bug.cgi?id=1988#c12), so maybe an initial condition for ControlMaster being plausibly configurable in the cluster is an ssh newer than that.

epriestley added a commit: Restricted Diffusion Commit.Oct 29 2019, 1:10 AM

I deployed the --sshd-key %k stuff to secure and it looks like that shaved ~1,000-2000ms off the total cost. The ssh-auth cost has dropped to about 200ms:

$ time php /core/lib/phabricator/bin/ssh-auth secure --sshd-key AAAAB3NzaC1yc2EAAAADAQABAAACAQCv...
real	0m0.172s
user	0m0.131s
sys	0m0.040s

There's still a lot of variance between requests, but I can get time git fetch origin master to return in ~1,700ms somewhat consistently now. This is at least some kind of step forward.