Page MenuHomePhabricator

7b8d38cd causes hangs with SSH master connection
Closed, WontfixPublic

Description

7b8d38cd interacts poorly with OpenSSH's ControlMaster / ControlPersist configuration. These set up an SSH master connection automatically on the first SSH connection to a host, and multiplex future connections over it. However, due to a bug, the spawned background process inherits but does not close STDERR. With the change in 7b8d38cd, this now causes the first arc land of the day to hang, as it waits for the backgrounded process to close STDERR, despite the main SSH process having completed. The subsequent arc land attempt finds the still-running background master, and completes successfully.

OpenSSH 6.3 claimed to have fixed the bug, but failed to do so in actuality. While the bug is certainly the fault of the OpenSSH implementation, the effect it has with this change is quite poor. Perhaps a flag to ignore the EOF check on STDERR would be an acceptable work-around?

Event Timeline

The benefit of enabling ControlPersist is only very slightly faster SSH connections, right?

I'm not inclined to work around this issue in the Phabricator upstream since disabling the option until it is properly implemented in OpenSSH or commenting out the feof() check locally seem like reasonable workarounds.

We haven't seen other reports of this, and I'd guess users configuring ControlPersist are generally experts who can handily manage a small local patch. If we did add an arc land --i-am-using-controlpersist-on-a-buggy-openssh-but-trust-me-i-know-what-im-doing-and-dont-wait-for-stderr it could potentially mean --break-things-in-a-bad-mysterious-way in some contexts, and I suspect users who are sophisticated enough to distinguish between those cases can also slap some /* ... */ around the feof(). See also T8227.

If this is breaking everyone because you have some default configuration which enables it, maybe consider disabling the option globally until it works correctly?

Using SSH master connections has two advantages:

  1. Connections are indeed faster. In a minimal case (SSH'ing to an adjacent machine on the same network), it reduces the time for a no-op SSH connection from 0.14s to 0.01s. The utility of this grows as one needs to tunnel SSH connections in order to access a machine; for instance, SSH'ing to our Phabricator instance requires bouncing through a bastion host; using SSH master connections reduces the time to connect from ~0.7s to 0.01s. This is notable since arc may produce mulitiple git commands which attempt to fetch from upstream.
  2. Authentication information is preserved. If the connection does two-factor authentication, for instance, the user may be supplied with an authentication prompt when the master is set up. Re-using the master preserves this authenticated session, and reduces the amount of re-authorization that the user must do.

ControlPersist is not set in a centralized way, merely part of our suggested configuration, due to the above advantages. The suggestion was here was not to make arc take an additional option, but rather to make ExecFuture take one, such that only one callsite in arc need be patched.

SSH master connections are a fairly common configuration; I suspect that you've not seen other reports because we use closer to the bleeding edge than most folks. We've simply reverted 7b8d38cd locally, due to arc land being a much more common operation than git clone, and not having experienced the error that was addressing. However, I wanted to at least report the potential pitfall, in case others observed it.

epriestley claimed this task.

If you're open to patching one site, you can just comment out the feof($stderr) to retain the bulk of the change (i.e., still wait for stdout).

The critical path for 7b8d38cd is on the server, which presumably does not run arc land with ControlPersist, so as long as you don't revert it there you aren't exposing yourself to any known badness.

I'll keep an eye out for other reports of this. Definitely good to know about it.

Since only the first connection has an issue, you could also conceivably add a slightly-more-than-one-line patch to make arc land run some no-op remote operation (git remote show origin?) and then immediately kill the subprocess, creating and throwing away the bad connection so the followup doesn't hang.