
Support local port forwarding through Phacility cluster bastion hosts
Closed, Wontfix (Public)

Description

See PHI1737. I'm running into an issue in production which I'm having difficulty reproducing locally. Roughly, submitting a particular form generates a CSRF exception for a recently-imported instance. This works fine on other instances and locally. Debugging the parts of the workflow which are easily reachable from the CLI hasn't yielded fruit.

Much of this workflow isn't easily inspectable from the CLI, and there's currently no way to run Phacility production in a debuggable context. This is mostly by design, but it makes the tiny fraction of problems which are data-dependent and resist local reproduction harder to understand.

I'd like to provide a workflow to pull a reproduction case into a debuggable environment, like this:

  • the environment is some phantom web-debug host which is not in any LB pool;
  • the ports are glued together with ssh -L 80:web-debug001:80 via a bastion host;
  • then you can stop the local webserver, start the tunnel, poke your hosts file, and should be able to use an actual browser to review behavior and nano on the web-debug host to affect behavior.
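
The tunnel and hosts-file steps above might look like the following sketch. The bastion name `bastion.example.com`, the instance domain, and the helper names are hypothetical; only `web-debug001` and the `80:web-debug001:80` forward come from the description. The helpers print text rather than running anything, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers; they only print, nothing is executed or modified.

# Print an ssh command forwarding a local port to a cluster host via a bastion.
#   $1 local port   $2 target host   $3 target port   $4 bastion host
tunnel_command() {
  # -N: set up the forward only, without running a remote command.
  printf 'ssh -N -L %s:%s:%s %s\n' "$1" "$2" "$3" "$4"
}

# Print a hosts-file line pointing an instance domain at the local tunnel.
#   $1 instance domain (placeholder; substitute the real one)
hosts_entry() {
  printf '127.0.0.1 %s\n' "$1"
}
```

For example, `tunnel_command 80 web-debug001 80 bastion.example.com` prints `ssh -N -L 80:web-debug001:80 bastion.example.com`, and `hosts_entry instance.example.com` prints the line to add to the hosts file while the tunnel is up.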

Notes:

  • AllowTcpForwarding must allow local. Enabling this allows any user who can connect to a bastion to forward through to any port on any cluster host, and effectively grants them permission to make outbound connections from the bastion to any host the bastion can reach. Today, this is fine (all users with access to the bastion are allowed to establish sessions on the bastion and initiate outbound connections), but in the future it might be appropriate to tie this permission more tightly to user roles. This can be accomplished by specifying per-key options in authorized_keys.
  • If AllowTcpForwarding prevents forwarding, the failure mode seems to be implicit (the host listens and accepts the connection but immediately resets it) rather than an explicit error like "You aren't allowed to forward ports because AllowTcpForwarding is off."
Port Not Forwarded
$ curl http://127.0.0.1:17000/
curl: (7) Failed to connect to 127.0.0.1 port 17000: Connection refused
Port Forwarded, AllowTcpForwarding Disabled
$ curl http://127.0.0.1:17000/
curl: (56) Recv failure: Connection reset by peer
Port Forwarded, AllowTcpForwarding Enabled
$ curl 127.0.0.1:17000
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

This is a little weird, but whatever?
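
Since ssh reports nothing itself, the three cases above can be told apart by curl's exit status. The sketch below maps the statuses seen in the transcripts (7 and 56) to a guess about the tunnel state. The function names and state labels are hypothetical, and status 56 can have other causes, so treat the result as a hint rather than a diagnosis.

```shell
#!/bin/sh
# Hypothetical helpers for guessing the tunnel state from curl's exit status.

# Map a curl exit status to one of the states shown in the transcripts.
#   $1 curl exit status
classify_curl_status() {
  case "$1" in
    0)  echo "forwarded" ;;          # a response was received
    7)  echo "not-forwarded" ;;      # connection refused: nothing is listening
    56) echo "forwarding-denied" ;;  # accepted, then reset by the far side
    *)  echo "unknown" ;;
  esac
}

# Probe a URL through the tunnel and print the guessed state.
#   $1 URL, for example http://127.0.0.1:17000/
probe_tunnel() {
  status=0
  curl --silent --output /dev/null --max-time 5 "$1" || status=$?
  classify_curl_status "$status"
}
```

Running `probe_tunnel http://127.0.0.1:17000/` before touching the hosts file gives a quick check that the forward is actually permitted.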

  • There's no UI feedback from ssh that a connection is supporting tunnels. As a UI affordance to reduce surprises, bin/remote tunnel (or whatever) could perhaps exec sh -c 'echo "this is a tunnel" ; sleep 86400 ; exit' or similar to make the connection an explicit tunnel with a convenience timeout.
  • To forward local 80, we need to sudo ... -- ssh ..., and the messaging when you don't is a bit weird. This may also require an explicit --identity, since the default is affected by use of sudo. Phabricator could also be modified to respect HTTP port numbers arriving in Host: ... headers, but I'm not wildly excited about this.
  • The web-debug role should enable opcache.validate_timestamps.
  • The web-debug role should probably enable Phabricator application-level debug flags (darkconsole, etc).
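
Two of the notes above can be sketched as small helpers. The first prints an authorized_keys line using OpenSSH's `permitopen` key option, one way to limit a key to a single forwarding destination. The second prints a tunnel command that binds local port 80 via sudo, passes the identity explicitly, and runs the "this is a tunnel" banner with a 24-hour timeout. The helper names, key material, paths, and bastion name are all hypothetical, and plain `sudo ssh` stands in for whatever `bin/remote tunnel` would actually do.

```shell
#!/bin/sh
# Hypothetical helpers; they only print text for review.

# Print an authorized_keys line whose key may only forward to one destination.
#   $1 allowed host:port   $2 public key (type, base64 data, comment)
restricted_key_line() {
  printf 'permitopen="%s" %s\n' "$1" "$2"
}

# Print a tunnel command that binds a privileged local port.
#   $1 identity file   $2 forward spec   $3 bastion host
explicit_tunnel_command() {
  # sudo changes which default key ssh finds, so the identity is passed with -i.
  # The remote command announces the tunnel and ends it after 24 hours.
  printf "sudo ssh -i %s -L %s %s 'echo \"this is a tunnel\"; sleep 86400'\n" \
    "$1" "$2" "$3"
}
```

For example, `restricted_key_line web-debug001:80 'ssh-ed25519 AAAA user@laptop'` prints `permitopen="web-debug001:80" ssh-ed25519 AAAA user@laptop`.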

HTTPS

Phacility web hosts run with security.require-https. Normally, HTTPS is terminated by the LB and preamble marks the request as HTTPS-on-the-client. When forwarding raw 80:80, the client is not HTTPS and the request is not marked as HTTPS (this request is still secure, since the external part is over SSH and the internal part is inside the VPC).

This creates a problem when Phabricator tries to figure out if it can set the secure flag on the cookie, and it refuses to set a non-HTTPS cookie with security.require-https enabled. The "real" fix here is probably to configure web-debug hosts in a special way that disables security.require-https.

Since I'm just using an extra web host as a web-debug host for now, I'm going to fake my way through this for the moment.
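
For reference, the "real" fix might amount to something like the sketch below, which prints the config commands a web-debug role could apply. It assumes a standard Phabricator checkout where `./bin/config set` is available; the Phacility cluster may well manage configuration differently. The function name is hypothetical, and `darkconsole.enabled` is my guess at the option behind the darkconsole note in the list above.

```shell
#!/bin/sh
# Hypothetical: print the config commands a web-debug role might apply.
# Assumes ./bin/config is run from a Phabricator checkout on the debug host.
web_debug_config_commands() {
  printf '%s\n' \
    './bin/config set security.require-https false' \
    './bin/config set darkconsole.enabled true'
}
```

Printing the commands instead of running them keeps the sketch safe to paste anywhere; they should only ever be applied on a host that is outside every LB pool.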

Event Timeline

epriestley created this task.

The specific issue I'm trying to debug is fairly bizarre.

If I make a request to production (Laptop → LB → web001-8), I get back an invalid session and corresponding CSRF token, which fail to validate cleanly from the CLI on any host.

If I make a request to "web-debug" (really "web009") (Laptop → SSH Bastion Tunnel → web009), I get back good credentials which validate everywhere.

These environments are very nearly completely identical.

I think what's happened is that part of the CSRF algorithm is:

PhabricatorHash::digestWithNamedKey($phsid, 'csrf.alternate');
...
    $hash = PhabricatorHash::digestWithNamedKey(
      $secret.$time_block.$salt,
      'csrf');

digestWithNamedKey(...) uses PhabricatorCaches::getImmutableCache() since this value is normally immutable. However, it's definitely not immutable if the database has been restored underneath it, as happened here. So the explanation is that APCu has cached different hash secrets on some web tier instances, and the immediate solution is to cycle web.

In the longer term, either globally versioning all APCu cache entries or explicitly dirtying the cache would prevent this in the general case. Since we already have a logical clock for instance versioning, versioning might be easier, but some part of the import process needs to know it should increment the clock (currently, suspend/unsuspend should accomplish this).
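
The versioning idea can be illustrated with a toy file-backed cache. This is not Phabricator's APCu code: the function names, the `CACHE_DIR` variable, and the key format are hypothetical, and shell is used only to stay consistent with the other examples on this task. The point is that embedding the version clock in the key makes every entry written under an older clock unreachable once the clock is incremented.

```shell
#!/bin/sh
# Toy illustration of version-prefixed cache keys (all names hypothetical).
CACHE_DIR="${CACHE_DIR:-$(mktemp -d)}"

# Build a key that embeds the version clock.
#   $1 version clock   $2 logical key (must not contain "/")
versioned_cache_key() {
  printf 'v%s:%s' "$1" "$2"
}

# Store a value under the versioned key.
#   $1 version clock   $2 logical key   $3 value
cache_put() {
  printf '%s' "$3" > "$CACHE_DIR/$(versioned_cache_key "$1" "$2")"
}

# Print the cached value, or nothing on a miss.
#   $1 version clock   $2 logical key
cache_get() {
  cat "$CACHE_DIR/$(versioned_cache_key "$1" "$2")" 2>/dev/null
}
```

After `cache_put 1 csrf.key old-secret`, `cache_get 1 csrf.key` prints `old-secret` while `cache_get 2 csrf.key` prints nothing, so a host that has seen the clock increment would recompute the secret instead of serving the stale one.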

So I'm going to turn web off and on again and see if that fixes things; my expectation is that it will.


Yes, it did. Blergh. "Two hard problems in computer science..."

epriestley claimed this task.

This isn't really resolved, but it almost certainly does not make sense to pursue given the Phacility wind-down.

In this particular case, it probably would not have helped much, since the web-debug host would not have had the key in cache and would have worked. This might have been enough of a hint for me to figure out the issue, but that seems like a long shot.

I ultimately figured this out in a couple of hours anyway and this class of problem is extremely rare.