
Proxy the "git upload-pack" wire protocol
Closed, Public

Authored by epriestley on Apr 8 2019, 4:01 PM.

Details

Summary

Depends on D20380. Ref T8093. When prototypes are enabled, inject a (hopefully?) no-op proxy into the Git wire protocol.

This proxy decodes "git upload-pack" and allows the list of references to be rewritten, in a similar way to how we already proxy the Subversion protocol to rewrite URIs and proxy the Mercurial protocol to distinguish between read and write operations.

The piece we care about comes at the beginning, and looks like this:

<frame-length><ref-hash> <ref-name>\0<server-capabilities>\n
<frame-length><ref-hash> <ref-name>\n
<frame-length><ref-hash> <ref-name>\n
...
<0000>

We can add, remove, or modify this section to make it appear that the server has different refs than the refs that exist on disk.
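The framing above is git's "pkt-line" format: each frame starts with a four-hex-digit length that counts the four length bytes themselves, and "0000" is a flush packet that terminates the ref list. As a rough sketch of what decoding the advertisement involves (Python for illustration only; the actual proxy is PHP inside Phabricator's SSH workflow, and the function name here is made up):

```python
def parse_ref_advertisement(data: bytes):
    """Parse a git pkt-line ref advertisement into (hash, name) pairs.

    Each frame is a 4-hex-digit length (which counts the length bytes
    themselves) followed by the payload; "0000" is a flush packet.
    """
    refs = []
    capabilities = None
    offset = 0
    while offset < len(data):
        length = int(data[offset:offset + 4], 16)
        if length == 0:  # flush packet: end of the ref list
            break
        payload = data[offset + 4:offset + length].rstrip(b"\n")
        offset += length
        ref_hash, _, rest = payload.partition(b" ")
        # The first frame hides the server capabilities after a NUL byte.
        name, _, caps = rest.partition(b"\0")
        if caps:
            capabilities = caps.decode()
        refs.append((ref_hash.decode(), name.decode()))
    return refs, capabilities
```

Rewriting the advertisement is then a matter of editing the (hash, name) list and re-emitting frames with recomputed lengths; the tricky edge cases in the "not tried" list below mostly come from the capabilities frame being glued to the first ref.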

Things I have tried:

  • git ls-remote
  • git ls-remote where the server hides some refs.
  • git fetch where the fetch is a no-op.

Things I have not tried:

  • git fetch where the fetch is not a no-op.
  • Tricking things into doing protocol v2. Or rather: I tried this, but wasn't successful. In v2, additional "\0" tricks are used to hide data in the capabilities, I think?
  • git ls-remote where we rewrite/hide the first ref in the list, and need to move the capabilities frame elsewhere.
  • git ls-remote where the server has no refs at all, or we remove every ref.

So the "interesting" piece of this works, but it almost certainly needs some cleanup to survive interaction with the real world.

Test Plan

See above.

Diff Detail

Repository
rP Phabricator
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley created this revision.Apr 8 2019, 4:01 PM
epriestley requested review of this revision.Apr 8 2019, 4:02 PM
amckinley accepted this revision.Apr 9 2019, 7:38 PM

We probably can't really run this "alongside" the current production git endpoint without doing a ton of work, can we? It would be nice to watch this thing have byte-identical output to prod for a while before cutting over.

This revision is now accepted and ready to land.Apr 9 2019, 7:38 PM

We could.

src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php
125

We can run them simultaneously and check for byte-equivalence by changing this into:

$temporary_new_message = $protocol->willReadBytes($message);
$this->oldBytes .= $message;
$this->newBytes .= $temporary_new_message;

// ... later ...

assert($this->oldBytes === $this->newBytes);

This only gets us into trouble if someone enables prototypes and does a 2GB+ fetch and we hit the string length limit, and we could compare the outputs as we go to avoid that.
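To illustrate the "compare as we go" variant: instead of accumulating both streams in full, you can verify and discard the common prefix each time both sides produce bytes, keeping the buffers bounded. A hypothetical sketch (Python for illustration; the real code would live in the PHP workflow above):

```python
class ByteStreamComparator:
    """Incrementally check that two byte streams are identical.

    Rather than accumulating both streams whole (which breaks on a
    2GB+ fetch), compare and discard the shared prefix as soon as
    both sides have produced it, so memory use stays small.
    """

    def __init__(self):
        self.old = bytearray()  # bytes from the passthru path
        self.new = bytearray()  # bytes from the new protocol proxy

    def feed(self, old_bytes: bytes, new_bytes: bytes):
        self.old += old_bytes
        self.new += new_bytes
        common = min(len(self.old), len(self.new))
        if self.old[:common] != self.new[:common]:
            raise AssertionError("protocol rewrite diverged from passthru")
        # Drop the verified prefix so the buffers never grow unbounded.
        del self.old[:common]
        del self.new[:common]
```

Only the side that is temporarily ahead of the other is ever buffered, so the comparator tolerates the two paths producing output at different rates.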

But I'm not sure this is particularly valuable: we don't have a way to collect external telemetry (and even if we did, a protocol log without an actual repro case probably wouldn't be terribly helpful in many cases), and requests against secure don't touch any of the interesting cases anyway, so it seems unlikely they'd identify issues that don't crop up locally.

This does run "alongside" in the sense that the new code only activates when prototypes are enabled, so secure will see it but production will continue just doing protocol passthru until this feels ready to move forward.

We could.

Oh, I didn't really expand on this -- my rough plan is:

  • Probably don't touch this this week.
  • Before this lands, test all those edge cases and probably get the code into better overall shape -- more structure to the "frames/messages" part, probably unit tests since this stuff is super easy to unit test by dumping a text file with a protocol transcript into a data/ directory and making sure it parses into the expected set of frames.
  • Land it in "prototype-only", without any rewriting happening. Once that's solid, promote the no-rewrite code out of prototype.
  • Land the rewriting in "prototype-only", then promote it later.
  • Somewhere in here, "git fetch-pack" needs a similar treatment.
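The transcript-driven unit tests in the second bullet could look something like this sketch: drop a raw protocol transcript into a data/ directory, split it into frames, and assert on the expected frame list (Python for illustration; the eventual tests would be PHP, and the function name is made up):

```python
def split_frames(transcript: bytes):
    """Split a raw pkt-line transcript into a list of frame payloads.

    Each frame begins with a 4-hex-digit length that counts the length
    prefix itself; "0000" is a flush packet, kept here as b"" so tests
    can also assert on protocol boundaries.
    """
    frames = []
    offset = 0
    while offset < len(transcript):
        length = int(transcript[offset:offset + 4], 16)
        if length == 0:
            frames.append(b"")  # flush packet
            offset += 4
            continue
        frames.append(transcript[offset + 4:offset + length])
        offset += length
    return frames
```

A test would then read a captured transcript file and compare `split_frames(...)` against a hand-written expected list, which makes the parser very easy to exercise against real protocol captures.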

So the rollout can be fairly slow/limited based on how solid the code feels; I'm just not planning for a "double reads" stage where we're passing the bytes through directly but also parsing them and then logging when the messages don't match, since I think that's hard for us to implement and unlikely to catch anything the more accessible approach doesn't catch.

This revision was automatically updated to reflect the committed changes.