We may be able to implement T8092 by proxying the protocol, without needing to embed an implementation of Git. We do this to some degree in Mercurial and SVN already, with success. Although this is complex, it's potentially much less complex than embedding a Git implementation.
Description
Revisions and Commits
rP Phabricator
D20436 | rP904dbf0db666 Make the "git upload-pack" proxy more robust
D20381 | rPe08ba99dd3db Proxy the "git upload-pack" wire protocol
D20380 | rP35539a019ce1 Add an optional protocol log to `git` SSH workflows
Status | Assigned | Task
---|---|---
Open | None | T8090 Allow Harbormaster to perform change handoff in a defensible way
Open | None | T8092 Evaluate the viability of virtualizing Git refs in hosted repositories
Open | None | T8093 Evaluate virtualizing Git refs by proxying the protocol
Open | None | T4369 Phabricator HTTP repository hosting has fairly severe scalability limits
Event Timeline
After tinkering a bit, I think this is viable. The Git wire protocol is relatively straightforward to proxy and rewrite at the ref level. However, we'll need to proxy both SSH and HTTP traffic, so we need to fix T4369 at a minimum before we can pursue this.
Very soon now, git is getting an exciting new wire protocol. Highlights are improved performance on repos with an unholy number of refs, and being "easier to expand".
> exciting new wire protocol
My plan for now is to do v1 support only, since: (a) we'll need v1 for 15 years anyway for everyone running Ubuntu 3 on original Xbox hardware in their corporate enterprise cluster; (b) I can't immediately trick my git into v2 anyway; and (c) it looks easier.
The v1 protocol looks like it's pretty one-shot and straightforward: whether we're running upload-pack or receive-pack, the server immediately sends a complete list of refs to the client when the client connects. This is sort of a weird way for the protocol to work for 10+ years (?), also considering that this is the "smart" protocol, but it makes our job easier, since it looks like we can (as a starting point, at least) just parse the first few frames of the protocol, delete/rewrite some refs, and then drop into passthru mode.
This will just hide the refs from the client. A "malicious" client could still use `want` commands to fetch the underlying commits. However, this is fine: we aren't planning to treat different views of the same repository as having different permissions.
The `want`/`have` negotiation seems ref-independent, so editing the initial list of refs looks like it fixes the whole read pathway with no other changes.
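As a rough illustration of the read pathway described above, here is a minimal Python sketch of parsing the initial v1 ref advertisement and dropping hidden refs before switching to passthrough. It is not the proxy from the linked revisions. The function names and the `refs/secret/` prefix are hypothetical, and it assumes the advertisement is plain pkt-line frames: a 4-digit hex length that counts itself, with `0000` as the flush packet, and capabilities following a NUL byte on the first ref line.

```python
import io

FLUSH = b"0000"


def read_pkt(stream):
    """Read one pkt-line frame; return its payload, or None for a flush packet."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated pkt-line header")
    length = int(header, 16)
    if length == 0:
        return None
    if length < 4:
        raise ValueError("invalid pkt-line length")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise EOFError("truncated pkt-line payload")
    return payload


def write_pkt(payload):
    """Frame a payload as a pkt-line (the length prefix includes its own 4 bytes)."""
    return b"%04x" % (len(payload) + 4) + payload


def rewrite_advertisement(stream, hidden_prefix=b"refs/secret/"):
    """Consume the advertisement up to the flush packet and re-encode it
    without any ref under hidden_prefix (a hypothetical naming scheme)."""
    capabilities = None
    kept = []
    while True:
        payload = read_pkt(stream)
        if payload is None:
            break
        line = payload.rstrip(b"\n")
        if b"\0" in line:
            # The first advertised line carries the capability list after a NUL.
            line, capabilities = line.split(b"\0", 1)
        oid, ref = line.split(b" ", 1)
        if not ref.startswith(hidden_prefix):
            kept.append((oid, ref))

    out = []
    for index, (oid, ref) in enumerate(kept):
        line = oid + b" " + ref
        if index == 0 and capabilities is not None:
            # Reattach capabilities to whichever ref is now first.
            line += b"\0" + capabilities
        out.append(write_pkt(line + b"\n"))
    # Limitation of this sketch: if every ref is hidden, the capabilities
    # are dropped along with them.
    out.append(FLUSH)
    return b"".join(out)
```

After the rewritten advertisement has been sent to the client, the proxy would copy bytes in both directions unchanged. An advertisement consisting only of `0000` passes through this function as `0000`.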
The "push" part is a little messier since the client sends what it's pushing, then sends PACK data, then the server acknowledges what was written. We need to parse all of that so we can rewrite refs in the first part (client thinks it's pushing A, tell the server it's pushing secret/A) and the last part (server acknowledges a write to secret/A, we tell the client the server acknowledged a write to A).
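The two rewrites on the push side could look roughly like the Python sketch below. It is an illustration under assumptions, not the actual implementation. The helper names and the `refs/heads/` to `refs/heads/secret/` mapping are hypothetical. It works on already-unframed pkt-line payloads, and it assumes update commands have the form `<old> <new> <ref>` (with capabilities after a NUL on the first command) and that the status report uses `ok <ref>` and `ng <ref> <reason>` lines that are not wrapped in a side-band channel.

```python
VISIBLE = b"refs/heads/"          # what the client sees (hypothetical mapping)
HIDDEN = b"refs/heads/secret/"    # where the server really stores it


def to_hidden(ref):
    """Map a client-visible ref name to its hidden server-side name."""
    if ref.startswith(VISIBLE):
        return HIDDEN + ref[len(VISIBLE):]
    return ref


def to_visible(ref):
    """Map a hidden server-side ref name back to the client-visible name."""
    if ref.startswith(HIDDEN):
        return VISIBLE + ref[len(HIDDEN):]
    return ref


def rewrite_command(payload):
    """Rewrite one client update command before forwarding it to the server."""
    line = payload.rstrip(b"\n")
    caps = None
    if b"\0" in line:
        # Only the first command carries capabilities; keep them untouched.
        line, caps = line.split(b"\0", 1)
    old, new, ref = line.split(b" ", 2)
    line = b" ".join([old, new, to_hidden(ref)])
    if caps is not None:
        line += b"\0" + caps
    return line + b"\n"


def rewrite_status(payload):
    """Rewrite one server status line before forwarding it to the client."""
    line = payload.rstrip(b"\n")
    parts = line.split(b" ", 2)
    if parts[0] in (b"ok", b"ng") and len(parts) >= 2:
        parts[1] = to_visible(parts[1])
        line = b" ".join(parts)
    # Lines such as "unpack ok" pass through unchanged.
    return line + b"\n"
```

The PACK data between the commands and the status report does not mention ref names, so in this sketch it would be forwarded as-is. Because rewriting changes payload lengths, each rewritten payload has to be re-framed with a fresh pkt-line length prefix.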
When there are no refs in a repository, the server does not appear to send a capabilities frame:
    ! git-upload-pack -- '/Users/epriestley/dev/core/repo/local/12/'
    < Write [4 bytes]
    < 30303030 0000
    > Read [4 bytes]
    > 30303030 0000
    _ <End of Session>
This makes our job a lot easier but also is absolutely bananas?