Page MenuHomePhabricator

Add a "Saved States" indirection layer on top of "Staging Areas"
Open, NormalPublic

Description

Phabricator currently uses "Staging Areas" to transfer code between arc on the client and Build/CI/Deployment systems. "Staging Areas" are temporary refs in some git repository. At time of writing, they are tags named refs/tags/phabricator/diff/123.

Some advantages of staging areas are:

  • they correctly transmit the entire client state without any additional work; and
  • third party build systems are generally ONLY able to build refs (and often not even refs; only tags/branches), so state must be represented as a ref for third-party build systems to work.

Some disadvantages of staging areas are:

  • They are cumbersome to configure, particularly if you have many repositories.
  • The client needs to push them. In large repositories, this is slow. It is also messy (we currently must push a base ref to trick Git into being moderately efficient).
  • They are visible to users with the git client, although users do not care about them directly. This is worse with a branch representation than a tag representation.
  • The lifetime of staging areas is unclear and GC'ing them isn't straightforward.

Historically, we shipped patches around instead. This is better on all these dimensions, but we normally can't send a patch to external CI, and our "ship patches around" infrastructure has some limitations in accurately transmitting changes.


See PHI1454. This is a large install reporting that staging areas are fundamentally slow in their monorepo and suggesting using git bundle instead.

See PHI1462. This is a large install reporting that staging areas are slow (here, the visible symptom is waiting for write locks).

See PHI1141. This is an install which would like a way to retrieve client state including merge ancestry.


Related upstream issues are:

  • I want to build arc save and arc load. arc load is just a fancy version of arc patch. arc save is a surgical version of "send this state somewhere", intended to replace "git push to save changes".
  • T13278 is approximately an earlier version of this task which focuses more narrowly on improving staging areas.
  • T8238 is an ancient version of this.

Here's the current plan I'm considering:

Introduce a new "Saved State" Object. This is visible in /differential/saves/ or something. You initially write this object with arc save and read it with arc load. If you want to "back up your work" or "share your work with someone", you arc save --as tmp123 and they arc load saves/epriestley/tmp123 or similar.

The actual object is a commit hash identifying what state it has saved, and a list of resources to build that state. These resources might be:

  • Pointer to a Ref: Means the state is already pushed somewhere, and the client should fetch it. So a "Saved State" can be built from a staging area or an existing published commit trivially.
  • List of base commits + patches: This is a formal way to say "check out master, then apply these patches". Here, the patches would reconstruct the original commits accurately (see T1508), not just the commit range. This is easiest to generate with git format-patch --stdout, although this may not be suitable for merge commits. git bundle is another possible alternative. This will have a long tail of mess, but nothing intractable.

From here:

  • Drydock learns to build working copies from saved states.
  • Drydock learns to turn a saved state into a staging area ref if downstream build systems can only work on branches/tags. This simplifies all the configuration/user-facing stuff.
  • Replace arc patch with arc load.

We can still do the ref virtualization stuff, and still support direct construction of staging areas from the client if installs don't want to deal with Drydock, although I'd imagine deprecating this eventually once Drydock is easier to use.

Event Timeline

epriestley triaged this task as Normal priority.Fri, Sep 27, 6:28 PM
epriestley created this task.

One broad problem here is "chain of custody" issues in T182. A "Saved State" can easily accommodate multiple representations, and the plan above imagines using Drydock to build tags/branches out of non-repository representations, so we'd have cases where a given "Saved State" has a way to build it with a "patch list" (from the client) or a "ref pointer" (from Drydock).

I can make my life a lot easier by supporting a "diff of the entire commit range" representation and just using that in Differential. However, this means a malicious client can upload one "patch list" change and a different "visible diff" change, get review on the safe "visible" change, then have Drydock land the malicious change. This problem exists today, but it's a problem I'd like to design around here, since my gut is that this is likely to be the final design for a very long time.

Ideally, when the client submits a list of patches, we'd just parse and merge them into a single aggregate diff for Differential to deal with. This is trivial if we have a staging area, but I'd like to do it interactively while the user is waiting on arc diff, so they can get a link to a working result immediately. This is less trivial. This doesn't need to be resolved immediately.

ftdysa added a subscriber: ftdysa.Sun, Sep 29, 3:34 PM