Page MenuHomePhabricator

Add a "Saved States" indirection layer on top of "Staging Areas"
Open, NormalPublic

Description

Phabricator currently uses "Staging Areas" to transfer code between arc on the client and Build/CI/Deployment systems. "Staging Areas" are temporary refs in some git repository. At time of writing, they are tags named refs/tags/phabricator/diff/123.

Some advantages of staging areas are:

  • they correctly transmit the entire client state without any additional work; and
  • third party build systems are generally ONLY able to build refs (and often not even refs; only tags/branches), so state must be represented as a ref for third-party build systems to work.

Some disadvantages of staging areas are:

  • They are cumbersome to configure, particularly if you have many repositories.
  • The client needs to push them. In large repositories, this is slow. It is also messy (we currently must push a base ref to trick Git into being moderately efficient).
  • They are visible to users with the git client, although users do not care about them directly. This is worse with a branch representation than a tag representation.
  • The lifetime of staging areas is unclear and GC'ing them isn't straightforward.

Historically, we shipped patches around instead. This is better on all these dimensions, but we normally can't send a patch to external CI, and our "ship patches around" infrastructure has some limitations in accurately transmitting changes.


See PHI1454. This is a large install reporting that staging areas are fundamentally slow in their monorepo and suggesting using git bundle instead.

See PHI1462. This is a large install reporting that staging areas are slow (here, the visible symptom is waiting for write locks).

See PHI1141. This is an install which would like a way to retrieve client state including merge ancestry.


Related upstream issues are:

  • I want to build arc save and arc load. arc load is just a fancy version of arc patch. arc save is a surgical version of "send this state somewhere", intended to replace "git push to save changes".
  • T13278 is approximately an earlier version of this task which focuses more narrowly on improving staging areas.
  • T8238 is an ancient version of this.

Here's the current plan I'm considering:

Introduce a new "Saved State" Object. This is visible in /differential/saves/ or something. You initially write this object with arc save and read it with arc load. If you want to "back up your work" or "share your work with someone", you arc save --as tmp123 and they arc load saves/epriestley/tmp123 or similar.

The actual object is a commit hash identifying what state it has saved, and a list of resources to build that state. These resources might be:

  • Pointer to a Ref: Means the state is already pushed somewhere, and the client should fetch it. So a "Saved State" can be built from a staging area or an existing published commit trivially.
  • List of base commits + patches: This is a formal way to say "check out master, then apply these patches". Here, the patches would reconstruct the original commits accurately (see T1508), not just the commit range. This is easiest to generate with git format-patch --stdout, although this may not be suitable for merge commits. git bundle is another possible alternative. This will have a long tail of mess, but nothing intractable.

From here:

  • Drydock learns to build working copies from saved states.
  • Drydock learns to turn a saved state into a staging area ref if downstream build systems can only work on branches/tags. This simplifies all the configuration/user-facing stuff.
  • Replace arc patch with arc load.

We can still do the ref virtualization stuff, and still support direct construction of staging areas from the client if installs don't want to deal with Drydock, although I'd imagine deprecating this eventually once Drydock is easier to use.

Event Timeline

epriestley triaged this task as Normal priority.Sep 27 2019, 6:28 PM
epriestley created this task.

One broad problem here is "chain of custody" issues in T182. A "Saved State" can easily accommodate multiple representations, and the plan above imagines using Drydock to build tags/branches out of non-repository representations, so we'd have cases where a given "Saved State" has a way to build it with a "patch list" (from the client) or a "ref pointer" (from Drydock).

I can make my life a lot easier by supporting a "diff of the entire commit range" representation and just using that in Differential. However, this means a malicious client can upload one "patch list" change and a different "visible diff" change, get review on the safe "visible" change, then have Drydock land the malicious change. This problem exists today, but it's a problem I'd like to design around here, since my gut is that this is likely to be the final design for a very long time.

Ideally, when the client submits a list of patches, we'd just parse and merge them into a single aggregate diff for Differential to deal with. This is trivial if we have a staging area, but I'd like to do it interactively while the user is waiting on arc diff, so they can get a link to a working result immediately. This is less trivial. This doesn't need to be resolved immediately.

ftdysa added a subscriber: ftdysa.Sep 29 2019, 3:34 PM
leoluk added a subscriber: leoluk.Oct 28 2019, 1:04 PM

A saved state is likely something like this:

  • Zero or more "commits", except that these might also be "uncommitted".
    • In the case of arc save in an arbitrary Git working copy, we may have (at least) two uncommitted states ("staged" + "unstaged").
    • It's possible we might have three (staged, unstaged, ignored) but "ignored" isn't a property of files. I think arc save should probably respect .gitignore unless invoked in some kind of arc save --flat mode.
    • In the case of "arc save" in some random directory which isn't part of a VCS, there's just one "commit" which isn't committed.
    • Although 99.9% of these objects will be 1:1 with Git commits, I'm inclined to call them something else ("substate" or "waypoint" or somesuch) to avoid confusion.
    • Waypoints may be connected in a DAG, as with commits.
    • Waypoints may reference other external waypoints as dependencies. arc load will start by resolving waypoint dependencies.

A Waypoint is:

  • Some metadata, like a commit message / author / time / etc.
  • A (possibly empty) tree-state. This might be a complete tree, but might also be a tree-delta, or maybe a mixture of the two.
  • When a waypoint is just a reference to some existing Git tree, maybe it doesn't have an actual Tree object, or maybe the Tree object is very thin.

A Tree is basically the same as a Git tree, I think:

  • If you run arc save in a non-VCS directory, we probably just produce a list of paths and blobs? It doesn't seem to make a ton of sense to store this as deltas.
  • In other cases, a list of paths and deltas, maybe with optional base states for sanity checking.
  • Probably always a list of paths? Even if we get a raw input diff we'd like to parse it path-by-path.

A Path is:

  • The path itself.
  • Some metadata (permissions flags, SVN metaproperties).
  • Some way to build the data state, either as a delta or a blob.

A State is:

  • One or more waypoints.
  • An author.
  • An associated repository PHID for policy stuff?

The silly nonsense we need in support tables:

  • Paths are arbitrarily long binary blobs.
  • Commit messages are arbitrarily long binary blobs.
  • SVN properties are arbitrarily long binary blobs.
  • Blobs are, legitimately, arbitrarily long binary blobs.
  • Diffs are arbitrarily long binary blobs.

Paths need to be ordered and should go in an external strings table. I think everything else can go in a single "blobs" table with fallback to Files; we never need to search for states which affect some particular SVN property. We might want to search commit messages in saved states (hopefully not?) but that could be indexed.


When building these objects outside of a VCS, they probably all need authorPHID and author-only visibility. The client would differential.path.edit each path into existence, then differential.tree.edit the paths into a tree, then differential.waypoint.edit the tree into a waypoint, and finally differential.state.edit the waypoint into a state.

If these are immutable, which would be nice, it may be hard to build a tree with more than ~4MB of paths via HTTP. This is something in the realm of ~100K paths, which someone will probably hit while doing something very off-label. I don't have a clean fix here.

Likewise, it may be hard to build a state with more than ~4MB of waypoints. This is ~100K commits, which someone will probably also hit while doing something very off-label. For example, git clone some-big-third-party thing; arc save would clearly mean "save this entire repository's history in states", and at least some users will "Y" through any number of warnings about how ridiculous they're being. So maybe these objects need "constructing" and "finished" states. But perhaps I can cross that bridge when we come to it.