Page MenuHomePhabricator

Add a prompt to allow pruning merged branches when using --hold
AbandonedPublic

Authored by cspeckmim on Jul 11 2021, 5:57 AM.

Details

Reviewers
None
Group Reviewers
Blessed Reviewers
Maniphest Tasks
T13546: Modernize the "arc land" workflow
Summary

Refs T13546

Prior to changes in T13546, when using arc land --hold with mercurial the resulting state had both merged the branch and pruned the merged branch, effectively doing everything except pushing the squashed commit upstream. This behavior was different from git, where git would not prune the merged branch in addition to not pushing the squashed commit. The behavior used with git was the intended behavior, and in this sense mercurial's behavior was abnormal. Changes from T13546 brought mercurial's behavior to match git's.

There are a few use cases where pruning the original branch while using --hold has been desirable, though admittedly these are likely very specific to our internal processes. I've discussed with colleagues to understand more use cases than the one I was originally familiar with, and it turns out several have refused to upgrade arcanist because of this specific change in behavior. The use cases we have are:

  1. When making fixes in older versions of a product, we require all changesets exist in the repos for newer versions before allowing it to be pushed to the older version. In this case, using plain arc land from the older version will fail due to the push being rejected. Previously we used arc land --hold, jump to the newer versions, pull & merge the new commit to land, then push up that commit in all versions. Now it requires some additional handling to clean up the merged branch, as well as requiring more-specific pulls between repositories as by default mercurial will pull all commits.
  2. There are some products which require re-computing new generated code based on the new state of the codebase. Using arc land will pull in new changesets and rebase the work to be landed on top. Previously arc land --hold was used to get everything setup to go, a task is run to compute new generated code, then ammended to the commit to push. This now requires additional cleanup of the original branch.
    • I suspect there might be changes that could be made to the develepment process (e.g. always pull & rebase, generate code, update diff, then land) or accomplish this with CI/CD however I do not work on this product and am not involved in their development processes.
  3. Some colleagues really want to inspect everything before pushing things upstream. Also here the conflict is in having to clean up the old branches.

This change adds a new prompt arc.land.prune-after-hold which appears when using --hold and allows the user to choose whether the merged branch should be pruned.

As part of this I updated ArcanistPrompt to allow specifying that the prompt does not abort the workflow if the user responds n, and the allowing the user's response to be retrieved. This is because the response to this new prompt changes the behavior of arc land instead of stopping the workflow.

Test Plan

In mercurial I created 3 diffs, each with 2 commits. I ran arc land diff2 --hold and saw a prompt asking if the branches should be pruned after the hold. I entered y and after the land completed I verified that the old merged branch was pruned and the resulting working state allowed for pushing up the new master. I also verified that the indication directions after the hold only indicated pushing up the changes and not how to restore back to the original state, as once the branches are pruned this is no longer possible.

I set up the same scenario and when I ran arc land diff2 --hold I answered n to the prompt. After the land completed I verified that the state of the local repo contained both the squashed commit as well as the original branch. I also verified that the instructions for the hold also included steps to restore to the original state.

I setup the same scenario again and I ran arc land diff2 without the hold. I verified that the two revisions landed properly and the original branches were pruned.

I setup the same scenario again and I ran arc land diff2 without the hold, but when prompted to confirm landing the revisions I answered n and the land workflow was cleanly aborted/stopped, verifying that existing prompt behavior is in tact.

TODO:
Same test in git

Diff Detail

Repository
rARC Arcanist
Branch
holdprune
Lint
Lint Passed
Unit
Tests Skipped
Build Status
Buildable 25441
Build 35152: Run Core Tests
Build 35151: arc lint + arc unit

Event Timeline

src/land/engine/ArcanistLandEngine.php
1266

Fun note, at one point I accidentally had the pruning (not cascading) happening after every merge and not after all merges. The result is that mercurial evolve got very confused, to where I might try to reproduce with minimal setup without arcanist and report it as a bug. Very odd things started to happen, like mercurial marking an obsolete commit as tip, moving a bookmark to an obsolete revision (I guess this is allowed?), and when moving that bookmark the tip revision went to seemingly arbitrary branch heads (this one felt like there's code in there trying to handle tip being somewhere it's not supposed to be).

Update ArcanistPrompt::execute() to return the result rather than having to retrieve it via a getter function. This makes the use of the prompt more ergonomic.

Rewrite comment for clarity

If the merged branches are pruned then reconcileLocaleState() to put the repository in a more-expected state, as if having not used --hold but everything still needs pushed.

Slight cleanup to logic around reconciling vs. discarding local state and calling didHoldChanges(). Looking at the implementations of didHoldChanges() I don't believe it relies on the local state having been discarded or not (primarily it re-applies things previously stashed). The implementations primarily use it for the original chagesets/bookmarks to reference.

There are a few use cases where pruning the original branch while using --hold has been desirable, though admittedly these are likely very specific to our internal processes.

I generally think this is too niche to upstream -- it quite possibly affects only your install. Although it sounds like it wasn't the behavior under Mercurial previously, the intent of --hold is to serve as a dry run and abort the workflow midway through, e.g. it is documented as:

With --hold, execution stops here, before the change is pushed.

...that is, execution stops before:

... Once everything has been pushed, cleanup occurs. ...

I'm comfortable with having --hold print out the prune commands, but I don't want to add more prompts and flags upstream to address use cases at this level of niche behavior. On the particular use cases:

When making fixes in older versions of a product, we require all changesets exist in the repos for newer versions before allowing it to be pushed to the older version.

This definitely feels extremely niche -- I can't recall seeing another organization that has different repos for different production versions, and requires that one repository be a strict subset of the other repository (approximately?). I'm sure there's some reasoning for this but it seems like "one repository, with more steps" at first glance.

There are some products which require re-computing new generated code based on the new state of the codebase.

This is a worthy and reasonable use case, but I imagined addressing it by giving arc the ability to run "codegen / post-merge-rebase" steps. It would then be able to handle code generation after rebases, and could get this right for cascading rebases, arc patch, etc. Phabricator has this issue too, although it never rose past the level of vague annoyance.

Some colleagues really want to inspect everything before pushing things upstream. Also here the conflict is in having to clean up the old branches.

I'd expect they can run arc land --hold as a "dry-run", then repeat the command without --hold to get to the desired state. This should (I think?) be less work than copy/pasting or retyping a push command, although maybe it takes slightly longer (a few seconds?).

Generally, pruning after --hold means that if they run arc land --hold, get up to grab a coffee, come back, push, and the remote has moved ahead so the push is rejected, it's nontrivial to fix things. You can't just run the same command again since none of your local refs exist. Maybe they're comfortable getting out of this state, but I think this is a tricky state for newer users in general.


I'd conceptually be comfortable with marking these refs as "maybe-landed" and letting some other process (like hypothetical arc cleanup in T3277) remove them, but I'm not comfortable with bringing a workflow upstream that can leave the repository in a state that (a) hasn't been published and (b) can not be published by re-running arc land (since refs were deleted). A great deal of the complexity in arc is trying to avoid states where users can't move forward by running arc land again after something unexpected happens.


This actual code is reasonable if you're comfortable with arc land putting the repository into a state that arc land can not get the repository out of. The change to ArcanistPrompt may not cover all prompt use cases (e.g., maybe setReturnsResponse() rather than setAbortsWorkflow() covers more ground in the long term?), but it's generally reasonable and I imagined prompts eventually including decisions rather than just workflow aborts. So there's nothing bad here technically, I just want to upstream arc land to continue having great difficulty putting working copies in states with no simple way forward.

Rebased. I think I did this correctly?

Added some comments based on new understandings about the land process.

Mercurial's reconcileLocalState() does not rely on changes having been pushed but I'm not yet sure whether Git's reconcileLocalState() can currently operate without this assumption.

cspeckmim published this revision for review.EditedJul 12 2021, 12:03 AM

Oh just noticed your comment on this, marking for review so emails go out (I don't plan to make further changes here).

I'd expect they can run arc land --hold as a "dry-run", then repeat the command without --hold to get to the desired state.

Would the expectation then be that running arc land again would remove the rebased/squashed commits created by --hold, so the result would be as if they hadn't run --hold previously? Or is that what the arc cleanup command would be responsible for?

I generally think this is too niche to upstream
I just want to upstream arc land to continue having great difficulty putting working copies in states with no simple way forward.

Hm okay I'll stew about this some more to consider what solutions acceptable for upstream would also work for us, or if we should only provide something like this in our fork, even if temporal until a more appropriate solution comes along. The arc cleanup sounds like it might be a good approach, though my understanding right now is that arc is effectively stateless on the client machines and implementing something like cleanup would require storing local state and ensuring it is regularly synced with the server.

When making fixes in older versions of a product, we require all changesets exist in the repos for newer versions before allowing it to be pushed to the older version.

This definitely feels extremely niche -- I can't recall seeing another organization that has different repos for different production versions, and requires that one repository be a strict subset of the other repository (approximately?). I'm sure there's some reasoning for this but it seems like "one repository, with more steps" at first glance.

Hmm yea. Until ~2012-2015 the officially recommended strategy for branching in Mercurial was cloning the entire repository, which is where we end up with separate repositories instead of just different branches (we adopted Mercurial in ~2008). We regularly consider switching to normal branches however it's never made it to the top of our list of things to tackle. Though, whether a single repository or multiple, the restriction around ensuring that commits exist in subsequent(?) versions/branches is to enforce that the engineer responsible for fixing the problem also makes sure that it gets fixed in all later versions via the merging. We've also discussed switching over to what seems like the more conventional process by having fixes cherry-picked from trunk into the released versions, though we have concerns over how well that would scale. Right now we have server-side scripts when pushing to release/patch versions which check for the existence of the incoming commits to happen in the subsequent versions/branches/repos and rejects them if they don't. Off-hand I don't know of a solution that could provide that same guarantee for cherry-picking. Our current process also requires that the engineer who authored the fix is the one responsible for merging it, rather than having release engineers or an unlucky engineer responsible for performing all merges into the subsequent versions.

There are some products which require re-computing new generated code based on the new state of the codebase.

This is a worthy and reasonable use case, but I imagined addressing it by giving arc the ability to run "codegen / post-merge-rebase" steps. It would then be able to handle code generation after rebases, and could get this right for cascading rebases, arc patch, etc. Phabricator has this issue too, although it never rose past the level of vague annoyance.

This sounds interesting. Maybe something in the .arcconfig like a arc.land.pre-push-exec config that is a command to execute to apply further changes, then amend into the squashed commit, probably updating the diff in phab, then pushing up? Hmm..

Also - I added some comments in here to ArcanistLandEngine which I think would be generally useful. Are you okay if I move those over into D21680 before landing that?

Would the expectation then be that running arc land again would remove the rebased/squashed commits created by --hold, so the result would be as if they hadn't run --hold previously?

I'm not 100% sure what actually happens in Mercurial today, but in Git: the rebased/squashed commits aren't reachable from any refs, so Git automatically deletes them after a couple weeks. If you run arc land --hold and then arc land (and whatever you're landing onto hasn't moved ahead between your two executions), you'll end up with an exactly equivalent set of commits (exact same changes to the underlying code), so if you inspect the arc land --hold changes, any observations you make should apply exactly to the arc land changes that actually publish.

The commits created as a side effect of arc land --hold aren't actually deleted explicitly, but they're effectively gone in Git as soon as you update your working copy to any other state. I'm not sure how closely this holds (or can hold) in Mercurial.

In either VCS, if master or default or whatever has moved forward while you were inspecting changes, the second arc land may not produce exactly the same change the first one did (it will be rebasing onto a different base) but the changes should generally be trivial. If you'd used arc land --hold + inspect + manual push, your manual push would have failed anyway.

The arc cleanup sounds like it might be a good approach, though my understanding right now is that arc is effectively stateless on the client machines and implementing something like cleanup would require storing local state and ensuring it is regularly synced with the server.

Right. It doesn't necessarily have to be synchronized to the server -- I think it would be fine to say "local branch Q at commit R in this working copy was hold-landed into commit S on 2020-03-13, so arc cleanup can consider it safe to delete now if Q still points at R and S is present in the remote". I don't think most users would expect arc land --hold in one working copy to clean up branches with the same name in a different working copy (even the old accidental arc land behavior wasn't this powerful!).

A local cache would be good enough to let arc cleanup almost always do arc patch cleanups safely, too (e.g., change X was patched to branch Y at commit Z on date D, so require --unsafe --force --really --aggressive flags if D was very recent or Y no longer points at Z or the user happens to create a branch named patch-1234, and/or arc patch could then begin using human-readable names).

There's also some general discussion of this kind of additional local-or-local-and-remote commit mapping in T3875.

Also - I added some comments in here to ArcanistLandEngine which I think would be generally useful. Are you okay if I move those over into D21680 before landing that?

Yep, those look good to me.

The commits created as a side effect of arc land --hold aren't actually deleted explicitly, but they're effectively gone in Git as soon as you update your working copy to any other state. I'm not sure how closely this holds (or can hold) in Mercurial.

Ah yea this is one difference with Mercurial and Git's treatment of branches. In Mercurial there is no automatic cleanup, so things must be stripped/pruned manually before they will go away -- regardless of whether they're tagged with a bookmark or not. I'm still piecing together Git's handling of branches in my mind as I have used Mercurial for so long. Conceptually I know how I want to manipulate the repository but then things happen which reveal that how I'm thinking the repository is laid out isn't what Git's doing (Arcanist has been very useful in this respect).

Thank you for all the information here, I think it will help to lay out some experimentation and then plans for future plans here. I'll abandon this but use the information here for reference.

Just thinking aloud to myself

Off-hand I don't know of a solution that could provide that same guarantee for cherry-picking. Our current process also requires that the engineer who authored the fix is the one responsible for merging it, rather than having release engineers or an unlucky engineer responsible for performing all merges into the subsequent versions.

This could probably be handled by having the server-side hook check for commits with the same task number, as we require all commits to start with a task number. It's not as "comfortable" as ensuring the same changeset exists in the branch but since the changesets require merge commits which can potentially introduce incorrect merges that comfort might be just as misleading.

I'm still piecing together Git's handling of branches in my mind as I have used Mercurial for so long.

I think the big difference is that in Git, commits don't have to be on any branch. Git's branches are more or less like Mercurial's bookmarks, and there's no direct analog to Mercurial's branches in Git.

Git branches and Mercurial bookmarks are roughly named labels that point at some commit, and which will move forward automatically to point at descendant commits in some circumstances. (Tags in Git are roughly named labels that point at some commit but don't move automatically.)

When a commit is on one or more branches in Git, it just means that the commit is an ancestor of the commit(s) those branch labels point at.

In Git, we can just create new commits floating out in the middle of nowhere -- not on any branch, and not the ancestor of any labeled commit -- that you can retrieve if you know the hash but aren't otherwise reachable or visible to users without using very low-level commands. This isn't normally useful for human users, but it's great for reducing side effects in arc land, since we can do all our work in a sort of temporary scratch area that automatically gets cleaned up later.

In Git, we can also move labels in the remote at push time, i.e. push to a particular label, independent of how a commit is labelled locally. So we can build a floating-in-the-middle of nowhere commit, then push it to "master" in the remote, saying "transfer the actual changes, then update the master label to point at this commit". On its own, this doesn't cause any local side effects (no new labels are created or moved). So we have a lot of control over exactly which local and remote labels get updated, and can do the entire land process with no side effects and no user-visible modifications to repository state until we actually commit to publishing changes. Once we're ready to publish into the remote, we can precisely control all the local and remote label updates.

I think this is generally good, since it means arc land is less surprising in error cases: it doesn't modify any working copy state or labels (unless you look deep, deep under the hood) and if something goes wrong you can almost always just run arc land again to converge to the desired state -- or at least reproduce the problem exactly, so it can be reported upstream and hopefully debugged.

This is trickier in Mercurial since it's harder or impossible to perform some of the operations narrowly in a side-effect-free way. We could perhaps imagine doing the work up to the point of --hold in some branch named __temporary_workspace_ignore_me and then moving commits to the real branch before publishing them, but this would probably be pretty messy in practice and if --hold is to allow the user to directly hg push, we can't leave them in a state where they have to move commits between branches.

Maybe a cleaner approach is storing local metadata ("commit C on label L is an arc land --hold commit from landing Q") and having arc land Q --onto L check if L currently points at some commit known to be a leftover of a previous land operation. Then it could just strip it safely rather than erroring out ("local L has unpublished commits"), and arc cleanup or similar could also strip these, and arc branches and such could show these labels (and arc patch branches/bookmarks, and landed-but-not-deleted local feature labels) as tainted/unclean/ripe for cleanup.

In Git, we can just create new commits floating out in the middle of nowhere -- not on any branch, and not the ancestor of any labeled commit -- that you can retrieve if you know the hash but aren't otherwise reachable or visible to users without using very low-level commands. This isn't normally useful for human users, but it's great for reducing side effects in arc land, since we can do all our work in a sort of temporary scratch area that automatically gets cleaned up later.

Woah, yea I don't think there's anything like this in Mercurial. I take it these disparate trees have some limitations? Do they always apply to the working directory or something?

Mercurial has "secret" phase though that doesn't hide it from the user at all. Maybe the closest thing would be obsolete markers from the evolve extension, though commits with obsolete markers aren't cleaned up automatically and I don't think you can operate on those changesets (I just tried rebasing an obsolete commit onto another obsolete commit and got some error about "no successors").

In Git, we can also move labels in the remote at push time, i.e. push to a particular label, independent of how a commit is labelled locally. So we can build a floating-in-the-middle of nowhere commit, then push it to "master" in the remote, saying "transfer the actual changes, then update the master label to point at this commit". On its own, this doesn't cause any local side effects (no new labels are created or moved). So we have a lot of control over exactly which local and remote labels get updated, and can do the entire land process with no side effects and no user-visible modifications to repository state until we actually commit to publishing changes. Once we're ready to publish into the remote, we can precisely control all the local and remote label updates.

I think this is generally good, since it means arc land is less surprising in error cases: it doesn't modify any working copy state or labels (unless you look deep, deep under the hood) and if something goes wrong you can almost always just run arc land again to converge to the desired state -- or at least reproduce the problem exactly, so it can be reported upstream and hopefully debugged.

Yea this does sound really useful here. Does this mean that when using arc land without --hold it will push from the phantom branch as well? From a quick test it looks like the --hold instructions do this, which I think are the same ones used during land. So then a full arc land would then pull the resulting changes from remote and strip out the local source branch.

This is trickier in Mercurial since it's harder or impossible to perform some of the operations narrowly in a side-effect-free way. We could perhaps imagine doing the work up to the point of --hold in some branch named __temporary_workspace_ignore_me and then moving commits to the real branch before publishing them, but this would probably be pretty messy in practice and if --hold is to allow the user to directly hg push, we can't leave them in a state where they have to move commits between branches.

When using --hold with Mercurial it seems to do roughly the same thing as Git however, A) the phantom branch is more corporeal, and B) the master bookmark is advanced to the new tip (both before and after the modernization update). I think (B) might be where the difference in expectation comes from, as I've (we've) been used to having --hold already operate on the local master branch where in Git that doesn't happen. Additionally I'm also used to having the evolve extension installed where I'm much more quick to prune branches/commits to get them out of my way but can always resurrect them if they're needed.

Maybe a cleaner approach is storing local metadata ("commit C on label L is an arc land --hold commit from landing Q") and having arc land Q --onto L check if L currently points at some commit known to be a leftover of a previous land operation. Then it could just strip it safely rather than erroring out ("local L has unpublished commits"), and arc cleanup or similar could also strip these, and arc branches and such could show these labels (and arc patch branches/bookmarks, and landed-but-not-deleted local feature labels) as tainted/unclean/ripe for cleanup.

Yea that does sound great; having local metadata does seem like a good next big step for Arcanist. Slightly related, I know there's also some desire for arc patch be more aware about existing patched changes (with regards to dependent revisions I think) and it's bookmarks, which I think will also need this type of metadata to operate effectively.

It's taken me a few times reading through but I think I'm getting a better grasp on the arc land design and process. Thanks for taking the time to write up these responses.

I take it these disparate trees have some limitations? Do they always apply to the working directory or something?

There aren't really any limitations that I can think of, Git just exposes lower-level operations than Mercurial does with "plumbing" commands that regular users almost never interact with.

In Git, you can create any low-level object (blob, tree, commit) without interacting with the working copy at all by using git hash-object. All Git objects have an internal string representation, and a third-party program can build that string itself (or use other plumbing commands which help build the string) and then write it into the repository with git hash-object without interacting with the current working copy state at all. git hash-object prints out the hash it wrote, so the caller can continue doing useful work from there.

Arcanist does technically use hash-object in one case (when you arc land into an empty repository, see here), but in other cases it doesn't need to go quite this low level. Various higher-level-but-still-not-normally-user-facing commands and workflows also exist, like git commit-tree (create a commit from a repository state other than the repository state currently in the working copy).

More simply: if you have some normal directory full of files and other directories, you can turn it into a Git state by using git hash-object to build Git "blobs" from each file, then Git "trees" from all the blobs you built, then a commit from the tree, and produce a normal Git commit which exactly represents the original directory state, all without modifying the working copy (or even needing to have those files on disk at all -- the content could be generated synthetically by a program instead). The high-level Git commands mostly do exactly this, by calling the low-level Git commands, at least conceptually if not in practice.

If you do this, all these files and trees and commits and stuff are just "rows in Git's database", and you're just using git insert-a-row-into-the-database, effectively. As long as the data you provide is in approximately the right format for the type of object you're inserting (see T11537 for an unusual case), Git doesn't care if the filesystem objects that data represents never actually existed anywhere on disk.

Does this mean that when using arc land without --hold it will push from the phantom branch as well?

Yes. In Git, we (almost) don't need to modify (user-visible) local state until the remote server tells us that the correct landed state has been successfully published. If something goes wrong before then, we can nearly exit directly into the original state.

(This isn't entirely true because we do change the working copy state to do some merge operations, and there are some details with dirty working copies (where users may have local uncommitted changes or untracked files) that we can most easily deal with by saving/restoring state. So we do still save/restore state, but not very much, and generally don't have to "rewind" any work we've done since all our changes are ephemeral until they hit the remote.)

This is also shedding light on T13107 :)

src/land/engine/ArcanistLandEngine.php
1266

I'm starting to narrow down what's going on and seeking help in #mercurial. This bizarro issue is reproducible without any changes from this diff. I can reproduce this on the master branch by running arc land --hold, pruning the resulting squashed branch, then running arc land again. If I use strip instead of prune to remove the --hold branch then this behavior doesn't happen.

Somehow the rebase command that is running to execute the merge is resulting in the squashed rebased commit being obsolete as well as being tip. This causes the next iteration of merging to go poorly.

src/land/engine/ArcanistLandEngine.php
1266

I reported the issue with reproduction steps in the mailing list. Apparently running rebase --collapse --keep then prune the resulting commit is all that's needed for setup. If you then run the exact same rebase command again Mercurial recognizes that you already created that rebase/squash commit and pruned it, and does not bother creating a new rebase/squash commit. It also moves your bookmark to the obsolete commit. Sometimes it tags the obsolete commit as tip.