
Clean up remaining "Autoclose" behaviors related to "One Revision, Many Commits"
Open, Normal, Public

Description

See T4453. When revisions are landed as a series of commits, Differential updates after discovering the first one. This may not accurately reflect what actually landed. This is a huge mess, but probably never going to get riper than it is right now.

D20451 was incorrectly updated (with an empty diff) after being closed by the stable-promotion merge commit. This is not expected. There are two sub-problems here:

  • The "setContinueOnNoEffect()" behavior changed to make edges write (desirable) but multiple "updated the revision" actions can now write too (not desirable). The easiest fix here is to no-op the update-after-commit if the revision is already closed.
  • See T7333. Empty commits have incorrectly associated with the revision of the parent for a long time, but now that we're doing a better job about being inclusive of "one revision, many commits" this is creating more visible problems.
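
A minimal sketch of that first fix, in Python rather than Phabricator's PHP, using made-up names (Revision, update_revision_after_commit); only the shape of the guard matters:

from dataclasses import dataclass, field

@dataclass
class Revision:
    status: str                       # e.g. "accepted" or "closed"
    commits: list = field(default_factory=list)

def update_revision_after_commit(revision: Revision, commit: str) -> None:
    # The edge write is the desirable half of the behavior change:
    # always record that this commit is associated with the revision.
    revision.commits.append(commit)

    # The fix: no-op the "updated the revision" action when the
    # revision is already closed, so a later commit (like a
    # stable-promotion merge) can't attach an empty diff update.
    if revision.status == "closed":
        return
    print(f"updating revision with diff from {commit}")  # placeholder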

Event Timeline

epriestley triaged this task as Normal priority. May 17 2019, 2:15 PM
epriestley created this task.

(This is mostly for my own notes and contains roughly zero insight.)


After recent changes, commit parsing currently executes in this order:

  • MessageParser
    • UpdateObjectAfterCommit (Revisions)
    • UpdateObjectAfterCommit (Tasks)
    • ChangeParser
      • PublishWorker

That is, the Message parser queues revision updates, task updates, and change parsing. Change parsing queues publishing. Publishing may run before, after, or concurrently with the "UpdateObject" steps.
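
As a rough Python model of that ordering (the worker names mirror the list above; everything else, including the shared queue, is made up):

import queue

# One shared queue stands in for the daemon worker queue; items may
# execute in any order relative to one another once queued.
work: "queue.Queue[str]" = queue.Queue()

def message_parser(commit: str) -> None:
    work.put(f"UpdateObjectAfterCommit(Revisions, {commit})")
    work.put(f"UpdateObjectAfterCommit(Tasks, {commit})")
    work.put(f"ChangeParser({commit})")

def change_parser(commit: str) -> None:
    # Runs from the queue, then queues publishing, so PublishWorker
    # has no ordering relationship with the UpdateObject steps.
    work.put(f"PublishWorker({commit})")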

T4453 describes a problem that this sequence creates: if you push a group of commits to the remote all at once and many of them associate with a particular revision, the first one to make it to UpdateObjectAfterCommit (Revisions) "wins" and closes the revision.

Prior to recent changes, the others would no-op. After recent changes, the others associate properly (write edges) but are not reflected in the automatic diff update.

This is a particular problem when you use a merge strategy and push changes as a merge commit on top of several checkpoint commits. In this flow, history may look like:

X    <-- Merge commit
| \
|  C <-- Typo fix
|  |
|  B <-- Changes from review feedback
|  |
|  A <-- Original Diff
| /
M    <-- Common ancestor on "master"

This corresponds to a user doing this:

  • Creating a local branch from M.
  • Creating commit A.
  • Running arc diff.
  • Creating commit B on top of A in response to reviewer feedback.
  • Running arc diff.
  • Creating commit C on top of B as a minor typo fix.
  • Running arc land or git merge + git push, without re-diffing.

In this case, Differential never sees commit "C", so the full set of changes that should associate with the revision: (a) won't all naturally associate; and (b) aren't contiguous.

Although I don't personally love this workflow, it's a legitimate workflow that I don't think is harmful (I just question the value of retaining checkpoint commits, but many other engineers feel this is pretty valuable).

T4453 also talks about this stuff in more detail. Because of this case (where the commits are noncontiguous), I'm inclined to reject any approach which tries to fix this by looking for parents or children.


Finding Related Commits by Finding Parents/Children

That is, the worker queue will usually (but not always) process commit A first. One possible approach is to look at the parents and children of A, try to find commits which associate with the same revision, and then attach them as a big group. However, the "typo fix that never got diffed" use case means that commits "B" and "X" may be separated by an arbitrarily large number of non-associating commits: if a user does 35 typo fixes, we'd have to look 35 commits deep, and we can't stop the search just because we find a non-associating commit (or, in general, any number of non-associating commits). We'd also have to start this search from each commit and resolve which searches overlap. Although this is probably navigable by imposing some kind of hard limit and some reasonable boundary conditions, I suspect it doesn't come out as the least messy approach.

The search part probably isn't really that bad: if we find a commit that we already know about, we can stop the search; if we find a commit with no descendants, we can stop the search; and if we find a commit which associates with a different revision, we can stop the search. We can also stop after reaching some "reasonable" depth, like 32, and just say "if you made 32+ undiffed typo fixes before landing, sorry, we won't get it right, also what are you doing". This would generally lead us to the right group of commits fairly easily, I think, and not send us down a rabbit hole of searching 900-ply deep, except in cases where you built a pathological repository with 32-deep octopus merges of a hundred commits at every level just to break the algorithm. And we could put a stop in place for that, too.

But this is a lot of weird, hard/annoying-to-test edge cases and general complexity, and I don't want to execute the test plan for "created a pathological repository full of octopus merges just to make life difficult", and we still have to handle "A" and "B" starting searches at the same time.
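
For concreteness, a sketch of what this rejected search might look like, in Python with hypothetical helpers (parents_of, children_of, revision_of, already_known) standing in for real graph and database queries:

from collections import deque

MAX_DEPTH = 32  # "if you made 32+ undiffed typo fixes, sorry"

def find_commit_group(start, revision, parents_of, children_of,
                      revision_of, already_known):
    group = {start}
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        commit, depth = frontier.popleft()
        if depth >= MAX_DEPTH:
            continue  # hard limit on how deep we're willing to look
        for adjacent in parents_of(commit) + children_of(commit):
            if adjacent in seen or already_known(adjacent):
                continue  # known commits end the search in that direction
            seen.add(adjacent)
            other = revision_of(adjacent)
            if other is not None and other != revision:
                continue  # a different revision also ends the search
            if other == revision:
                group.add(adjacent)
            # Non-associating commits do NOT end the search: keep
            # walking through them, up to the depth limit.
            frontier.append((adjacent, depth + 1))
    return group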

Instead:

Finding Related Commits by Batch-Processing Imports

We could bake in a different assumption: if you're landing a commit for revision X, you're including all commits for X in the same push. This probably isn't always true, but it's a pretty reasonable assumption.

Then, the new flow looks like this, per-commit:

  • MessageParser
  • Add the revision/task associations ONLY.
  • ChangeParser
  • PublishWorker

And this, fired each time a repository becomes fully imported:

  • Take all the new associations.
  • Queue a worker for each destination object.
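
A rough Python sketch of the two halves, under that "everything lands in one push" assumption (all names here are made up, not Phabricator APIs):

pending_associations: list[tuple[str, str]] = []  # (commit, object PHID)

def on_commit_message_parsed(commit: str, related_phids: list[str]) -> None:
    # Per-commit step: record the revision/task associations ONLY;
    # don't queue any object-update workers yet.
    for phid in related_phids:
        pending_associations.append((commit, phid))

def on_repository_fully_imported() -> None:
    # Group the new associations by destination object and queue one
    # worker per object, so each revision or task sees all of its
    # commits from the push at once.
    by_object: dict[str, list[str]] = {}
    for commit, phid in pending_associations:
        by_object.setdefault(phid, []).append(commit)
    pending_associations.clear()
    for phid, commits in by_object.items():
        queue_update_worker(phid, commits)

def queue_update_worker(phid: str, commits: list[str]) -> None:
    print(f"UpdateObjectAfterCommit({phid}, {commits})")  # placeholder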

A simple version of this would be to write the edges in the MessageParser, but not actually queue the workers yet. The problem with this is that we don't have an efficient way to go back and do the processing afterwards: we can't easily tell which commits just got new edges versus which commits had existing edges.

But we can solve this by adding a new table which queues related-object processing, and we should be okay. We'll probably end up with a fair amount of actual code, but it's all pretty straightforward.

The problems I can see are:

  • Commit publishing will fire before related object updates, so users may get mail pointing them at a commit which isn't fully associated with a revision yet, and the revision may not have closed/updated yet. But this actually already happens after the most recent round of changes and does not seem very concerning in general.
  • Herald fields can't use anything written by the "UpdateObject" flow, but they already don't.
  • This further muddies the already-very-complex import pipeline, and particularly muddies the very-hard-to-understand sequence of step dependencies in that pipeline.

A particular consequence of this is that if you use a Herald rule to "Run build..." on a commit, your build may run before the commit and revision get an edge. This isn't great. None of these are really problems, but in a perfect world we'd probably sequence the publishing last, so we'd do:

  1. Parse all messages and changes in parallel.
  2. Run all the related object updates in parallel.
  3. Publish all the commits in parallel.
  4. Also, magically make steps (2) and (3) atomic?

The "magical atomicity" is because even if we do this in multiple batched stages to get all the edges written before we run Herald on the commits, we get the opposite problem: commits will get attached to tasks/revisions, but users may receive mail about them (or be able to see them in the web UI) before the publishing happens for the commits, so they don't see an accurate view of audit/build rules/whatever.

This all seems very marginal, and we could conceivably write the edge halves separately if we really need to, so I'm not terribly worried about any of it.


So we end up with:

  • A new table for relating commits to other objects.
  • This table has a "processed / not processed" flag of some sort but otherwise looks mostly like an edge table.
  • We can move the revision association reasons (why commit X is associated with revision Y) into this table, instead of storing them as secret metadata on edge transactions, and make them a permanent, observable part of the relationship (see the sketch after this list).
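
As a sketch of what a row might hold (field names are hypothetical):

from dataclasses import dataclass

@dataclass
class CommitRelationshipRow:
    commit_phid: str       # the commit being imported
    object_phid: str       # related revision, task, etc.
    reason: str            # e.g. 'commit message says "Fixes T123"'
    is_processed: bool = False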

See also PHI1318. The new behavior here has tightened our rules about which users we act as.

In the general case, we have situations like this:

  • Alice can not see T123.
  • Alice pushes a commit which Fixes T123.
  • What should we do?

In the long term, I think the correct answer should be "do not close the task". Even though committers and authors are largely made-up and client-controlled, this behavior is generally more consistent from a policy viewpoint than "close the task", and we should try to do the "most correct" thing here and then support it elsewhere.

We've moved toward this behavior, but there are some issues:

  • If a commit is pushed as a user with no permission, as a mailing list, or as a disabled user, we currently do not recover gracefully from the UpdateObjectAfterCommit task.
  • We'd prefer to write the edge ("Commit X is related to task Y") even if we can't update the object.

So this probably slots into the other plans here, like (see the sketch after this list):

  • When we process the pending items in this table, we first write edges to the commit as the committer/author, then let the task/revision edges pop in as inverse edits (which are allowed to bypass policy rules). This should give us good behavior in terms of preserving "commit X is related to task Y".
  • Just hide all these actions (or at least the inverse edits) unconditionally from feed/transactions. I think this is already our behavior after tweaks.
  • Then, apply the actual edits ("fix task", "close revision") to the objects, as the committer/author.
    • When these edits fail because of a policy issue, recover gracefully (permanently fail the edit and mark it as complete).
    • Some day, leave a record of this ("This commit says it fixes task X, but the commit author, Y, can not edit that task, so the task was not updated.").
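
A minimal sketch of processing one pending item along these lines, with hypothetical helpers passed in for the policy-bypassing edge write and the policy-checked edit:

class PolicyException(Exception):
    pass

def process_pending_item(item, write_edge_bypassing_policy,
                         apply_edit_as_user, mark_processed):
    # Step 1: record "commit X is related to object Y" regardless of
    # the acting user's permissions; the inverse edit may bypass
    # policy rules, and these actions stay hidden from feed.
    write_edge_bypassing_policy(item.commit_phid, item.object_phid)

    # Step 2: apply the real edit ("fix task", "close revision") as
    # the committer/author. A policy failure is permanent: recover
    # gracefully instead of retrying forever.
    try:
        apply_edit_as_user(item.actor_phid, item.object_phid, item.edit)
    except PolicyException:
        # Some day: record "this commit says it fixes task X, but the
        # author can not edit that task, so it was not updated".
        pass
    mark_processed(item)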

See also T13552, which modifies the above discussion. The "Update" steps now happen after the "Publish" step.

A general concern with "batch processing" is that it's quite bad if one commit failing to import can stall the entire repository forever.

We can potentially fix that by marking all commits that the DiscoveryEngine finds in a single operation with a group number, then doing bulk-processing on commits group-by-group, instead of all-or-nothing. A single commit failing would then only stall its commit group, which is often just that commit.

So the flow would look like this (see the sketch after this list):

  • When we start discovery, generate a group number.
  • When we discover a commit, queue a <groupNumber, commitID> item.
  • Break "update object after commit" into "queue" + "process" steps.
  • After publishing a commit, check if its group is fully published. If it is, flush the whole "update object after commit" queue for that group.
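
A rough sketch of the group mechanics in Python (structures and names are made up); a failing commit now stalls only its own group's queue:

from collections import defaultdict
from itertools import count

_group_numbers = count(1)
unpublished: dict[int, set[str]] = defaultdict(set)   # group -> commits
update_queue: dict[int, list[str]] = defaultdict(list)

def start_discovery() -> int:
    # When we start discovery, generate a group number.
    return next(_group_numbers)

def discover_commit(group: int, commit: str) -> None:
    # Queue a <groupNumber, commitID> item; this is the "queue" half
    # of "update object after commit".
    unpublished[group].add(commit)
    update_queue[group].append(commit)

def on_commit_published(group: int, commit: str) -> None:
    unpublished[group].discard(commit)
    if not unpublished[group]:
        # Group fully published: flush the "process" half for every
        # commit in the group. A failed commit stalls only its group.
        for queued in update_queue.pop(group):
            print(f"UpdateObjectAfterCommit({queued})")  # placeholder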

This is a bit messy, but seems tractable enough.