
When multiple commits correspond to a single revision, Differential updates to show one of them arbitrarily
Open, Needs Triage, Public

Description

This seems to happen pretty consistently to our diffs. We use git for VCS if that matters.

I'm unsure of the exact cause, but I have a couple example workflows which may result in this happening:

  1. Create diff
  2. Update the diff with some new changes (perhaps multiple times)
  3. Land the diff with arc
  4. Observe that the changes on the diff do not represent the last revision of the diff or the changes that were actually landed (if different)

  1. Create diff
  2. Rebase branch onto updated master (may also happen if merging master into the branch)
  3. Update diff with changes (arc diff --update <diff id>)
  4. Land the diff with arc
  5. Observe that the changes on the diff do not represent the last revision of the diff or the changes that were actually landed (if different)

It *seems* like the changes shown are the changes for the original revision of the diff. I haven't gone through and verified that's always the case, however.

Event Timeline

jcline raised the priority of this task from to Needs Triage.
jcline updated the task description. (Show Details)
jcline added subscribers: jcline, epriestley.

Do you use the --merge / history.immutable strategy when landing, so the actual commit which lands is a merge commit? My first thought here is that we might not be importing merge commits correctly. I haven't seen issues with the updates on this install, but we use the rebase/squash strategy.

I always use whatever the arc default is. I know some of my coworkers do other things, however.

Ah, looking at arc help, I'm pretty certain we have immutable history enabled. We do get the merge commits with arc land by default.

Particularly, if you land with merges, I could imagine this explaining the behavior you're seeing:

  • You land with a merge, which pushes commits a1, a2, a3 and a4 (the merge) to the remote.
  • We parse a1 first, match it to the revision based on hash, and close the revision. We update the revision with the changes in a1.
  • We parse a2, a3, and a4, but don't update the revision because it has already been closed by a1.

Looking at one of our recent examples of this, the commit it says the diff was closed by was the first commit made on that branch. Seems like a plausible theory.
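The "first commit wins" theory above can be sketched as follows. This is a hypothetical illustration (all names and the data structure are invented, not Phabricator's actual code): once any matching commit closes the revision, later commits are skipped, so the attached changes come from whichever commit happened to be parsed first.

```python
# Hypothetical sketch of why the first parsed commit "wins": once a
# commit closes the revision, later matching commits are ignored.

revisions = {"D123": {"status": "open", "diff_from": None}}

def parse_commit(commit, revision_id):
    rev = revisions[revision_id]
    if rev["status"] == "closed":
        # Already closed by an earlier commit; skip the update.
        return
    rev["status"] = "closed"
    rev["diff_from"] = commit  # the revision now shows this commit's changes

# Commits a1..a4 land together; workers may parse them in any order.
for commit in ["a1", "a2", "a3", "a4"]:
    parse_commit(commit, "D123")

print(revisions["D123"]["diff_from"])  # whichever commit was parsed first
```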

Assuming that's the issue, this isn't especially easy to fix because we parse commits in parallel, and the correct diff is basically "all contiguous attached commits". They may end in a merge commit, but there might also be several contiguous commits at the branch head without any merges. arc land cannot normally produce this, but you can git push that state pretty easily.

In theory, you can also push discontinuous matching ranges, but you'd have to do something abusive for that to work.

One fix might be to generate the diff dynamically, at display time, instead of when we discover the commits. This would let us figure out the range much more easily. A drawback is that we could no longer detect that changes were made prior to commit, but I'm not sure that's useful or accurate anyway, and it's becoming less relevant as we move toward hosting and having a functional "Land This" button in the web UI (T182). This would also be a bit slower.

Another fix would be to get rid of this update-on-commit feature entirely. I'm not sure how useful it is; at least personally, I never use it. The changes are viewable in Diffusion and are linked from the revision after changes land anyway. If anything, I'd rather see the last reviewed version.

A third approach would be to restructure the parsing so that this part of it is sequential. I don't immediately see a clever way to do that which is simple and noninvasive, although we could probably come up with something.

A fourth approach would be to keep the parser parallel, but update the attached diff if the parsed commit is a descendant of the current attached diff. This seems really bad/messy.

Do you actually use this feature? If we just stopped updating the displayed changes on commit, would that be reasonable in your use case?

epriestley renamed this task from "Phabricator often doesn't show the correct changes after a diff is landed" to "When multiple commits correspond to a single revision, Differential updates to show one of them arbitrarily". (Feb 19 2014, 2:50 AM)
epriestley added a project: Differential.

Do you actually use this feature? If we just stopped updating the displayed changes on commit, would that be reasonable in your use case?

Hmm. Since this feature doesn't really work for us (to my knowledge) I don't think we use it. Perhaps other people do, though?

I do think if the feature worked it might be nice to have. Sometimes people request minor changes in their accept comment (which I'm not a fan of) and sometimes people make minor changes before landing (e.g. incrementing a version number). A link to the correct changeset in diffusion would also be entirely adequate, but currently this seems to suffer from the same issue.


I'm not sure I fully understand this suggestion:

One fix might be to generate the diff dynamically, at display time, instead of when we discover the commits. This would let us figure out the range much more easily. A drawback is that we could no longer detect that changes were made prior to commit

I admittedly don't know a great deal about how phabricator works, but it seems to me that if doing this dynamically is easy, why couldn't it be done as a second step after what's currently done?

How I think things work based on what you've said, at a super high level:

  1. Notification of new commits to repo
  2. git pull/fetch
  3. Process all new commits in parallel
    1. Update relevant diffs/diffusion per commit

So, maybe something like:

  1. Notification of new commits to repo
  2. git pull/fetch
  3. Make dict<diffid, commitlist>
  4. Process all new commits in parallel
    1. if commitHash is part of someDiffId (I assume there's already a mechanism for detecting this)
      1. dict[someDiffId].Append(commitHash)
    2. update diffusion
  5. parallel for id in dict
    1. sort dict[id]
    2. update id with dict[id]

Maybe that makes sense and isn't completely crazy for how things are currently designed. It might be overly complex/hard to implement, though. =/
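The group-then-update idea above can be sketched concretely. This is a hypothetical illustration (revision_for and the example commit names are invented stand-ins, not real Phabricator APIs): phase one buckets commits by revision and can run per-commit in parallel, and phase two applies a single ordered update per revision once the full commit list is known.

```python
# Hypothetical sketch of the proposed two-phase flow: bucket commits by
# revision first, then apply one ordered update per revision.
from collections import defaultdict

def revision_for(commit):
    # Stand-in for the existing hash/message matching; here, a fixed map.
    return {"a1": "D123", "a2": "D123", "a3": "D123", "m4": "D123"}.get(commit)

def process_push(new_commits):
    by_revision = defaultdict(list)
    # Phase 1: per-commit work; only appends to the dict.
    for commit in new_commits:
        rev = revision_for(commit)
        if rev is not None:
            by_revision[rev].append(commit)
    # Phase 2: one update per revision, with the full commit list in hand.
    updates = {}
    for rev, commits in by_revision.items():
        updates[rev] = sorted(commits)  # stand-in for topological ordering
    return updates

print(process_push(["a3", "a1", "m4", "a2"]))
```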

A third approach would be to restructure the parsing so that this part of it is sequential. I don't immediately see a clever way to do that which is simple and noninvasive, although we could probably come up with something.

I think my above suggested approach is effectively this?


Apologies for not being familiar enough with the code to know for sure if my suggestions are particularly helpful!

A link to the correct changeset in diffusion would also be entirely adequate, but currently this seems to suffer from the same issue.

We should attach all the commits, and if the last one is a merge commit then clicking it should show the correct changeset, although I think they attach in potentially arbitrary order and there's no way to see the range right now. But I'd potentially rather focus on improving this (e.g., giving you a big "see changes in Diffusion" button which works correctly) than fixing the import-changes-on-commit feature. The big theoretical advantage of import-on-commit is that you can diff the changes as committed against the changes as reviewed, but we could also let you diff between Differential and Diffusion, in theory.

I admittedly don't know a great deal about how phabricator works, but it seems, to me, that if doing this dynamically is easy then why couldn't that be done as a second step to what's currently done?

(See below, I think -- I wrote the bit below first, but I think it answers this question too, since the two issues below are basically the two issues here, beyond some minor slowness from not having the cache prebuilt.)

I think my above suggested approach is effectively this?

Yeah, that's roughly the third approach, although we don't have quite that level of control. Particularly, the parallel steps are happening in completely different processes, possibly on different machines, in arbitrary order, and possibly separated by a large distance in time. The pipeline is built so that it works if you push a million commits at once and you can throw as much hardware at it as you need to, but currently doesn't have any primitives for synchronization on local parts of the commit tree (since we don't need to do this for anything else).

We could still do something like the approach you outline, but can't easily synchronize it. I think the two issues are:

  1. Since we don't have any synchronization, it's inherently race-conditioney, where if you load the revision fast enough during the parsing process you'll see some partial state (where we've, say, appended a2, but not yet appended a3). We'd get to the right state eventually, but probably slowly enough at least some of the time (e.g., for large changes, which take a bit of time to pull out of the VCS) that humans could reasonably hit the partial states.
  2. The first time we detect a corresponding commit, we close the revision and send an email, possibly including a "CHANGED PRIOR TO COMMIT" section. In theory, this section is present only if the author made additional unreviewed changes after the last reviewed version of the revision. We can't make this determination accurately until we're done appending.

We could also mitigate this by removing the changes-as-committed diff when we found a second commit (so the feature would work for squash/rebase workflows, and just turn itself off for merge workflows). However, that would mean the "CHANGED PRIOR TO COMMIT" link would briefly show the wrong thing and then stop working. I guess we could kill it explicitly and put you on a page like "TURNS OUT THIS LINK IS LIES, SORRY :(".

(A workaround is to use a squash/rebase workflow (I argue for squash/rebase in the documentation), but I assume that's not reasonable and that people are using merge workflows because they prefer them and wish to retain checkpoint commits, not because someone accidentally turned them on or something.)

The simplest way I can come up with to synchronize is:

  1. Writing commits to an "unparsed" table when they're discovered (we track this state internally, but not in a way that we can query efficiently, since it's in a bitfield with a bunch of other flags).
  2. Before discovery, check if all of a repository's commits are parsed. If they are, mark each unwalked branch/tag head as walked and queue a "walker" task for it.
  3. This means walker tasks will only execute on commits with fully parsed ancestors, which is sufficient for our purposes, under the assumption that you would never push a1 and a2 to HEAD, then wait a few minutes and push a3 (if you do, we can fix the diff once a3 arrives, but the email won't be right -- this seems OK). The walkers can process their commit, then mark their unwalked parents as walked and queue tasks for them.
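The walker ordering described in the steps above can be sketched as a traversal from branch heads toward parents, so each commit is only walked after all of its descendants. This is a hypothetical illustration (the parents map and function names are invented); the real scheme would queue distributed tasks rather than run a local loop.

```python
# Hypothetical sketch of the walker idea: once discovery is caught up,
# walk from each branch head toward its parents, guaranteeing that a
# commit is walked only after its descendants.
from collections import deque

# Invented linear history: a4 is HEAD, a1 is the oldest commit.
parents = {"a4": ["a3"], "a3": ["a2"], "a2": ["a1"], "a1": []}

def walk_from_heads(heads):
    walked, order = set(), []
    queue = deque(heads)
    while queue:
        commit = queue.popleft()
        if commit in walked:
            continue
        walked.add(commit)
        order.append(commit)  # safe: all descendants already walked
        queue.extend(parents[commit])
    return order

print(walk_from_heads(["a4"]))  # descendants-first order
```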

I think the risks here are:

  • Greater complexity in the already complex import/parse machinery.
  • Since we can only queue walkers when the unparsed list is empty, they may get delayed for an arbitrarily long time in repositories with a high commit rate. At least some users are already pretty sensitive to the time between "arc land" and revisions closing, and this scheme would always make it longer/worse. I think we could use generational markers to mitigate this somewhat? This is starting to add a whole lot of complexity, though.

This would be easier to accept as a good approach if we had other things we could do with in-order parsing that we can't do without it, but I don't think we currently do.

On the other hand, none of the workarounds seem quite reasonable either.


An alternate approach would be to write commit ancestry edges during discovery, then to walk forward to the commit's descendants and do lookups for them. If they belong to the same revision, we skip the revision part for the commit we're on. This is simple and gives us some reusable pieces (notably, a cache for commit relationships) but is somewhat inefficient (we'd try to associate commits and revisions a minimum of twice as often: once when parsing the commit, and once when parsing its parent) and has some race/lock issues if we try to cache it to reduce the inefficiency. Still, this seems more realistic as an approach.

Specifically, having that cache lets us reduce this problem to:

$revision = $this->figureOutRevision();
foreach ($this->getChildCommits() as $commit) {
  if ($commit->figureOutRevision() == $revision) {
    // We'll deal with closing the revision and attaching stuff
    // when we parse the descendant (and maybe we already did that).
    return;
  }
}

// None of the direct descendants belong to the same revision, so
// update it and close it.
$this->attachToRevisionAndClose($revision);

The cost is that we run whatever's in figureOutRevision() a few times, but that stuff is pretty cheap and all cacheable with a little finesse if necessary.

An additional case from IRC, dealing with immutable histories vs just retaining checkpoint commits:

The commits which map to the revision may not be contiguous. An example is a revision for commits X, Y, Z, then local commits U, V, then arc land for merge commit M. Commits X, Y, Z are associated by hash; commit M is associated by message; commits U and V could be considered to implicitly be part of the revision because they're sandwiched by known commits.
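The "sandwiched" association described above can be sketched as a range computation over the landed branch's linear history. This is a hypothetical illustration (function and commit names are invented; the real matching uses hashes and commit messages): everything between the first and last known commit gets swept into the revision, which pulls in the local checkpoint commits U and V.

```python
# Hypothetical sketch of treating commits "sandwiched" between known
# revision commits as implicitly part of the revision.

def revision_commit_range(linear_history, known):
    # linear_history: oldest-to-newest commits on the landed branch.
    matched = [i for i, c in enumerate(linear_history) if c in known]
    if not matched:
        return []
    # Everything between the first and last known commit is included,
    # so unreviewed checkpoint commits like U and V get swept in.
    return linear_history[matched[0]:matched[-1] + 1]

# X, Y, Z matched by hash; M (the merge) matched by message; U, V local.
history = ["X", "Y", "Z", "U", "V", "M"]
print(revision_commit_range(history, {"X", "Y", "Z", "M"}))
```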

The complexity of handling this makes adding a "the whole repository is caught up, deal with it" step, from T4453#13, more attractive. This would be far easier to motivate if we had more stuff that suffered from similar issues, though.

Do you know if this is likely to be fixed in coming releases? Any time I auto-close a review it gets multiple commits closing it, which messes up the diffs. It's a bug I can teach users to work around, but it is fairly obvious when it happens.

No worries, thanks for the link. I think I'll turn off auto-close entirely to avoid the issue until such time it does get resolved.

eadler added a project: Restricted Project. (Sep 15 2016, 6:12 PM)

The complexity of handling this makes adding a "the whole repository is caught up, deal with it" step, from T4453#13, more attractive. This would be far easier to motivate if we had more stuff that suffered from similar issues, though.

I don't really see any other way to deal with this. So the logic becomes:

  • In MessageParser, we write a silent "commit/revision" relationship placeholder into a queue.
  • When imports for a repository are caught up, we queue jobs for any revision in the queue.
  • The UpdateObjectAfterCommit worker holds a global lock and consumes the entire queue.

Then we're stuck figuring out which commit to generate a diff from, but the logic there is probably simple, or at least reasonable logic is simple: if there is exactly one non-merge commit, use that. Otherwise, don't update the diff.
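That selection rule can be sketched in a few lines. This is a hypothetical illustration (the function name and the is_merge predicate are invented stand-ins): with exactly one non-merge commit the choice is unambiguous; otherwise we leave the diff alone.

```python
# Hypothetical sketch of the proposed rule for picking which commit to
# generate the revision's diff from. is_merge is a stand-in predicate.

def pick_diff_commit(commits, is_merge):
    non_merges = [c for c in commits if not is_merge(c)]
    if len(non_merges) == 1:
        return non_merges[0]  # unambiguous: use the single real change
    return None  # ambiguous or merge-only: don't update the diff

merges = {"m4"}
print(pick_diff_commit(["a1", "m4"], lambda c: c in merges))        # a1
print(pick_diff_commit(["a1", "a2", "m4"], lambda c: c in merges))  # None
```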

epriestley moved this task from Next to Backlog on the Differential board.