Page MenuHomePhabricator

Build a changeset cache for Diffusion to improve move/copy detection behvaior
Open, NormalPublic

Description

If you have a commit that both moves a file and modifies it, and you create a revision for the commit, the Differential UI helpfully shows line-level diffs indicating what lines in the moved file were also modified.

For example, look at the diff for PhutilTestCase.php in in D12665#a45fc076.

However, after committing, if you view the commit in Diffusion then the entire file appears green and you cannot see the line-level diffs: rARC753705b2#a45fc076.

The reason this is a problem is for teams that use a post-commit review (Audit) workflow. Audits for files that were both moved and changed in a single commit are difficult to do in the web UI because you cannot see line-level diff info. Instead, reviewers need to view the commit using git command-line tools, or committers must remember to create two separate commits (one to rename, one to modify).

Event Timeline

This is because we hate code reuse.

I think there are two issues here:

  • detectCopiedCode(): We don't call our application-level move/copy detection code from Diffusion, for no real reason. This draws the yellow/orange bars.
  • Lost Move/Copy Path Data: We aren't pulling the file-level move/copy data properly. This does the body diff and header ("This file was copied from...").

Fixing detectCopiedCode() appears simple, we're basically just missing a call to detectCopiedCode(). This is possibly made slightly complicated because we should probably make this option a default, as I believe there is no situation in which we don't want to perform this detection (even, e.g., Paste diffs would benefit from it).

I think we're losing the "move" data because we do this:

  • From the main process, we run git diff --various-flags A B --, where --various-flags include -M and -C to detect moves and copies. This returns the correct results, and renders the table of contents properly (e.g., in the second link above, the table of contents shows that the file was moved, even though the file itself does not).
  • We return the page, then have the client queue up a bunch of Ajax requests to render the actual diffs.
  • However, these run git diff --various-flags A B -- pathname.c since each is rendering the diff for exactly one path.

The behavior of the -M / -C flags in Git is, as described by the --find-copies-harder paragraph in git help diff:

For performance reasons, by default, -C option finds copies only if the original file of the copy was modified in the same changeset.

I suspect "changeset" in this context means "active working set of changes examined by the command", and that by specifying a path we're inadvertently dropping the copy/move source from the working set.

Empirically, this seems to be accurate. Compare generating the whole diff, then grepping for the path:

$ git diff -M -C --raw 753705b2^ 753705b2 -- | grep src/unit/engine/phutil/PhutilTestCase.php
:100644 100644 54aa730... a3f9851... R097	src/unit/engine/phutil/ArcanistPhutilTestCase.php	src/unit/engine/phutil/PhutilTestCase.php

...to generating the diff for only the path;

$ git diff -M -C --raw 753705b2^ 753705b2 --  src/unit/engine/phutil/PhutilTestCase.php
:000000 100644 0000000... a3f9851... A  src/unit/engine/phutil/PhutilTestCase.php

In the first case, git has detected the move (file is marked R097, for "rename"). In the latter case, it has not (file is marked A, for "add").

I think this is at least arguably a bug in Git, since this meaning of -M and -C seems fairly unintuitive and nonsensical, but like the kind of thing that could happen by accident if you didn't think about this case. Instead, I would expect these flags to examine all the paths in the selected commits -- that is, for a list of paths to have no impact on the behavior of -M or -C.

The documentation is also not exactly forthcoming about this interaction. However, it's possible that this is intended and some use case exists for git diff -M -C -- A B C to find copies/moves within some known set of paths or whatever, or that this is a bug but hard to fix. It is unlikely that this has much of an impact on the overwhelming majority of users.

Some things we could do here:

  1. Pass information about moves/copies to the calls and have them mangle the results.
  2. Try to change the behavior of Git, as the current behavior seems reasonably likely to be unintended.
  3. Keep the architecture as written, but grab the entire diff, then throw away the parts we don't need.
  4. Grab the whole diff up front, fill some sort of cache, then just read the cache in the Ajax calls.
  5. Use --find-copies-harder to make Git search for moves much more aggressively.

(1) would work if we only needed to fix the headers ("This file was copied from...") but we need to fix the diff bodies too.

(2) might work but probably has a timeline of 200 years.

(3) is messy because we don't have a parser which keeps the diff as blobs of text: it's a one-step parse from raw text diff into an object representation. We can't trivially reconstitute this or ship it over the wire, and it isn't useful to third-party clients if we do.

(4) is probably the best approach. This is similar to what Differential does, interacts with fixing detectCopiedCode(), probably improves overall performance, and doesn't really run us into any major pitfalls.

(5) leads to very bad worst-case behavior (see D16266) and isn't a good/scalable approach. It is possible that we should be using this flag in more cases to simply improve results, but that's not really directly relevant to this issue.

Also, thanks for providing pointers to reproduction cases on this install. Being able to fiddle with those commits locally was helpful in confirming the behavior of -M and -C with paths.

This may also be a good opportunity to fix T10423 since we'd end up testing the same code.

epriestley renamed this task from Audits don't show line-level diffs for changed and moved files to Build a changeset cache for Diffusion to improve move/copy detection behvaior.Aug 16 2016, 10:59 PM
epriestley triaged this task as Normal priority.
epriestley added a project: Diffusion.
epriestley moved this task from Backlog to v3 on the Diffusion board.
epriestley edited projects, added Diffusion (v3); removed Diffusion.

Thanks for the detailed investigation -- glad we've root-caused this and proposed a solution :)

T10423 was resolved indirectly (in connection with T11524) so we've sort of inched forward here.