Page MenuHomePhabricator

Copy detection in Git is very eager about empty files
Open, WishlistPublic

Description

See PHI754. Git copy detection is completely content based, and it considers empty files to be copies of other empty files if invoked in the right way:

epriestley@orbital ~/tmp $ mkdir scratch
epriestley@orbital ~/tmp $ cd scratch/
epriestley@orbital ~/tmp/scratch $ git init
Initialized empty Git repository in /Users/epriestley/tmp/scratch/.git/
epriestley@orbital ~/tmp/scratch $ touch a.txt
epriestley@orbital ~/tmp/scratch $ git add a.txt
epriestley@orbital ~/tmp/scratch $ git commit -am 'a'
[master (root-commit) 5847103] a
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 a.txt
epriestley@orbital ~/tmp/scratch $ touch b.txt
epriestley@orbital ~/tmp/scratch $ git add b.txt
epriestley@orbital ~/tmp/scratch $ git diff HEAD -M -C --find-copies-harder
diff --git a/a.txt b/b.txt
similarity index 100%
copy from a.txt
copy to b.txt

This propagates into arc also considering empty files to be copies of other empty files.

This isn't exactly a problem, but can be counterintuitive. When humans think of a "copy", they care about why two files have the same content, not just that they have the same content. But Git has no way to know whether you used cp empty.py new.py (which humans would expect to be detected as a copy) or touch new.py (which they would not). Thus, you can end up with arc diff showing that empty files were copied from somewhere when users expect that it won't show this, since they don't think of the files as copies even though the content is identical.

(In contrast, in Subversion, the copy operation is explicit.)

A possible "fix" here is to suggest that the Git upstream special case empty files as exempt from being detected as copies. This behavior is probably never useful, since even if an empty file was created with cp, it's unlikely any user ever cares. It's possible they might bite on this.

Another possible "fix" is to add this special case ourselves, although this may be fairly tricky, since we currently try to stay pretty hands-off the actual diff. Other changes to diff generation might position us better here in the future.

Event Timeline

epriestley triaged this task as Wishlist priority.Jul 13 2018, 4:12 PM
epriestley created this task.

As a special case of this, if you commit an empty a.py file, then add content to it and also add a new empty b.py file in a commit on top of it, the new empty b.py will be detected as a copy of a.py based on the previous (empty) content of the file. I think Git is being pretty reasonable/consistent here, but this is potentially also expectation-defying:

epriestley@orbital ~/tmp/scratch $ git init
Initialized empty Git repository in /Users/epriestley/tmp/scratch/.git/
epriestley@orbital ~/tmp/scratch $ touch a.py
epriestley@orbital ~/tmp/scratch $ git add a.py
epriestley@orbital ~/tmp/scratch $ git commit -am a
[master (root-commit) e472b85] a
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 a.py
epriestley@orbital ~/tmp/scratch $ touch b.py
epriestley@orbital ~/tmp/scratch $ touch c.py
epriestley@orbital ~/tmp/scratch $ echo 'import php;' > a.py
epriestley@orbital ~/tmp/scratch $ git add .
epriestley@orbital ~/tmp/scratch $ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	modified:   a.py
	new file:   b.py
	new file:   c.py

epriestley@orbital ~/tmp/scratch $ git diff HEAD -M -C --find-copies-harder
diff --git a/a.py b/a.py
index e69de29..4902695 100644
--- a/a.py
+++ b/a.py
@@ -0,0 +1 @@
+import php;
diff --git a/a.py b/b.py
similarity index 100%
copy from a.py
copy to b.py
diff --git a/a.py b/c.py
similarity index 100%
copy from a.py
copy to c.py

@jcox do you know how to reproduce arc diff dying when you try to create certain types of diffs that move or remove symlinks? I think that's adjacent, if not identical to what's being talked about here.

T1022 is possibly somewhat-vaguely-adjacent on symlink stuff.