Page MenuHomePhabricator

Git treats large text files as binary, which probably has security implications around audit and "enormous changes"
Open, LowPublic


To reproduce:

$ seq 1 50000000 > blob.txt
$ git add blob.txt
$ git commit -m blob

This is a ~550MB file but it's obviously just a text file, not a binary file. Despite this, the output looks something like this:

$ git show
commit dbbffa97701b69ce693b54e7c04fda7d0fadff44
Author: epriestley <>
Date:   Fri May 18 10:22:12 2018 -0700


diff --git a/blob1.txt b/blob1.txt
new file mode 100644
index 0000000..728ec8b
Binary files /dev/null and b/blob1.txt differ

I think this is coming out of diff_populate_filespec():

if (s->size > big_file_threshold && s->is_binary == -1) {
	s->is_binary = 1;
	return 0;


unsigned long big_file_threshold = 512 * 1024 * 1024;

This causes some problems. Notably, an attacker can likely exploit this to slip content by "Content" Herald rules even after all the work we do around trying to handle "enormous changes".

Two possible approaches are to use git diff --binary or git diff --text, although these each have a wide range of other implications.

In either case, Git always detects files as binary if they contain \0 in the first 8000 bytes:

#define FIRST_FEW_BYTES 8000
int buffer_is_binary(const char *ptr, unsigned long size)
	if (FIRST_FEW_BYTES < size)
	return !!memchr(ptr, 0, size);

...but you can trivially build a file which, e.g., php will execute that has \0 in the first 8000 bytes:

epriestley@orbital ~/dev/phabricator $ cat effect.php 
<?php echo 2 + 2;
epriestley@orbital ~/dev/phabricator $ head -c1 /dev/zero | cat - effect.php > git-binary.php
epriestley@orbital ~/dev/phabricator $ icat git-binary.php 
\x00<?php echo 2 + 2; \n 

epriestley@orbital ~/dev/phabricator $ php -f git-binary.php 

Git believes this is a binary file:

diff --git a/git-binary.php b/git-binary.php
new file mode 100644
index 000000000..008383529
Binary files /dev/null and b/git-binary.php differ

So the "attack payload" of adding 2 + 2 is currently invisible to Herald content rules. It remains invisible under --binary:

$ git diff HEAD --binary
diff --git a/git-binary.php b/git-binary.php
new file mode 100644
index 0000000000000000000000000000000000000000..00838352919e9e6e4d750863cb8b51085146158c
GIT binary patch
literal 19

literal 0

...although we could decode the new file and pass that to Herald.

It is visible under --text, but this implies some tradeoffs since legitimate binary files turn into a mess:

$ git diff HEAD --text
diff --git a/git-binary.php b/git-binary.php
new file mode 100644
index 000000000..008383529
--- /dev/null
+++ b/git-binary.php
@@ -0,0 +1 @@
+^@<?php echo 2 + 2;

Possibly we should just give up here and say "content rules aren't security rules, there's no way we can actually keep anything out of the repository if the attacker is dedicated". I'm not aware of any installs really relying on content rules in this role.

But it seems like we can probably do better than we're currently doing without going quite that far, perhaps.