To reproduce:
$ seq 1 50000000 > blob.txt $ git add blob.txt $ git commit -m blob
This is a ~550MB file but it's obviously just a text file, not a binary file. Despite this, the output looks something like this:
$ git show commit dbbffa97701b69ce693b54e7c04fda7d0fadff44 Author: epriestley <git@epriestley.com> Date: Fri May 18 10:22:12 2018 -0700 blob1 diff --git a/blob1.txt b/blob1.txt new file mode 100644 index 0000000..728ec8b Binary files /dev/null and b/blob1.txt differ
I think this is coming out of diff_populate_filespec():
if (s->size > big_file_threshold && s->is_binary == -1) { s->is_binary = 1; return 0; }
...where:
unsigned long big_file_threshold = 512 * 1024 * 1024;
This causes some problems. Notably, an attacker can likely exploit this to slip content by "Content" Herald rules even after all the work we do around trying to handle "enormous changes".
Two possible approaches are to use git diff --binary or git diff --text, although these each have a wide range of other implications.
In either case, Git always detects files as binary if they contain \0 in the first 8000 bytes:
#define FIRST_FEW_BYTES 8000 int buffer_is_binary(const char *ptr, unsigned long size) { if (FIRST_FEW_BYTES < size) size = FIRST_FEW_BYTES; return !!memchr(ptr, 0, size); }
...but you can trivially build a file which, e.g., php will execute that has \0 in the first 8000 bytes:
epriestley@orbital ~/dev/phabricator $ cat effect.php <?php echo 2 + 2; epriestley@orbital ~/dev/phabricator $ head -c1 /dev/zero | cat - effect.php > git-binary.php epriestley@orbital ~/dev/phabricator $ icat git-binary.php \x00<?php echo 2 + 2; \n epriestley@orbital ~/dev/phabricator $ php -f git-binary.php 4
Git believes this is a binary file:
diff --git a/git-binary.php b/git-binary.php new file mode 100644 index 000000000..008383529 Binary files /dev/null and b/git-binary.php differ
So the "attack payload" of adding 2 + 2 is currently invisible to Herald content rules. It remains invisible under --binary:
$ git diff HEAD --binary diff --git a/git-binary.php b/git-binary.php new file mode 100644 index 0000000000000000000000000000000000000000..00838352919e9e6e4d750863cb8b51085146158c GIT binary patch literal 19 acmZRuu`kFdP)JSA$X75@&{i<A<^ljS(FC^u literal 0 HcmV?d00001
...although we could decode the new file and pass that to Herald.
It is visible under --text, but this implies some tradeoffs since legitimate binary files turn into a mess:
$ git diff HEAD --text diff --git a/git-binary.php b/git-binary.php new file mode 100644 index 000000000..008383529 --- /dev/null +++ b/git-binary.php @@ -0,0 +1 @@ +^@<?php echo 2 + 2;
Possibly we should just give up here and say "content rules aren't security rules, there's no way we can actually keep anything out of the repository if the attacker is dedicated". I'm not aware of any installs really relying on content rules in this role.
But it seems like we can probably do better than we're currently doing without going quite that far, perhaps.