
Support (but do not actually enable) a maximum file size limit for Git repositories
ClosedPublic

Authored by epriestley on Nov 16 2018, 6:15 PM.
Details

Summary

Depends on D19816. Ref T13216. See PHI908. See PHI750. In a few cases, users have pushed multi-gigabyte files full of various things that probably shouldn't be version controlled. This tends to create various headaches.

Add support for limiting the maximum size of any object. Specifically, we:

  • list all the objects each commit touches;
  • check their size after the commit applies;
  • if any object is over the limit, reject the commit.

This change doesn't actually hook the limit up (the limit is always "0", i.e. unlimited), and doesn't have Mercurial or SVN support. The actual parser bit would probably be better in some other Query/Parser class eventually, too. But it at least roughly works.
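
Roughly, the flow looks like the sketch below. This is a simplified illustration of the steps above rather than the actual hook code; execx(), phutil_split_lines() and pht() are the usual Phabricator helpers, and the function name, $commit, and $limit wiring are just placeholders.

<?php

require_once 'scripts/init/init-script.php';

// Simplified sketch (not the real hook code): list the objects a commit
// touches, measure them after the commit applies, and reject anything
// over the limit. Run from inside a Git working copy.
function checkCommitFileSizes($commit, $limit) {
  // Step 1: list (new object hash, path) pairs the commit touches.
  list($raw) = execx('git diff-tree -r --no-commit-id %s --', $commit);

  $paths_by_hash = array();
  foreach (phutil_split_lines($raw, false) as $line) {
    if (!strlen($line)) {
      continue;
    }
    list($meta, $path) = explode("\t", $line, 2);
    $fields = explode(' ', $meta);
    $new_hash = $fields[3];

    // Deletions get the all-zero "null" hash and can never hit the limit.
    if ($new_hash === str_repeat('0', 40)) {
      continue;
    }

    $paths_by_hash[$new_hash] = $path;
  }

  if (!$paths_by_hash || !$limit) {
    return;
  }

  // Step 2: ask Git for the size of each object after the commit applies.
  list($sizes) = execx(
    'echo %s | git cat-file --batch-check=%s',
    implode("\n", array_keys($paths_by_hash)),
    '%(objectname) %(objectsize)');

  // Step 3: if any object is over the limit, reject the commit.
  foreach (phutil_split_lines(trim($sizes), false) as $line) {
    list($hash, $size) = explode(' ', $line, 2);
    if ((int)$size > $limit) {
      throw new Exception(pht(
        'OVERSIZED FILE: commit "%s" makes "%s" %d bytes long, but the '.
        'limit is %d bytes.',
        $commit,
        $paths_by_hash[$hash],
        (int)$size,
        $limit));
    }
  }
}

// Example: reject any file over 1MB in the commit at HEAD.
checkCommitFileSizes('HEAD', 1024 * 1024);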

Test Plan

Changed the hard-coded limit to other values, tried to push stuff, got sensible results:

$ echo pew >> magic_missile.txt && git commit -am pew && git push
[master 98d07af] pew
 1 file changed, 1 insertion(+)
# Push received by "local.phacility.net", forwarding to cluster host.
# Acquiring write lock for repository "spellbook"...
# Acquired write lock immediately.
# Acquiring read lock for repository "spellbook" on device "local.phacility.net"...
# Acquired read lock immediately.
# Device "local.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local.phacility.net".
Counting objects: 49, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (48/48), done.
Writing objects: 100% (49/49), 3.44 KiB | 1.72 MiB/s, done.
Total 49 (delta 30), reused 0 (delta 0)
remote: +---------------------------------------------------------------+
remote: |      * * * PUSH REJECTED BY EVIL DRAGON BUREAUCRATS * * *     |
remote: +---------------------------------------------------------------+
remote:              \
remote:               \                    ^    /^
remote:                \                  / \  // \
remote:                 \   |\___/|      /   \//  .\
remote:                  \  /V  V  \__  /    //  | \ \           *----*
remote:                    /     /  \/_/    //   |  \  \          \   |
remote:                    @___@`    \/_   //    |   \   \         \/\ \
remote:                   0/0/|       \/_ //     |    \    \         \  \
remote:               0/0/0/0/|        \///      |     \     \       |  |
remote:            0/0/0/0/0/_|_ /   (  //       |      \     _\     |  /
remote:         0/0/0/0/0/0/`/,_ _ _/  ) ; -.    |    _ _\.-~       /   /
remote:                     ,-}        _      *-.|.-~-.           .~    ~
remote:   *     \__/         `/\      /                 ~-. _ .-~      /
remote:    \____(Oo)            *.   }            {                   /
remote:    (    (..)           .----~-.\        \-`                 .~
remote:    //___\\  \ DENIED!  ///.----..<        \             _ -~
remote:   //     \\                ///-._ _ _ _ _ _ _{^ - - - - ~
remote: 
remote: 
remote: OVERSIZED FILE
remote: This repository ("spellbook") is configured with a maximum individual file size limit, but you are pushing a change ("98d07af863e799509e7c3a639404d216f9fc79c7") which causes the size of a file ("magic_missile.txt") to exceed the limit. The commit makes the file 317 bytes long, but the limit for this repository is 1 bytes.
remote: 
# Released cluster write lock.
To ssh://local.phacility.com/source/spellbook.git
 ! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'ssh://epriestley@local.phacility.com/source/spellbook.git'

Diff Detail

Repository
rP Phabricator
Branch
reposize1
Lint
Lint Passed
Severity: Advice
Location: src/applications/diffusion/engine/DiffusionCommitHookEngine.php:1271
Code: XHP16
Message: TODO Comment
Unit
Tests Passed
Build Status
Buildable 21176
Build 28794: Run Core Tests
Build 28793: arc lint + arc unit

Event Timeline

Some testing comments:

  • Commits that touch a ton of files? I just wonder if there's any limit to the input that git diff-tree and git cat-file will accept.
  • Commits that delete files? Not sure whether these commands do anything goofy on deleted files; in theory they should still work, since the underlying objects should still exist in the object database. As a micro-optimization, we could skip checking deletions entirely by (I think) checking the "change mode" output from git diff-tree (a small sketch of that check follows this list).
  • Races on git repack? Are we worried about git commands that reorganize the object database on disk running concurrently with these commands? (See the question about objectsize vs objectsize:disk below.)
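
For reference, the "change mode" flag mentioned above is the status letter in the fifth metadata field of each git diff-tree -r line; a hypothetical check (not part of this change) could look like:

<?php

// Hypothetical helper (not part of this change): decide whether a raw
// "git diff-tree -r" line is a deletion by looking at the status letter,
// rather than by resolving the new object hash.
function isDeletionLine($diff_tree_line) {
  list($meta) = explode("\t", $diff_tree_line, 2);
  // Metadata fields: :oldmode newmode oldhash newhash status
  $fields = explode(' ', $meta);
  return ($fields[4][0] === 'D');
}

// Example raw line for a deleted file:
var_dump(isDeletionLine(
  ':100644 000000 1111111111111111111111111111111111111111 '.
  '0000000000000000000000000000000000000000 D'."\tmagic_missile.txt"));
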
src/applications/diffusion/engine/DiffusionCommitHookEngine.php
1367

We definitely want objectsize and not objectsize:disk? From the docs:

Note that the sizes of objects on disk are reported accurately, but care should be taken in drawing conclusions about which refs or objects are responsible for disk usage. The size of a packed non-delta object may be much larger than the size of objects which delta against it, but the choice of which object is the base and which is the delta is arbitrary and is subject to change during a repack. Note also that multiple copies of an object may be present in the object database; in this case, it is undefined which copy’s size will be reported.

From that info, it seems like using objectsize will allow small changes to existing huge objects to go through (which is arguably a good thing in the interest of backwards compatibility with existing huge objects?), while using objectsize:disk will start throwing exceptions on any change to an existing large object.

1384

Does git cat-file promise that output will be ordered identically to input? I guess it would have to be pretty pathological to take input via STDIN and then permute the output ordering, though.

1392–1395

I'm not wild about an 80-line case statement. Maybe just throw at the top of the function and avoid this level of indentation entirely? I usually like to separate control flow from complicated logic like the guts of this function.
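
A minimal sketch of the shape I mean, with illustrative names only (not the code under review):

<?php

// Illustrative shape only: reject unsupported VCS types up front so the
// complicated Git-specific logic doesn't live inside a long switch arm.
function loadCommitFileSizes($vcs, $commit_hash) {
  if ($vcs !== 'git') {
    throw new Exception(
      sprintf('File size limits are not implemented for "%s".', $vcs));
  }

  // ...long "git diff-tree" / "git cat-file" logic at this indentation
  // level, instead of nested inside a case statement...
  return array();
}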

Commits that touch a ton of files? I just wonder if there's any limit to the input that git diff-tree and git cat-file will accept.

I think this will first break when the total output size of git diff-tree is larger than 2GB and we hit the PHP 2GB string size limit.

This can be a single file (see T10832 for a previous Git vulnerability with 2GB paths) but it's unlikely that a 2GB path is legitimate.

For normal path lengths (producing ~512 byte lines) this is ~4M files in a single commit. It seems unlikely that we'll encounter this. If we do, the cause (or, at least, the reproduction case) should be reasonably obvious from examining the commit and seeing that it touches 4M files. We could switch to LinesOfALargeExecFuture to work around this and probably increase the limit to "however much memory the machine has", or we could add an output limit and throw if a commit has more than 1GB of paths AND has filesize limits enabled ("Touch fewer paths or disable filesize limits to push this unusually large change.")

(Maybe I'll just switch this to LinesOfALargeExecFuture when I move it into a separate LowLevelQuery anyway since it's not an especially complicated change, although the internal \0 delimiting is slightly awkward.)
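
A rough sketch of what the streaming version might look like, assuming the LinesOfALargeExecFuture delimiter handling works the way I expect (this isn't part of this change):

<?php

require_once 'scripts/init/init-script.php';

// Sketch only: stream "git diff-tree -z" output record-by-record instead
// of buffering the whole thing into one PHP string. With "-z", metadata
// and path records alternate, which is the slightly awkward part.
$commit = 'HEAD';

$future = new ExecFuture('git diff-tree -z -r --no-commit-id %s --', $commit);

$records = new LinesOfALargeExecFuture($future);
$records->setDelimiter("\0");

foreach ($records as $record) {
  // Consume each \0-delimited record here without ever holding the full
  // diff-tree output in memory.
  echo $record."\n";
}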

Commits that delete files?

These already work properly: the new object for a deleted file is the null hash (0000...0000), and we explicitly short-circuit that into a file size of 0 bytes. (There's no particular reason for this specific behavior right now -- it just makes sure that we never hit any filesize limit for deleted files -- but if we later add a "maximum number of files a commit may touch" limit or something, it might be useful to care about this in some way.)
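
Concretely, the short-circuit amounts to something like this (a sketch with illustrative names, not the exact hook code):

<?php

// Sketch of the deletion short-circuit described above: a deleted path
// shows up in "git diff-tree" with the all-zero "null" hash as its new
// object, and we treat that as a 0-byte file so deletions can never
// trip a maximum-file-size limit.
function getEffectiveFileSize($new_hash, array $sizes_by_hash) {
  if ($new_hash === str_repeat('0', 40)) {
    return 0;
  }
  return (int)$sizes_by_hash[$new_hash];
}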

Are we worried about git commands that reorganize the object DB on disk running concurrently with these commands?

I don't believe this is a concern.

I think the objectsize stuff is talking about using %(objectsize) when the object is a commit. Commits don't have an unambiguous "size" and I'm guessing (?) that their "size" is some sort of amalgamation of things that are size-like plus some hand waving.

The objectsize value when the object is a blob consistently appears to be the number of bytes in the file, as we'd expect. In my local working copy of Phabricator, every blob has the same size on disk as %(objectsize) reports (I'm skipping symlinks because filesize() resolves them; I could probably use lstat() or something instead to get their disk size accurately). The value of %(objectsize:disk) is some kind of weird internal Git artifact:

<?php

require_once 'scripts/init/init-script.php';

// List every blob in HEAD as "<hash> <path>".
list($paths) = execx('git ls-tree -r HEAD | awk %s', '{print $3 " " $4}');
$paths = trim($paths);
$paths = explode("\n", $paths);

foreach ($paths as $path) {
  list($hash, $path_name) = explode(' ', $path, 2);

  // Skip symlinks: filesize() resolves them to their targets.
  if (is_link($path_name)) {
    continue;
  }
  $disk_size = filesize($path_name);

  // Ask Git for both size fields for this blob.
  list($git_sizes) = execx(
    'echo %s | git cat-file --batch-check=%s',
    $hash,
    '%(objectsize) %(objectsize:disk)');
  $git_sizes = trim($git_sizes);
  list($git_object_size, $git_disk_size) = explode(' ', $git_sizes, 2);

  // Flag any row where the working-copy size disagrees with %(objectsize).
  echo sprintf(
    "%4s % 16d % 16d % 16d %s\n",
    ((int)$disk_size === (int)$git_object_size ? '' : '****'),
    $disk_size,
    $git_object_size,
    $git_disk_size,
    $path_name);
}

...produces:

$ php -f test.php 
                  109              109               98 .arcconfig
                 1814             1814              132 .arclint
                   98               98               75 .arcunit
                  678              678              287 .editorconfig
                  858              858              453 .gitignore
                10173            10173               24 LICENSE
                  739              739              378 NOTICE
                 1221             1221              246 README.md
                 2010             2010              145 conf/__init_conf__.php
                  713              713              349 conf/aphlict/README
                  450              450              182 conf/aphlict/aphlict.default.json
                    0                0                9 conf/keys/.keep
                   44               44               54 conf/local/README
                 1584             1584              789 externals/JsShrink/jsShrink.php
                  695              695              409 externals/JsShrink/readme.txt
                22840            22840              689 externals/amazon-ses/ses.php
                  931              931              578 externals/cowsay/ChangeLog
                  385              385              262 externals/cowsay/INSTALL
                 1116             1116              610 externals/cowsay/LICENSE
                  445              445              286 externals/cowsay/MANIFEST
                 1610             1610              894 externals/cowsay/README
                  879              879              466 externals/cowsay/Wrap.pm.diff
                  123              123               99 externals/cowsay/cows/bunny.cow
...<many more lines of output with no differences between disk size and `%(objectsize)`> ...

it seems like using objectsize will allow small changes to existing huge objects to go through

I think this would be bad (even if the behavior itself isn't objectively "bad", it's definitely confusing/unexpected) but, consistent with the behavior of %(objectsize) above, it doesn't appear to be what happens in practice. %(objectsize) always reports the number of bytes in the file, so making a 1-byte change to a 1GB file produces an error if the filesize limit is <1GB. Note that you can still remove the file, and your change goes through if you delete enough bytes to get the file under the limit. You just aren't allowed to end up with any 1GB+ files on disk after your change applies.

One case we might want to refine is that we could let you touch file xyz.c if the old version of xyz.c was already over the limit: once you sneak a 1GB file into the repository, you might be allowed to keep modifying it. But I'd like to see a really, really good use case for this before adding support.
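
If we ever did want that, the check would probably look something like this sketch (not implemented; names are illustrative):

<?php

// Sketch of the possible "grandfathering" refinement: only reject a
// change if it pushes a file over the limit, so files that were already
// oversized before the change can keep being modified.
function violatesFileSizeLimit($old_size, $new_size, $limit) {
  if (!$limit) {
    return false; // A limit of 0 means "unlimited".
  }
  if ($old_size > $limit) {
    return false; // Already over the limit before this change.
  }
  return ($new_size > $limit);
}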

src/applications/diffusion/engine/DiffusionCommitHookEngine.php
1384

The manpage doesn't explicitly promise that output is ordered identically to input as far as I can tell, but it is in practice, and this is clearly the primary/intended use case for the flag. We've used cat-file --batch-check elsewhere for years, also assuming order, without issues.

1392–1395

Yeah, I anticipate moving this to a LowLevelQuery sort of class in the next change or two.

This revision is now accepted and ready to land. Nov 20 2018, 2:09 AM
This revision was automatically updated to reflect the committed changes.