
Garbage Collect Old Differential Files
Closed, Wontfix (Public)

Description

It would seem that when Phabricator is used for C++ development, a LOT of binary files wind up in Files as a result of Differential. Sometimes these need to be sent up (and sometimes --skip-binaries doesn't actually skip .a and .o files, but that's another matter).

I literally have ten pages of these from the past few months, so they are obviously sucking up disk space! There aren't any batch-delete functions, of course, and deleting them one...at...a...time is too time-consuming.

It would be fantastically helpful (and probably earth-shatteringly vital before long) if Phabricator could automatically clean up Differential binaries older than a given <time period>. There has to be a way to mark that they were uploaded via Arcanist, and slate them for removal.

Event Timeline

CodeMouse92 updated the task description.
CodeMouse92 added a subscriber: CodeMouse92.

See:

https://secure.phabricator.com/book/phabcontrib/article/feature_requests/#hypotheticals

> probably earth-shatteringly vital before long

How much disk space is this actually consuming? How quickly is it consuming storage?

And as a follow-up, why are you checking .a and .o files into the repository? This is very unusual, and I would expect you to encounter severe scalability problems with the size of the repository on disk long before you encountered problems with the total data size of stored files.

@epriestley, to answer the first question, we're looking at an average of 1MB total per Diff update, on a small project. With 2-5 diffs per day, you can imagine that adds up fast over several months. It is something we can handle right now, but it will obviously become a problem before long.

Regarding the second question, I'd prefer not to upload .a and .o files at all, but --skip-binaries is NOT working (potentially a separate issue). There are other binaries, such as images, that need to get sent up, but we don't need several copies floating around for all eternity.

> 1MB total per Diff update, on a small project. With 2-5 diffs per day, you can imagine that adds up fast over several months

Can't we compute that it doesn't add up very fast?

1MB / diff * 10 diffs / day * 365 days / year = 3,650 MB per year

The cost to store this data in Amazon S3 is about $0.40 per month. This isn't earth-shatteringly vital from our point of view.
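The storage estimate above can be reproduced with plain shell arithmetic. This is only a minimal sketch: the helper name `yearly_mb` is made up for illustration (it is not part of arc or Phabricator), and the inputs are the figures quoted in this thread.

```shell
# Hypothetical helper, not part of arc or Phabricator:
# yearly storage in MB for a given MB-per-diff ($1) and diffs-per-day ($2).
yearly_mb() {
  echo $(( $1 * $2 * 365 ))
}

yearly_mb 1 10   # the generous 10 diffs/day used above: prints 3650
yearly_mb 1 5    # the reported upper bound of 5 diffs/day: prints 1825
```

Rounding the reported 2-5 diffs per day up to 10 keeps the estimate on the pessimistic side, so the real figure for this project should be lower.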

Let's put it this way: that's on a four-file (as I said, small) project. We have projects that will be running closer to 400 files within a matter of months. 100x that amount in your calculation = 365,000 MB, or about 365 GB, per year.

Of course, if --skip-binaries worked on .a and .o files, as I said, you could dismiss this for the time being as a moot point. (And yes, I'm running the latest Arcanist).

If arc is trying to upload these files, that means you don't have VCS ignore rules configured for them, and suggests you're checking them into the repository. If you're doing this, you'll have $4 of storage costs and a 4GB repository a year from now. In your second scenario, you'll have a 365GB repository. How are you planning to work with this repository? Why are you checking these files in?

If arc is diffing ignored files, that's a separate problem. However, I haven't experienced this and can't reproduce it.

*Sigh* Again, I don't want the .a and .o files. I'll check the VCS settings again.

We use these .gitignore rules in libphutil/ to avoid this problem with the .a and .o files generated by XHPAST:

# XHPAST
/support/xhpast/*.a
/support/xhpast/*.o
/support/xhpast/parser.yacc.output
/support/xhpast/node_names.hpp
/support/xhpast/xhpast
/support/xhpast/xhpast.exe
/src/parser/xhpast/bin/xhpast

I'd expect projects which produce these files to have similar rules to avoid this problem.

For example, git itself ignores .a, .o and .s files, as well as .exes and various other build artifacts:

https://github.com/git/git/blob/master/.gitignore#L220

The expectation is that arc respects VCS ignore rules. If you have a case where it isn't, that's a serious problem -- show me how to reproduce it and I'll fix it.
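One way to check whether ignore rules actually cover build artifacts is to ask Git directly. The sketch below uses two standard Git commands, `git check-ignore` and `git ls-files`; the function names `is_ignored` and `tracked_but_ignored` are hypothetical wrappers invented for this example, and nothing here touches arc itself.

```shell
# Hypothetical wrappers around standard Git commands; run inside a working copy.

# Succeeds (exit 0) if an ignore rule matches the given untracked path.
# Note: by default check-ignore does not report files that are already
# tracked, because ignore rules do not apply to them.
is_ignored() {
  git check-ignore -q -- "$1"
}

# Lists files that are tracked even though an ignore rule matches them,
# for example an object file that was added with "git add -f".
tracked_but_ignored() {
  git ls-files --cached --ignored --exclude-standard
}
```

For example, `is_ignored support/xhpast/libxhpast.a && echo ignored` should print `ignored` in a checkout that carries the rules shown above, and any output from `tracked_but_ignored` points at files that were committed despite those rules.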

I've updated the ignore rules, and will let you know if this continues. So, I can assume then that arc diff --skip-binaries doesn't actually skip binaries unless Git is configured to ignore the same files? Seems strange, but... OK. *Shrug*

> So, I can assume then that arc diff --skip-binaries doesn't actually skip binaries

No, but I can't reproduce this issue.

$ git add -f support/xhpast/libxhpast.a 

$ git commit -am wip
[master d1e9627] WIP
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 support/xhpast/libxhpast.a
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean

$ arc diff HEAD^ --only
Uploaded binary data for "libxhpast.a".
Upload complete.
 PUSH STAGING  Pushing changes to staging area...
Total 0 (delta 0), reused 0 (delta 0)
To ssh://dweller@secure.phabricator.com/diffusion/STAGING/staging.git
 * [new tag]         37b88eb43565a3ee6646638fddde0d2925a2c2a0 -> phabricator/diff/34384
 STAGING PUSHED  Pushed a copy of the changes to tag "phabricator/diff/34384" in the staging area.
Created a new Differential diff:
        Diff URI: https://secure.phabricator.com/differential/diff/34384/

Included changes:
  A (bin) support/xhpast/libxhpast.a

$ arc diff HEAD^ --only --skip-binaries
 PUSH STAGING  Pushing changes to staging area...
Counting objects: 30, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (29/29), done.
Writing objects: 100% (30/30), 193.31 KiB | 0 bytes/s, done.
Total 30 (delta 22), reused 0 (delta 0)
To ssh://dweller@secure.phabricator.com/diffusion/STAGING/staging.git
 * [new tag]         37b88eb43565a3ee6646638fddde0d2925a2c2a0 -> phabricator/diff/34383
 STAGING PUSHED  Pushed a copy of the changes to tag "phabricator/diff/34383" in the staging area.
Created a new Differential diff:
        Diff URI: https://secure.phabricator.com/differential/diff/34383/

Included changes:
  A (bin) support/xhpast/libxhpast.a

In the first diff, without --skip-binaries, note these lines:

Uploaded binary data for "libxhpast.a".
Upload complete.

They are not present in the second diff, with --skip-binaries. So I can't reproduce this issue.

So, it's a Schroedinbug. What fun.

Regardless, however low priority it may be, the originally mentioned feature might still have some use, outside of the .a/.o issue.

epriestley claimed this task.

I'm not aware of any actual problem faced by installs today that the proposed feature is the best solution to -- nor can I imagine such a problem, even hypothetically -- so I don't plan to pursue it.

In particular, because arc respects VCS ignore rules, any accumulation of data in Differential implies a similar accumulation of data in repositories, but repositories scale far more poorly and at higher cost than file storage. The solution here is always to fix VCS ignore rules, because accumulating this data in repositories is a much larger problem in the long term than accumulating it in Files.
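If artifacts were committed before the ignore rules existed, adding the rules alone does not untrack them. A minimal sketch of one possible cleanup follows; `untrack_ignored` is a hypothetical helper name, it assumes a Git working copy and plain file names (no newlines or characters that Git would quote), and it only uses the standard `git ls-files` and `git rm --cached` commands.

```shell
# Hypothetical helper: stop tracking files that an ignore rule already matches.
# The files stay on disk; only their index entries are removed, so the
# removal still has to be committed afterwards.
untrack_ignored() {
  paths=$(git ls-files --cached --ignored --exclude-standard) || return 1
  [ -n "$paths" ] || return 0   # nothing to do
  printf '%s\n' "$paths" | while IFS= read -r path; do
    git rm --cached --quiet -- "$path"
  done
}
```

After running it, `git status` should show the artifacts as deleted from the index, and later diffs should no longer pick them up. Copies already stored in history are not affected.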