
Remove the `bin/files purge` workflow
Closed, Resolved · Public

Description

It seems that bin/files purge fails when trying to run on large file uploads. In the example below, F125732 is a 1GB gzipped tarball.

/usr/local/src/phabricator/bin/files --trace purge F125732
>>> [2] <connect> phabricator_file
<<< [2] <connect> 3,361 us
>>> [3] <query> SELECT `f`.* FROM `file` f  WHERE (f.id IN (125732))   ORDER BY `f`.`id` DESC 
<<< [3] <query> 632 us
>>> [4] <query> SELECT edge.* FROM `edge` edge WHERE (edge.src IN ('PHID-FILE-jv4mivvrsnubujfig6vj')) AND (edge.type IN ('26')) ORDER BY edge.dateCreated DESC, edge.seq DESC
<<< [4] <query> 648 us
>>> [5] <query> SELECT * FROM `file_transformedfile` WHERE transformedPHID IN ('PHID-FILE-jv4mivvrsnubujfig6vj') 
<<< [5] <query> 615 us
>>> [6] <s3> 
<<< [6] <s3> 29,027,285 us
>>> [7] <kvcache-get> 
>>> [8] <connect> phabricator_cache
<<< [8] <connect> 2,070 us
>>> [9] <query> SELECT * FROM `cache_general` WHERE cacheKeyHash IN ('BnTlrCiB63YN')
<<< [9] <query> 653 us
<<< [7] <kvcache-get> 4,065 us
>>> [10] <kvcache-get> 
>>> [11] <query> SELECT * FROM `cache_general` WHERE cacheKeyHash IN ('BnTlrCiB63YN')
<<< [11] <query> 658 us
<<< [10] <kvcache-get> 921 us

mmap() failed: [12] Cannot allocate memory

mmap() failed: [12] Cannot allocate memory
PHP Fatal error:  Out of memory (allocated 1740173312) (tried to allocate 1721295440 bytes) in /usr/local/src/phabricator/src/applications/files/engine/PhabricatorFileStorageEngine.php on line 354

Fatal error: Out of memory (allocated 1740173312) (tried to allocate 1721295440 bytes) in /usr/local/src/phabricator/src/applications/files/engine/PhabricatorFileStorageEngine.php on line 354


Event Timeline

epriestley triaged this task as Wishlist priority. Aug 2 2017, 2:53 PM
epriestley added a subscriber: epriestley.

We should probably just remove this workflow; it's not clear to me that there's ever any reason for anyone to run bin/files purge in the modern codebase.

epriestley renamed this task from `bin/files purge` can run out of memory for large files to Remove the `bin/files purge` workflow. Aug 3 2017, 2:53 PM

I wonder if this OOM error can also be hit in other workflows, though, given that it seems to occur in PhabricatorFileStorageEngine::getRawFileDataIterator.

I believe the location of the OOM is a little misleading -- the real problem is loadFileData(). This returns the entire file as a string, and will thus:

  • Always OOM if the file is larger than the configured memory limit.
  • Always fail if the file is larger than 2GB, because PHP does not support strings larger than 2GB.

Exactly where this OOM happens is somewhat arbitrary, but it will certainly happen in PhabricatorFile->loadDataFromIterator() (called from loadFileData()) if it doesn't happen before then, since that method will attempt to put, e.g., 3GB of data into a single string variable if a file is 3GB in size. That can't work: it's bigger than the 2GB string limit, and will hit the memory limit first if the memory limit is smaller than 2GB.

I don't believe there's an issue with actual iteration-based access to file data, which should (and, I believe, does) use a constant amount of memory regardless of the file size (roughly 8MB for files of arbitrary size). But there is definitely a problem with calling loadFileData() on large files, because the API for this method ("return the whole data as a single string") fundamentally can never work with arbitrarily large files.
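
To make the contrast concrete, here's a rough sketch of the two access patterns (the scaffolding is illustrative, not code from any particular workflow; loadFileData() and getFileDataIterator() are the real PhabricatorFile methods):

// Loads the entire file into one string. Memory use scales with file size,
// and this can never work for files larger than PHP's 2GB string limit.
$data = $file->loadFileData();

// Iterates over the file in small pieces. Each piece is bounded in size, so
// memory use stays roughly constant (on the order of 8MB) no matter how
// large the file is.
foreach ($file->getFileDataIterator() as $piece) {
  echo $piece;
}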

You can verify that the memory usage behavior is reasonable with iteration-based access to file data like this:

  • Upload a large file (note that it must use the chunked storage engine -- a very old file like this one may be stored as a single giant blob).
  • Make a change to bin/files cat (which uses iteration-based access) like this, so it reports memory usage:
diff --git a/src/applications/files/management/PhabricatorFilesManagementCatWorkflow.php b/src/applications/files/management/PhabricatorFilesManagementCatWorkflow.php
index 70d5eca5a2..ce2e87cc4d 100644
--- a/src/applications/files/management/PhabricatorFilesManagementCatWorkflow.php
+++ b/src/applications/files/management/PhabricatorFilesManagementCatWorkflow.php
@@ -60,9 +60,12 @@ final class PhabricatorFilesManagementCatWorkflow
       $file->setIntegrityHash(null);
     }
 
+    $this->reportMemory('Begin');
+
     try {
       $iterator = $file->getFileDataIterator($begin, $end);
       foreach ($iterator as $data) {
+        $this->reportMemory('Writing Data');
         echo $data;
       }
     } catch (PhabricatorFileIntegrityException $ex) {
@@ -74,7 +77,18 @@ final class PhabricatorFilesManagementCatWorkflow
           $ex->getMessage()));
     }
 
+    $this->reportMemory('Done');
+
     return 0;
   }
 
+  private function reportMemory($where) {
+    fprintf(
+      STDERR,
+      "Usage %s / %s @ %s\n",
+      phutil_format_bytes(memory_get_usage()),
+      phutil_format_bytes(memory_get_peak_usage()),
+      $where);
+  }
+
 }
  • Run bin/files cat Fxxxx | wc -c. You're looking for: memory usage reported on STDERR, and wc reporting that it saw a very large number of bytes (the entire file) on stdout after execution finishes.

I get output like this:

$ ./bin/files cat F31236 | wc -c
Usage 13 MB / 13 MB @ Begin
Usage 26 MB / 30 MB @ Writing Data
Usage 26 MB / 34 MB @ Writing Data
<...snip many identical lines...>
Usage 26 MB / 34 MB @ Writing Data
Usage 25 MB / 34 MB @ Writing Data
Usage 20 MB / 34 MB @ Writing Data
Usage 20 MB / 34 MB @ Done
 300000000

That is, PHP peaked at 34MB of memory usage and held steady at 26MB for most of its execution, but wc saw ~300MB of data. I think this is consistent with the claim that iteration-based access uses a constant amount of memory, independent of the size of the file being iterated over.

Since I have encryption enabled, I think we'd expect memory usage to be about 16MB (8MB for the buffer that holds the raw data from the engine, plus 8MB for the buffer that holds the decrypted version), and this is in the ballpark of what we actually see (13MB peak for the runtime before we do any file stuff, rising by 21MB to 34MB once we start doing file stuff). The 21MB is a little more than the 16MB I'd predict, but it's possible we have an extra copy of a buffer somewhere and could shave this by 4MB or 8MB, and maybe the theoretical peak for iteration-based access is really 24MB or 32MB right now -- but theory and practice seem to be aligned pretty well in this experiment.

In contrast, the loadFileData() method does not iterate, and does not use constant memory. However, not every call to loadFileData() is necessarily bad. For example, PhabricatorFileImageTransform calls it after checking that the data is no larger than 16MB. The chunk engine calls it to reconstruct a file from chunks, but knows that each chunk is small (less than 8MB). Many other callsites probably have reasonable behavior or cause no issues in practice, or are in legacy code that will be removed elsewhere and isn't worth cleaning up on its own (like T5258). We could go through each loadFileData() callsite and try to classify them and then fix however many of them could have bad behavior in the case of 2GB+ files, but I suspect this isn't worthwhile.

For example, this callsite is "wrong", in the sense that trying to view a 3GB XHProf profile will OOM:

https://secure.phabricator.com/source/phabricator/browse/master/src/applications/xhprof/controller/PhabricatorXHProfProfileController.php;71eaf3e8c4d61203ab2887ab1a294ea81bd6ae83$21

But we can't reasonably process such a profile in a stream-oriented way or hold the profile in memory anyway, and it's probably unrealistic to imagine we'd ever make this work (it represents a tremendous amount of effort to solve a problem no one experiences). Instead, we could make this controller test for files over some limit and raise a more tailored error before trying to load the file data. But this doesn't seem worthwhile to prioritize given that no one has ever encountered a problem with this and it seems reasonable to imagine no one "ever" will, for small values of "ever".
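
If we did want such a guard, a minimal sketch might look like this (the 16MB threshold and the error message are purely illustrative; getByteSize(), pht(), and phutil_format_bytes() are existing APIs):

// Illustrative only: refuse to load very large profiles up front instead of
// OOMing partway through loadFileData().
$limit = 16 * 1024 * 1024;
if ($file->getByteSize() > $limit) {
  throw new Exception(
    pht(
      'This profile is too large to load into memory (%s).',
      phutil_format_bytes($file->getByteSize())));
}
$data = $file->loadFileData();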

When we add arbitrary limits to things we tend to get feedback that the limit -- or even the idea that we'd have a limit -- is bad, too (for example, see: T8612, T7895). I mostly haven't been successful in convincing users that any limit we add to the software which anyone can actually hit is a reasonable limit. I think no one sets out to be unreasonable and when they hit a limit it feels like the software is working against them instead of with them, no matter how compelling the technical argument is for having limits.

In this specific case, I believe this (using iteration, similar to files cat ...) would fix the immediate problem (with a caveat, see below):

-        $file->loadFileData();
+        foreach ($file->getFileDataIterator() as $piece) {
+          // We're just making sure we can read all the pieces.
+        }

However, since I think the purge workflow has no real reason to exist in the modern codebase, a more complete fix is to remove it. The addition of new modes of data load failure has also made it more dangerous than it once was. Some ways it could delete useful data today are:

  • If a file is chunked and any individual chunk is bad, it will delete all chunks. However, a majority of the data might be recoverable if 7/8 chunks are good and only one chunk has been lost.
  • If the encryption keyring is misconfigured, it will delete all encrypted files. However, data might be completely recoverable by fixing the keyring.
  • If a storage engine is misconfigured, it will delete the metadata for files which may have data intact. This has always been an issue, but in the modern codebase recovery would become impossible if the data is stored encrypted (the metadata must be present to decrypt it).
  • In all cases, it could delete otherwise accessible data after a transient error (e.g., a momentary S3 service disruption or rate limit error could cause it to delete metadata or actual data). This was always a risk, but, as above, the risk is a bit greater now that metadata must be retained to decrypt files: losing it would make an even bigger mess than before.

A caveat on all of this is that it looks like the specific file you're working with might be old and stored as a single 1GB blob in S3, based on the single 30-second call to S3 in the --trace. If so, it predates the chunk engine and must have been uploaded in an environment with a larger memory limit (note that we explicitly remove any configured memory limit in the web workflow, so this is expected if it was uploaded via Conduit or the web UI). Thus, the iteration change above (for example) won't work in this case, because you'll still just get a single 1GB piece. You may be able to do this to re-chunk the file:

  • Remove the CLI memory limit in configuration.
  • Use bin/files encode to move the file to rot13 encoding.
  • Use bin/files encode to move it back to raw encoding.

This should leave you with a raw-encoded, chunked file. However, this is an off-label use of bin/files encode; there is currently no supported method for re-chunking older files.
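
For reference, that recipe would look something like this on the command line (I'm assuming --as is still the flag bin/files encode uses to select a storage format; check bin/files help encode to confirm):

$ ./bin/files encode --as rot13 F125732
$ ./bin/files encode --as raw F125732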

(This kind of question, where an "I wonder..." takes me more than an hour to respond to and benefits only one user/install, is also part of why we're tightening the rules around support.)

Thanks for the detailed response, I certainly didn't expect it.