Page MenuHomePhabricator

Support downloads of 2GB+ files with HTTPSFuture
Open, WishlistPublic

Description

Previously, see T12907.

Downloading a >2GB file via HTTPSFuture currently fails. Normally this isn't much of an issue since regular users rarely interact with arc in a way that prompts it to download 2GB files, but this operation is required by export and migration operations in the Phacility cluster (see T13393 and PHI1329 for recent examples). We could switch the transfer channel to something else, but this limitation is silly and we should overcome it even if a different channel is ultimately more suitable for these particular operations down the line.

This is largely a combination of two related issues:

  • We hold the entire response in memory.
  • We hold the entire response in memory in a string, and PHP has a 2GB string limit.

The fix is conceptually straightforward: stream responses, include options for buffering/streaming them directly to disk, and probably swap the internal storage from a string to a rope in the case that we aren't disk-buffering.

This is somewhat tricky because the available primitives for doing this are limited and it amounts to requiring we have a streaming HTTP parser. We do now, after D19011, but it's not yet clear if this parser is actually functional or robust.

Event Timeline

My tentative plan is to add methods for sending the output to disk to HTTPSFuture, then go down the new parser pathway only if we're writing to disk. This should limit the amount of surface area exposed on the new parser.

I'm mostly concerned about weird, hard-to-test edge cases with the new parser in the general vein of the HTTP/2 proxy issue in T13391. Swapping the parser out wholesale could lead to hard-to-debug failures where unusual proxy setups aren't handled properly. If the workflow is functionally opt-in for now, we can deal with that at a later date once there's an established body of evidence that the parser works properly in known environments.

This is something of an aside, but it would be nice to formalize PhutilConsoleProgressBar into a generic progress sink. A lot of bin/storage dump-related stuff could use this and bin/host download could obviously use it, and we likely have some use cases for reporting progress to the web via the API, but PhutilConsoleProgressBar lacks an indirection layer to really make this work cleanly.

epriestley added a revision: Restricted Differential Revision.Aug 29 2019, 7:03 PM
epriestley added a commit: Restricted Diffusion Commit.Aug 29 2019, 7:20 PM
epriestley added a revision: Restricted Differential Revision.Aug 30 2019, 2:12 AM
epriestley added a commit: Restricted Diffusion Commit.Aug 30 2019, 2:12 AM

The operation in PHI1329 (against a ~8GB export) went through cleanly. Remaining work here is:

  • D20752 introduced a dynamic timeout to the request, which is required because an 8GB download may take more than the default timeout (of ~300s today). This should applied to arc download if it hasn't been yet, and/or both bin/host download and arc download should use some ArcanistFileDownloader abstraction.
  • HTTPSFuture should be realigned to only use the streaming parser. It seems to work correctly in this case, but I'm not completely confident it gets cases like Location redirects, proxies, and empty responses correct. It's realistic to give it test coverage and include redirect / proxy / empty test cases.
epriestley lowered the priority of this task from Normal to Wishlist.Sep 2 2019, 1:11 PM