Unable to download large files through alternative file domain
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	joshuaspence
	Jul 9 2017, 11:55 PM

Description

I'm playing around with setting up an alternate file domain for our Phabricator installation using CloudFront. Whilst doing so, I uploaded a dummy 1GB file to this installation and attempted to download the file immediately afterwards using arc download. My reason for doing so was in an attempt to understand whether arbitrary files would be served through the CDN. In doing so, however, I hit an exception:

> arc download --conduit-uri https://secure.phabricator.com --trace F5039656
 ARGV  '/home/josh/workspace/github.com/phacility/arcanist/bin/../scripts/arcanist.php' 'download' '--conduit-uri' 'https://secure.phabricator.com' '--trace' 'F5039656'
 LOAD  Loaded "phutil" from "/home/josh/workspace/github.com/phacility/libphutil/src".
 LOAD  Loaded "arcanist" from "/home/josh/workspace/github.com/phacility/arcanist/src".
Config: Reading user configuration file "/home/josh/.arcrc"...
Config: Did not find system configuration at "/etc/arcconfig".
Working Copy: Reading .arcconfig from "/home/josh/workspace/REDACTED/.arcconfig".
Working Copy: Path "/home/josh/workspace/REDACTED" is part of `git` working copy "/home/josh/workspace/REDACTED".
Working Copy: Project root is at "/home/josh/workspace/REDACTED".
Config: Did not find local configuration at "/home/josh/workspace/REDACTED/.git/arc/config".
Loading phutil library from '/home/josh/workspace/REDACTED/support/flarc/src'...
>>> [0] <conduit> user.whoami() <bytes = 117>
>>> [1] <http> https://secure.phabricator.com/api/user.whoami
<<< [1] <http> 777,690 us
<<< [0] <conduit> 778,620 us
>>> [2] <conduit> file.search() <bytes = 171>
>>> [3] <http> https://secure.phabricator.com/api/file.search
<<< [3] <http> 252,815 us
<<< [2] <conduit> 252,933 us
 DATA  Downloading "F5039656" (1,073,741,824 byte(s))...
>>> [4] <http> https://p.phcdn.net/file/data/@secure/vnfphk6aw5b4pl52nu7l/PHID-FILE-gjgmogxhmwkz6ftww7ru/test.img
<<< [4] <http> 43,564,789 us
Getting file information...
>>> [5] <conduit> file.info() <bytes = 138>
>>> [6] <http> https://secure.phabricator.com/api/file.info
<<< [6] <http> 759,008 us
<<< [5] <conduit> 759,176 us
Downloading file 'test.img' (1,073,741,824 bytes)...
>>> [7] <conduit> file.download() <bytes = 169>
>>> [8] <http> https://secure.phabricator.com/api/file.download
<<< [8] <http> 13,098,890 us
<<< [7] <conduit> 13,099,005 us

[2017-07-09 23:48:44] EXCEPTION: (HTTPFutureHTTPResponseStatus) [HTTP/500] Internal Server Error
>>> UNRECOVERABLE FATAL ERROR <<<

Out of memory (allocated 5752750080) (tried to allocate 1431655909 bytes)

/core/lib/libphutil/src/utils/utils.php:1168


┻━┻ ︵ ¯\_(ツ)_/¯ ︵ ┻━┻ at [<phutil>/src/future/http/BaseHTTPFuture.php:339]
arcanist(head=master, ref.master=3c4735795a29), flarc(), phutil(head=stable, ref.master=310a831f97ab, ref.stable=62a2730aefc0)
  #0 BaseHTTPFuture::parseRawHTTPResponse(string) called at [<phutil>/src/future/http/HTTPSFuture.php:422]
  #1 HTTPSFuture::isReady() called at [<phutil>/src/future/Future.php:37]
  #2 Future::resolve(NULL) called at [<phutil>/src/future/FutureProxy.php:34]
  #3 FutureProxy::resolve() called at [<phutil>/src/conduit/ConduitClient.php:64]
  #4 ConduitClient::callMethodSynchronous(string, array) called at [<arcanist>/src/workflow/ArcanistDownloadWorkflow.php:264]
  #5 ArcanistDownloadWorkflow::run() called at [<arcanist>/scripts/arcanist.php:394]

Revisions and Commits

rPHU libphutil
		D19011	rPHUe1f7cfefe880 Provide a streaming HTTP response parser
rP Phabricator
	Changes Planned		D19548 Possibly increase the width of the S3 -> Files -> Client pipeline
		D19547	rP58e8d3c13479 (stable) Remove the execution time limit (if any) before sinking HTTP responses
		D19547	rP81a7c9ac43f2 Remove the execution time limit (if any) before sinking HTTP responses

Related Objects

Mentioned In: T13396: Support downloads of 2GB+ files with HTTPSFuture
T13393: Improve repository shard migration pathway in the shared cluster
T13164: Plans: 2018 Week 31 - 33 Bonus Content
T12999: Replace cluster magnetic volumes with SSD volumes
T4369: Phabricator HTTP repository hosting has fairly severe scalability limits
T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table
Mentioned Here: T13352: Very large objects may not be distributable via CDN (CloudFront has 20GB object limit)
T13396: Support downloads of 2GB+ files with HTTPSFuture
T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table

Event Timeline

joshuaspence created this task.Jul 9 2017, 11:55 PM

Herald added a subscriber: chad. · View Herald TranscriptJul 9 2017, 11:55 PM

This is not a test install.

In T12907#228951, @chad wrote:

This is not a test install.

I apologize. I though it was okay to use this install for reproducing issues. From Contributing Bug Reports:

Can you reproduce the issue on secure.phabricator.com?

Were you reproducing a local issue? That wasn't my impression, unless I'm misreading it. It seemed specific to secure.

Ah sorry, I should've made that clear. Yes, I am able to reproduce the issue on our installation.

👍🏻

joshuaspence renamed this task from Unable to download large files from secure.phabricator.com to Unable to download large files through alternative file domain.Jul 10 2017, 12:12 AM

epriestley removed a project: Phacility.Jul 10 2017, 1:53 AM

epriestley mentioned this in T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table.Oct 2 2017, 3:14 PM

In the log above, this download should have worked:

>>> [4] <http> https://p.phcdn.net/file/data/@secure/vnfphk6aw5b4pl52nu7l/PHID-FILE-gjgmogxhmwkz6ftww7ru/test.img
<<< [4] <http> 43,564,789 us

It didn't, and we fell back to file.download, which failed more explosively. That's expected, since the old method is bad and the new one is expected to fix this.

If I download the file locally with curl, I get sporadic errors somewhere in the download:

curl: (56) SSLRead() return error -9806

However, this is apparently some kind of Apple + cURL issue, and wget downloads it successfully. Likewise, curl from secure001 downloads it successfully, as does curl from secure001 using a host header and the local IP address.

If I change arc download to log the exception from the plain fetch, I get this, which is cURL raising the same error internally:

[2017-10-05 19:34:26] EXCEPTION: (Exception) Got HTTP 56 status response, expected HTTP 200. at [<arcanist>/src/workflow/ArcanistDownloadWorkflow.php:187]
arcanist(head=experimental, ref.master=76b54ce0a9af, ref.experimental=ba12102eb6ee), phutil(head=stable, ref.master=9f9c33797a3e, ref.stable=8f4eee3bfbbb)
  #0 phlog(Exception) called at [<arcanist>/src/workflow/ArcanistDownloadWorkflow.php:245]
  #1 ArcanistDownloadWorkflow::run() called at [<arcanist>/scripts/arcanist.php:420]

Basically every post on the internet says the "fix" for this is to reinstall curl built against OpenSSL instead of SecureTransport, often using brew. I didn't try that, but it probably works.

I'm still on Yosemite so maybe this is fixed in newer versions, and since the path in the original post is /home (Linux) rather than /Users (macOS nee OS X) I'd guess this isn't the problem, but I can't reproduce anything else. Perhaps useful would be the output of:

$ curl -v https://p.phcdn.net/file/data/@secure/vnfphk6aw5b4pl52nu7l/PHID-FILE-gjgmogxhmwkz6ftww7ru/test.img > test.img
$ wget https://p.phcdn.net/file/data/@secure/vnfphk6aw5b4pl52nu7l/PHID-FILE-gjgmogxhmwkz6ftww7ru/test.img

This is probably not the issue I hit in T13000 with pulling data off admin.

Pulling stuff from admin.phacility.com via Cloudfront, I get this instead:

curl: (18) transfer closed with 10195795 bytes remaining to read

If I use curl -v -H 'Host: blah.cloudfront.net' http://<ip>/ instead, it works fine, which suggests this is a Cloudfront or LB issue.

If I use curl -v -H 'Host: blah.cloudfront.net' https://admin.phacility.com/ instead, it also works fine, which suggests Cloudfront specifically.

However, Cloudfront claims to work for objects up to 20GB here: https://aws.amazon.com/blogs/aws/amazon-cloudfront-support-for-20-gb-objects/

That's talking more about caching I think, but if it can cache 20GB files it must also be able to serve them in the first place.

wget also gets interrupted, although it is able to retry with a Range header and complete the download. We could reasonably adopt this strategy.

It doesn't seem like we're being killed by an explicit timeout, since the amount of time before the connection dies is not consistent (nor are the number of bytes read).

I got two failures around 1m 30s but I don't think they were all around there. I cleared my scrollback so I can't check.

It's also working fine now, but the download is only taking 1m4s.

I bumped the "Origin Response Timeout" for admin.phacility.com to the maximum value (60s) since that's just about the only quasi-related value available to configure. Conceivably, the old value (30s) might possibly have been hit if S3 was really slow when reading some chunks of a file? And now S3 is warm so the download is going faster and not failing? Although the edit is still "In Progress" so I don't know if it has taken effect yet or not.

If I use curl --limit-rate 10M ... to force the download to go slower again, it now works properly even though it takes several minutes.

So basically it was haunted or something. If we continue seeing this, we could probably make the download flow retry with a "Range" header like wget does, since that appeared to work.

iiam

For consistency, I gave the content Cloudfront distribution a 60-second timeout as well.

On the bin/files download side of things, we currently buffer the whole file into a string and then hit this:

PHP Fatal error:  Out of memory (allocated 3850108928) (tried to allocate 18446744073259815775 bytes) in /core/lib/libphutil/src/future/http/HTTPSFuture.php on line 417
Fatal error: Out of memory (allocated 3850108928) (tried to allocate 18446744073259815775 bytes) in /core/lib/libphutil/src/future/http/HTTPSFuture.php on line 417

This is here, where it tries to find/replace a 2GB body:

$result = preg_replace('/^(.*\r\n\r\n){'.$redirects.'}/sU', '', $result);

This needs to become a stream-oriented parser; ideally one that can stream files to disk.

HTTPSFuture already implements read() and discardBuffers() so it could be implemented as a FutureProxy, but (a) these methods probably don't actually work and (b) that seems kind of silly and obnoxious.

epriestley mentioned this in T4369: Phabricator HTTP repository hosting has fairly severe scalability limits.Oct 10 2017, 4:31 PM

epriestley added a revision: D19011: Provide a streaming HTTP response parser.Feb 7 2018, 12:14 AM

epriestley added a commit: rPHUe1f7cfefe880: Provide a streaming HTTP response parser.Feb 16 2018, 7:09 PM

epriestley mentioned this in T12999: Replace cluster magnetic volumes with SSD volumes.Apr 10 2018, 10:47 PM

amckinley added a subscriber: amckinley.Apr 11 2018, 6:10 PM

A recent, probably more concrete issue that I've seen is that wget of a large file can succeed (wget automatically resumes) but we can end up with a file that has different content than the original, which is disastrously bad.

I'm assuming this is an issue with Range headers rather than a CloudFront issue. I'm going to see if I can reproduce it and narrow it down, as it's way worse than the interruption/resume issue.

I made a change like this to split responses into small blocks and give each a random chance of failing (I tried 4096 and 4027, a prime near 4KB, as miniblock sizes):

diff --git a/src/aphront/sink/AphrontHTTPSink.php b/src/aphront/sink/AphrontHTTPSink.php
index 11c6466cf..7b41b2393 100644
--- a/src/aphront/sink/AphrontHTTPSink.php
+++ b/src/aphront/sink/AphrontHTTPSink.php
@@ -114,11 +114,18 @@ abstract class AphrontHTTPSink extends Phobject {
 
     $abort = false;
     foreach ($data as $block) {
-      if (!$this->isWritable()) {
-        $abort = true;
-        break;
+      $mini_blocks = str_split($block, 4027);
+      foreach ($mini_blocks as $mini_block) {
+        if (mt_rand(0, 299) == 0) {
+          die(0);
+        }
+
+        if (!$this->isWritable()) {
+          $abort = true;
+          break;
+        }
+        $this->writeData($mini_block);
       }
-      $this->writeData($block);
     }
 
     $response->didCompleteWrite($abort);

I couldn't get wget to butcher the file across many invocations of:

wget --verbose --waitretry 0 -O - http://local.phacility.com/file/data/@local/rxambpawebwiwvnk566n/PHID-FILE-z7dj7esovuu3uosq7eer/12MB.txt | md5

...which is making me more and more suspicious of CloudFront -- but surely CloudFront isn't just super broken for big files? That can't possibly be the case, right?

I can't get anything to break at all now. 💁

epriestley mentioned this in T13164: Plans: 2018 Week 31 - 33 Bonus Content.Jul 30 2018, 3:26 PM

We have a more reliable break for this now, in the form:

curl --verbose -H 'Host: d27urz3c38hyx4.cloudfront.net'  https://admin.phacility.com/file/.../50GB.gz > /dev/null

One issue is straightforward:

[Mon Jul 30 16:09:18.032187 2018] [:error] [pid 1927] [client 172.30.0.59:38392] >>> UNRECOVERABLE FATAL ERROR <<<\n\nMaximum execution time of 30 seconds exceeded\n\n/core/lib/libphutil/src/future/http/HTTPSFuture.php:381\n\n\n\xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb \xef\xb8\xb5 \xc2\xaf\\_(\xe3\x83\x84)_/\xc2\xaf \xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb

This also possibly appears to be segfaulting apache, which is less good/fixable, but maybe it's coincidental or unrelated:

[Mon Jul 30 16:07:24.202816 2018] [core:notice] [pid 29664] AH00051: child pid 856 exit signal Segmentation fault (11), possible coredump in /etc/apache2

epriestley added a revision: D19547: Remove the execution time limit (if any) before sinking HTTP responses.Jul 30 2018, 4:45 PM

epriestley added a commit: rP81a7c9ac43f2: Remove the execution time limit (if any) before sinking HTTP responses.Jul 30 2018, 5:56 PM

epriestley added a commit: rP58e8d3c13479: (stable) Remove the execution time limit (if any) before sinking HTTP responses.

epriestley added a revision: D19548: Possibly increase the width of the S3 -> Files -> Client pipeline.Jul 30 2018, 7:51 PM

epriestley mentioned this in T13393: Improve repository shard migration pathway in the shared cluster.Aug 29 2019, 3:04 PM

epriestley mentioned this in T13396: Support downloads of 2GB+ files with HTTPSFuture.Aug 29 2019, 3:51 PM

This task covers a lot of ground and many of the issues have been resolved. There are two remaining issues which are more narrowly covered by these followups:

Unable to download large files through alternative file domainClosed, ResolvedPublicActions

Description

Revisions and Commits

Related Objects

Event Timeline

Unable to download large files through alternative file domain
Closed, ResolvedPublic
Actions