Page MenuHomePhabricator

"bin/host upload" and the Uploader wrapper should retry chunks on HTTP/504, and perhaps other status codes
Open, LowPublic

Description

We occasionally get HTTP/504 responses from AWS for no obvious reason, i.e. there's no clear source of extraordinary load on the underlying service. In theory, HTTP/503 and HTTP/429 might also merit retrying after a delay. HTTP technically supports a Retry-After header, as well.

Normally none of this matters much since these responses currently feel rare/sporadic, but the reliability of bin/host upload for very large files (e.g., MySQL dumps) is materially impacted (I just had a 7GB upload hit a 504 and abort, cleaning up the underlying temporary file).

ArcanistFileUploader should retry failed parts, and treat 504 as a retryable failure.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

epriestley created this task.

I have a patch for this, but I'm not thrilled about the retry model. Maybe better would be for the caller retry the actual upload operation (which will automatically resume) and bail out while retaining the temporary file. Even if we retry on 504, we lose a lot of progress if there's a service interruption for longer than we're willing to sit in a retry loop.

For posterity, bleugh:

commit 63ffce7e0c01df9c59062951d354c4eeed222a3e
Author: epriestley <git@epriestley.com>
Date:   Thu Aug 29 13:30:38 2019 -0700

    WIP

diff --git a/src/upload/ArcanistFileUploader.php b/src/upload/ArcanistFileUploader.php
index daf4136d..7faad1d0 100644
--- a/src/upload/ArcanistFileUploader.php
+++ b/src/upload/ArcanistFileUploader.php
@@ -262,14 +262,50 @@ final class ArcanistFileUploader extends Phobject {
     foreach ($remaining as $chunk) {
       $data = $file->readBytes($chunk['byteStart'], $chunk['byteEnd']);
 
-      $conduit->callMethodSynchronous(
-        'file.uploadchunk',
-        array(
-          'filePHID' => $file_phid,
-          'byteStart' => $chunk['byteStart'],
-          'dataEncoding' => 'base64',
-          'data' => base64_encode($data),
-        ));
+      $failures = 0;
+      $maximum_failures = 3;
+      while (true) {
+        try {
+          $conduit->callMethodSynchronous(
+            'file.uploadchunk',
+            array(
+              'filePHID' => $file_phid,
+              'byteStart' => $chunk['byteStart'],
+              'dataEncoding' => 'base64',
+              'data' => base64_encode($data),
+            ));
+          break;
+        } catch (HTTPFutureHTTPResponseStatus $ex) {
+          $failures++;
+
+          if ($failures >= $maximum_failures) {
+            throw $ex;
+          }
+
+          // See T13397. If we get status codes which may be temporary, wait
+          // for a moment and retry.
+          switch ($ex->getStatusCode()) {
+            case 429:
+            case 503:
+            case 504:
+              sleep(10);
+              break;
+            default:
+              throw $ex;
+          }
+        } catch (ConduitClientException $ex) {
+          // TODO: This is grotesque, but it's possible that we may upload
+          // a chunk and then suffer some kind of request failure. Treat this
+          // error as a success indicator. The API should not require string
+          // matching against error text.
+          switch ($ex->getMessage()) {
+            case "ERR-CONDUIT-CORE: Chunk has already been uploaded.":
+              break 2;
+            default:
+              throw $ex;
+          }
+        }
+      }
 
       $progress->update(1);
     }
epriestley added a revision: Restricted Differential Revision.Aug 29 2019, 9:14 PM
epriestley added a commit: Restricted Diffusion Commit.Aug 29 2019, 9:17 PM
epriestley added a commit: Restricted Diffusion Commit.

A generally cleaner version of this would also fix the parallelization TODO (just above the patch in the previous comment). This probably needs FutureIterator to be better at managing in-flight changes to the working set.

An even cleaner version would also get rid of the base64 stuff (see T13337).