
Investigate browser-upload-to-S3 to reduce load on web pool during large file uploads
Open, Needs Triage, Public

Description

This is super-neat and I've always wanted to use it: http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html

Basically, we generate and sign a JSON policy blob with rules about the destination path, maximum upload size, expiration time, etc., which a client can then use as credentials to POST directly to S3 instead of going through the web pool. This wouldn't help for installations using non-S3 storage backends, but it would take some load off the SaaS cluster. arc upload could use this as well.
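
For reference, a minimal sketch of that flow using boto3's generate_presigned_post (not anything in Phabricator itself); the bucket name, key, size limit, and expiration are made-up example values:

```python
# Hedged sketch only: generate a signed POST policy with boto3 so the browser
# can upload straight to S3. Bucket, key, and limits are example values.
import boto3

s3 = boto3.client("s3")

post = s3.generate_presigned_post(
    Bucket="example-file-bucket",             # hypothetical bucket
    Key="phabricator/uploads/example.bin",    # destination path baked into the policy
    Conditions=[
        # Maximum upload size is enforced by S3 when the form is POSTed.
        ["content-length-range", 1, 1024 * 1024 * 1024],
    ],
    ExpiresIn=600,                            # policy expires after 10 minutes
)

# post["url"] and post["fields"] are handed to the client, which builds a
# multipart/form-data POST (policy fields + the file) directly against S3,
# bypassing the web pool entirely.
print(post["url"])
print(post["fields"])
```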

T12605 mentions large file uploads as one of the causes of long-lived, high-memory apache child processes, and this could potentially alleviate that.

Event Timeline

I suspect this isn't worth the complexity. With drag-and-drop, the maximum request size is 4MB (one chunk), we get a nice progress bar and resumable uploads, and the bucket can remain completely private. We can also encrypt and deduplicate file blocks.
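
For illustration, a rough sketch of the client-side chunking idea in Python; the real uploader is the Javascript drag-and-drop client talking to the Conduit file API, and the endpoint and "byteStart" parameter here are hypothetical:

```python
# Rough sketch of the chunking idea only; the real client is the Javascript
# drag-and-drop uploader. The endpoint and "byteStart" parameter are hypothetical.
import requests

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB per request, as described above
UPLOAD_URL = "https://phab.example.com/upload-chunk"  # hypothetical endpoint

def upload_in_chunks(path):
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # Each request carries at most 4MB, so per-request server memory is
            # bounded and a failed chunk can simply be retried (resumable).
            requests.post(UPLOAD_URL, params={"byteStart": offset}, data=chunk)
            offset += len(chunk)
            print("uploaded %d bytes so far" % offset)  # crude progress indicator
```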

Critically, we can't chunk files if we do one-shot uploads directly to S3. Then, when users download them, we have to write a bunch of "Range: bytes" logic to read the data back in blocks instead of buffering the whole thing in memory.
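
Something like this hedged sketch (boto3, with placeholder bucket and key names) is roughly what the download path would need to do to stream a one-shot object back in blocks via Range requests:

```python
# Hedged sketch: stream an S3 object back in fixed-size blocks using
# "Range: bytes" requests instead of buffering the whole file in memory.
# The bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
BLOCK = 4 * 1024 * 1024  # read 4MB at a time

def read_in_blocks(bucket, key):
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    offset = 0
    while offset < size:
        end = min(offset + BLOCK, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range="bytes=%d-%d" % (offset, end))
        yield resp["Body"].read()  # at most one block held in memory at a time
        offset = end + 1
```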

There are a couple of workflows which still don't use the fancy new chunking stuff, but we could convert them if they represent a meaningful fraction of web tier requests.

4MB isn't going to finish in 100ms, but it isn't going to take 5 hours either.

We could measure how common this is to see if it's actually taking up more resources than I think, but my guess is that this doesn't represent an appreciable fraction of load.

> nice progress bar

Well, progress indicator at least.

(We might draw an actual bar on arc upload?)

What would the appropriate query be? "Number of uploaded files with size greater than X as a percentage of all files/file bytes"?

Actually, it might be better to just test it empirically by throwing a bunch of concurrent uploads at a single instance and seeing how apache memory usage grows.
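
A quick-and-dirty version of that test might look like the sketch below; the upload URL and test file are placeholders (not a real Phabricator endpoint), and apache child memory would be watched separately on the host with top/ps:

```python
# Sketch of the empirical test: fire a bunch of concurrent uploads at one
# instance, then watch apache child memory on the host.
import concurrent.futures
import requests

UPLOAD_URL = "https://test-instance.example.com/upload"  # hypothetical endpoint
TEST_FILE = "bigfile.bin"  # e.g. a few hundred MB of random data
CONCURRENCY = 32

def one_upload(i):
    with open(TEST_FILE, "rb") as f:
        return requests.post(UPLOAD_URL, data=f).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for status in pool.map(one_upload, range(CONCURRENCY)):
        print(status)
```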

Maybe "duration of all file upload requests" / "duration of all requests". That isn't a perfect metric but should be a reasonable-ish proxy for real resources we care about, I think.

The memory usage should be pretty fixed at ~8MB/simultaneous chunk, or ~512MB if all 64 workers are uploading. We upload a 4MB chunk to PHP, then send that chunk to S3. Maybe we end up creating a couple more copies of it in memory somewhere and it's actually ~16MB/chunk, but it should be a fixed, small amount regardless of how large the upload is.