Page MenuHomePhabricator

Allow Phabricator to redirect to Amazon S3 for file downloads
AbandonedPublic

Authored by hach-que on Mar 31 2014, 5:37 AM.
Tags
None
Referenced Files
F14475568: D8655.diff
Fri, Dec 27, 7:33 PM
F14473889: D8655.id24556.diff
Fri, Dec 27, 3:13 PM
Unknown Object (File)
Fri, Dec 13, 8:05 AM
Unknown Object (File)
Thu, Dec 12, 11:05 AM
Unknown Object (File)
Wed, Dec 11, 3:17 PM
Unknown Object (File)
Thu, Dec 5, 2:07 PM
Unknown Object (File)
Sun, Dec 1, 4:52 PM
Unknown Object (File)
Sat, Nov 30, 10:07 PM

Details

Summary

This allows users to configure Phabricator to redirect to Amazon S3 for file downloads. This allows bandwidth costs to be mitigated to Amazon S3 and means that Phabricator doesn't need to download data from S3 to serve it back to the user.

Test Plan

Tested on https://code.redpointsoftware.com.au/. Applied the patch, migrated all the file data, and played around with the configuration options (and tested both downloading files and requesting a Phragment list via getstate).

Diff Detail

Repository
rP Phabricator
Branch
public-uri
Lint
Lint Passed
Unit
Tests Passed

Event Timeline

hach-que retitled this revision from to Allow Phabricator to redirect to Amazon S3 / CloudFront for file downloads.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.
hach-que edited edge metadata.

Append the name parameter if provided

Append the name parameter if provided, since S3 doesn't provide redirection rules that we can use to remove any trailing characters, nor does CloudFront. A user accessing files still needs to know the seed, so this doesn't allow users to access files based purely on the name..

Trim trailing slashes on config entry

Add option to configure whether to bypass data controller

This allows Phabricator to be configured to return URLs directly pointing to Amazon instead of redirecting through the data controller. Obviously this exposes raw links to Amazon through the UI and Conduit, but it allows methods such as phragment.getstate to return the public URIs (which means clients don't have to hit Phabricator to download lots of files).

epriestley edited edge metadata.

Throughout this diff, the use of "CloudFront" probably suggests an undesirable configuration. Particularly, the right way to configure Cloudfront (or another CDN) in the long term is to put the entire file domain behind it, so static resources (JS, CSS, images) also benefit from being served through Cloudfront (see T2382).

What's the value of storage.use-direct-public-uri? It seems somewhat hard for users to evaluate the tradeoffs, limits our ability to detect or correct errors, breaks if the storage engine is changed, etc., and we just save one redirect which will eventually be served by a CDN anyway. Am I missing stuff, or is this basically just a micro-optimization?

I'd suggest:

  • Remove references to CloudFront, since it encourages configuration of a CDN a layer below where it should actually go. When we get around to T2382, we'll walk users through setting up CloudFront or another CDN, and all resources will benefit.
  • If possible, make storage.s3.public-uri a boolean instead of a string, so users have a more difficult time misconfiguring it.
  • Remove storage.use-direct-public-uri unless there's a really compelling reason for it.

Then some practical/technical issues I hit:

  • Enabling "Static Web Hosting" doesn't make all the existing junk in my bucket public for me.
  • We store files in S3 as /aa/bb/cc/defghijklmnop, which is probably secret enough, but means that files download as defghijklmnop instead of dog.jpg, which is a (pretty significant?) usability issue.
    • We probably need to (a) publish to /aa/bb/cc/defghijklmnop/dog.jpg and possibly (b) include Content-Disposition information.
    • A downside of including Content-Disposition when publishing to S3 is that we can't change it later. For example, if users change configuration so PDFs are viewable in the browser, any PDFs which have already published will continue to download instead of display.
    • A further downside is that since we deduplicate content, if I upload a picture of you as "jerk.jpg", and you upload the same picture as "coolguy.jpg", your file will download as "jerk.jpg".
    • We could disable content duplication detection for S3 to get around this, although this adds complexity and implies tradeoffs.

As we dig into this, I wonder if T2382 might be a better fix? Using S3 as a file server instead of just a datastore is exposes more complexity than I'd initially thought.

src/applications/files/config/PhabricatorFilesConfigOptions.php
148–150

This caveat is a little misleading -- we always serve files without authentication if you know the data URI for a file, no matter what the file policy is. So this doesn't actually change anything in terms of policies: even without this flag enabled, someone who knows the data URI can access the file, regardless of policy settings.

Instead of having the user set this URI, can we figure it out ourselves (e.g., via the S3 API)? Are there reasons they would want some specific value here other than "yes/no"?

Is the process of enabling 'Static Web Hosting' obvious, impossible to get wrong, and free of side effects (like also enabling directory listings)?

Writing **NOTE:** will disable the specialized note formatting, just use NOTE:. Here are examples -- this is with asterisks:

NOTE: Some note.

This is without:

NOTE: Some note.
This revision now requires changes to proceed.Mar 31 2014, 3:52 PM

Oh, you did deal with the filename thing.

I think the 'name' parameter is not normalized, and needs to go through PhabricatorFile::normalizeFileName() to be usable in a URI. For example, it may contain #, ?, /../, etc.

hach-que edited edge metadata.

Unify configuration options

hach-que edited edge metadata.

Removed a CloudFront reference

hach-que retitled this revision from Allow Phabricator to redirect to Amazon S3 / CloudFront for file downloads to Allow Phabricator to redirect to Amazon S3 for file downloads.Mar 31 2014, 11:38 PM
hach-que updated this object.

I'm finding this causes a few problems, especially because Amazon forces file downloads. This means it's impossible to link people to pictures uploaded on a Phabricator instance without them logging in.

I'm tempted to change this so that the direct public URI is mostly an internal thing, and we have an option in Phragment to use it when returning file URIs, or provide a Conduit API to resolve direct public URIs for a specified list of files. Exposing this functionality to all file links breaks a few workflows.

@epriestley Does this sound like a better solution?

Additionally, if this functionality was restricted to consumers of Conduit then it means we don't need to deal with appending the file name to Amazon storage or deal with the deduplication issue (as a consumer of Conduit, one assumes the program can save the file as whatever it likes).

Only use direct, public URLs for files on S3 in Phragment's Conduit APIs

epriestley edited edge metadata.

I think this no longer makes sense to ever bring upstream: we can now stream large files, and do not store large files as single S3 artifacts.

This revision now requires changes to proceed.May 13 2015, 1:55 PM

To be solved with T5517 instead.