Very large objects may not be distributable via CDN (CloudFront has 20GB object limit)
Open, Low, Public

Description

See PHI1357, where an instance is requesting a 21GB export.

CloudFront has a 20GB object limit:

Maximum file size for HTTP GET, POST, and PUT requests: 20 GB

https://docs.amazonaws.cn/en_us/AmazonCloudFront/latest/DeveloperGuide/cloudfront-limits.html

So files larger than 20GB cannot be served through CloudFront. The apparent behavior is that when the origin server responds with a "Content-Length" of more than 20GB, CloudFront immediately gives up and returns HTTP 400 to the client.
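As a rough sketch of that size check (not anything CloudFront or Phabricator actually exposes), the decision could look like the following. The helper name is hypothetical, and reading "20 GB" as 20 GiB is an assumption, since the quoted limit does not say which unit is meant.

```python
GIB = 1024 ** 3

# Assumption: "20 GB" is read as 20 GiB (20 * 1024^3 bytes). The quoted
# limits table does not say whether it means GB or GiB.
CDN_MAX_OBJECT_BYTES = 20 * GIB


def exceeds_cdn_limit(content_length: int) -> bool:
    """Return True if an object of this size is over the assumed CDN limit.

    Hypothetical helper for illustration only.
    """
    return content_length > CDN_MAX_OBJECT_BYTES
```

Under this assumption, the 21GB export from PHI1357 would be flagged (`exceeds_cdn_limit(21 * GIB)` is true), while an object of exactly 20 GiB would not.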

There's some theoretical value to serving arbitrarily large files through a CDN even if they will not be cached: the cost of the path from the user to the CDN plus the cost of the path from the CDN to the datacenter might be lower than the cost of the path from the user to the datacenter, because CDNs do a lot of fancy stuff with DNS and edge nodes and backbones and BGP and the speed of light.

In theory, CDNs probably (?) should not have any object limit, and should just opt large objects out of caching. Origins could then reasonably serve large objects through the CDN even if they do not expect them to be cached, since we expect this to usually be a route optimization if the CDN is doing its job.

However, it's also kind of understandable that CloudFront just gives up if the origin server says it's about to unleash >20GB, and we could imagine there might be other reasons to want to bypass the CDN for a subset of requests. Currently, there's no support for this: the origin redirects you to the CDN if you try to make a direct request. That redirect is intentional and important as an XSS protection measure.

So a complete fix here would be to change security.alternate-file-domain into a list of domains with configurable rules for selecting them, then set up two separate domains (one pointed at the CDN, one pointed at the origin) and send CDN traffic to the CDN alternate and non-CDN traffic to the origin alternate. What a mess.
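To make the shape of that change concrete, here is a minimal sketch of "a list of domains with configurable rules". None of this exists today: the rule structure, the domain names, and the choice to select on file size are all assumptions made for illustration, and a real implementation would live in the PHP configuration layer.

```python
from dataclasses import dataclass
from typing import Callable, List

GIB = 1024 ** 3


@dataclass
class FileDomainRule:
    """Hypothetical rule: serve from `domain` when `matches(size)` is true."""
    domain: str
    matches: Callable[[int], bool]  # receives the file size in bytes


# Hypothetical configuration; the first matching rule wins.
# Both domain names are placeholders, not real hosts.
FILE_DOMAIN_RULES: List[FileDomainRule] = [
    # Objects over the assumed 20 GiB CDN limit go to the origin alternate.
    FileDomainRule("origin-files.example.com", lambda size: size > 20 * GIB),
    # Everything else goes to the CDN alternate.
    FileDomainRule("cdn-files.example.com", lambda size: True),
]


def select_file_domain(size_bytes: int,
                       rules: List[FileDomainRule] = FILE_DOMAIN_RULES) -> str:
    """Return the domain of the first rule that matches the file size."""
    for rule in rules:
        if rule.matches(size_bytes):
            return rule.domain
    raise LookupError("no file domain rule matched")
```

With these placeholder rules, a 21 GiB export would be sent to the origin alternate and a small file to the CDN alternate. Both alternates remain separate from the primary domain, which is what the XSS protection depends on.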

In the meantime, the "fix" is to make a request to the origin with a fake CDN "Host" header, e.g.:

$ wget --header "Host: cdn.example.com" https://origin.company.com/file/...

This works great, but there's no way anyone could ever figure it out on their own.
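If the same workaround needs to be scripted, a minimal Python sketch might look like this. It only builds the request and does not send it. The function name is hypothetical, the hostnames in the usage note are placeholders, and the elided file path is left for the caller to supply.

```python
import urllib.request


def build_origin_request(origin_url: str, cdn_host: str) -> urllib.request.Request:
    """Build a GET request to the origin that carries the CDN's Host header.

    Hypothetical helper: `origin_url` is the full URL of the file on the
    origin, and `cdn_host` is the hostname the CDN alternate is served under.
    """
    request = urllib.request.Request(origin_url)
    # urllib keeps an explicitly set Host header instead of deriving one
    # from the URL, which mirrors `wget --header "Host: ..."`.
    request.add_header("Host", cdn_host)
    return request
```

The result can be passed to `urllib.request.urlopen()`. For a multi-gigabyte export, read the response in chunks and write it to disk rather than calling `read()` once.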