
Make CDN resource population more robust in the Phacility cluster
Closed, Resolved · Public

Description

Although we're probably in OK shape, our CDN implementation is vulnerable to the same races Facebook ran into in 2007. In particular, this sequence is possible:

  • We push one web machine.
  • User hits it, makes a request to the CDN for an updated resource: /res/abcdef/stuff.js.
  • The request hits an unpushed web machine and gets the old resource.
  • The cache is poisoned (see the sketch after this list).
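
Concretely, the poisoning happens because the CDN keys its cache on the full URL (hash included) while a naive origin handler ignores the hash segment entirely. Here is a minimal sketch of such a handler; the names are hypothetical and Phabricator itself is PHP, so this is purely illustrative:

    # This machine has not been pushed yet, so it only has the old bytes.
    LOCAL_RESOURCES = {"stuff.js": b"old contents"}

    def serve_resource(path):
        # path looks like "/res/abcdef/stuff.js", where "abcdef" is the
        # content hash the already-pushed machines are advertising.
        _, _, requested_hash, name = path.split("/")
        # BUG: requested_hash is never checked, so the old bytes go out
        # under the new URL and the CDN caches them keyed by that URL.
        return 200, LOCAL_RESOURCES[name]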

For now, we can work around this by (a) stopping the world for updates and (b) updating the whole web tier before the admin tier.

In the nearish term, we can push resources into the content instances' database caches before updating web machines, which should give us enough runway for a while.
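
A minimal sketch of that pre-population step, assuming a key-value cache shared by all web machines (the key format and names are stand-ins, not Phacility's actual schema):

    import hashlib

    def prepopulate_resource_cache(cache, resources):
        # Write each new resource into the shared cache, keyed by its
        # content hash, before any web machine is pushed. Afterward, any
        # machine can serve the exact bytes for a hash it lacks locally.
        for name, content in resources.items():
            digest = hashlib.sha1(content).hexdigest()[:8]
            cache["resource:%s:%s" % (digest, name)] = content

    # Usage: a plain dict stands in for the shared database cache.
    shared_cache = {}
    prepopulate_resource_cache(shared_cache, {"stuff.js": b"new contents"})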

Event Timeline

epriestley raised the priority of this task to Low.
epriestley updated the task description.
epriestley added a project: Phacility.
epriestley updated the task description.
epriestley added a subscriber: epriestley.

After T7172, there is no longer cache cross-contamination between the admin tier and the web tier, so we no longer need to follow rule (b) to deploy safely.

We do still need to stop the world, but we'll be doing that for some time anyway.

When Phabricator is in a production configuration, we should start checking hashes and 404'ing requests with bad hashes to improve robustness. There's no real reason we don't do this now, except that Akamai was aggressive about caching 404s at Facebook in 2007.
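
A sketch of what that check could look like, continuing the hypothetical cache layout from the earlier snippets:

    def serve_resource_checked(cache, path):
        _, _, requested_hash, name = path.split("/")
        content = cache.get("resource:%s:%s" % (requested_hash, name))
        if content is None:
            # Refuse to guess: a 404 is cheap for the client to retry,
            # while a wrong body cached by the CDN poisons every user
            # behind that edge.
            return 404, b""
        return 200, content

The historical caveat still applies: the CDN has to be told not to cache those 404s for long, or a transient miss during a push becomes its own poisoning problem.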

epriestley claimed this task.

I think this was effectively resolved by D15775, which improved behavior for multiple cluster frontends.

We'll now serve a best-effort resource, but prevent it from being cached if we weren't able to serve it exactly. I think this is the best we can hope for short of running a fully separate static resource server like the one we had at Facebook.
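
As a rough sketch of that behavior (the header values and names are illustrative; see D15775 for the actual change):

    def serve_resource_best_effort(cache, path, local_resources):
        _, _, requested_hash, name = path.split("/")
        exact = cache.get("resource:%s:%s" % (requested_hash, name))
        if exact is not None:
            # Exact hash match: safe for the CDN to cache for a long time.
            return 200, {"Cache-Control": "max-age=31536000"}, exact
        stale = local_resources.get(name)
        if stale is None:
            return 404, {"Cache-Control": "no-cache"}, b""
        # Best effort: probably the right bytes, but never let the CDN pin
        # them under a hash they may not match.
        return 200, {"Cache-Control": "no-cache"}, stale

Serving the fallback uncacheable trades a little extra origin traffic during pushes for a guarantee that a mismatched body can never be pinned at the edge.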