
Make CDN resource population more robust in the Phacility cluster
Closed, Resolved · Public

Description

Although we're probably in OK shape, our CDN implementation is vulnerable to the same races Facebook ran into in 2007. In particular, this sequence is possible:

  • We push one web machine.
  • User hits it, makes a request to the CDN for an updated resource: /res/abcdef/stuff.js.
  • The request hits an unpushed web machine and gets the old resource.
  • The cache is poisoned (see the sketch after this list).
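
Concretely, the poisoning happens because the CDN keys its cache on the full URL (hash included) while a naive origin handler ignores the hash segment entirely. Here is a minimal sketch of such a handler; the names are hypothetical and Phabricator itself is PHP, so this is purely illustrative:

    # This machine has not been pushed yet, so it only has the old bytes.
    LOCAL_RESOURCES = {"stuff.js": b"old contents"}

    def serve_resource(path):
        # path looks like "/res/abcdef/stuff.js", where "abcdef" is the
        # content hash the already-pushed machines are advertising.
        _, _, requested_hash, name = path.split("/")
        # BUG: requested_hash is never checked, so the old bytes go out
        # under the new URL and the CDN caches them keyed by that URL.
        return 200, LOCAL_RESOURCES[name]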

For now, we can work around this by (a) stopping the world for updates and (b) updating the whole web tier before the admin tier.

In the nearish term, we can push resources into the content instances' database caches before updating web machines, which should give us enough runway for a while.
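
A minimal sketch of that pre-population step, assuming a key-value cache shared by all web machines (the key format and names are stand-ins, not Phacility's actual schema):

    import hashlib

    def prepopulate_resource_cache(cache, resources):
        # Write each new resource into the shared cache, keyed by its
        # content hash, before any web machine is pushed. Afterward, any
        # machine can serve the exact bytes for a hash it lacks locally.
        for name, content in resources.items():
            digest = hashlib.sha1(content).hexdigest()[:8]
            cache["resource:%s:%s" % (digest, name)] = content

    # Usage: a plain dict stands in for the shared database cache.
    shared_cache = {}
    prepopulate_resource_cache(shared_cache, {"stuff.js": b"new contents"})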

Event Timeline

epriestley raised the priority of this task to Low.
epriestley updated the task description.
epriestley added a project: Phacility.
epriestley updated the task description.
epriestley added a subscriber: epriestley.

After T7172, there is no longer cache cross-contamination between the admin tier and the web tier, so we no longer need to follow rule (b) to deploy safely.

We do still need to stop the world, but we'll be doing that for some time anyway.

When Phabricator is in a production configuration, we should start checking hashes and 404'ing requests with bad hashes to improve robustness. There's no real reason we don't do this now, except that Akamai was aggressive about caching 404s at Facebook in 2007.
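
A sketch of what that check could look like, continuing the hypothetical cache layout from the earlier snippets:

    def serve_resource_checked(cache, path):
        _, _, requested_hash, name = path.split("/")
        content = cache.get("resource:%s:%s" % (requested_hash, name))
        if content is None:
            # Refuse to guess: a 404 is cheap for the client to retry,
            # while a wrong body cached by the CDN poisons every user
            # behind that edge.
            return 404, b""
        return 200, content

The historical caveat still applies: the CDN has to be told not to cache those 404s for long, or a transient miss during a push becomes its own poisoning problem.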

epriestley claimed this task.

I think this was effectively resolved by D15775, which improved behavior for multiple cluster frontends.

We'll now serve a best-effort resource, but prevent it from being cached if we weren't able to serve it exactly. I think this is the best we can hope for short of running a fully separate static resource server like the one we had at Facebook.
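
As a rough sketch of that behavior (the header values and names are illustrative; see D15775 for the actual change):

    def serve_resource_best_effort(cache, path, local_resources):
        _, _, requested_hash, name = path.split("/")
        exact = cache.get("resource:%s:%s" % (requested_hash, name))
        if exact is not None:
            # Exact hash match: safe for the CDN to cache for a long time.
            return 200, {"Cache-Control": "max-age=31536000"}, exact
        stale = local_resources.get(name)
        if stale is None:
            return 404, {"Cache-Control": "no-cache"}, b""
        # Best effort: probably the right bytes, but never let the CDN pin
        # them under a hash they may not match.
        return 200, {"Cache-Control": "no-cache"}, stale

Serving the fallback uncacheable trades a little extra origin traffic during pushes for a guarantee that a mismatched body can never be pinned at the edge.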