Although we're probably in OK shape, our CDN implementation suffers from the various races that Facebook suffered in 2007. In particular, this is possible:
- We push one web machine.
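- A user hits it, and the page they get requests an updated resource from the CDN: /res/abcdef/stuff.js.
- The CDN's request for that resource hits an unpushed web machine, which returns the old resource.
- The CDN caches the old content under the new URI, so the cache is poisoned.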
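
The sequence above can be reproduced with a toy model. This is only a sketch: the WebMachine and CDN classes are hypothetical stand-ins, and it assumes that the hash segment in the path is purely a cache-buster (a web machine serves whatever copy it has on disk) and that the CDN caches by full path without revalidating.

```python
class WebMachine:
    """A web machine that serves whichever build it currently has."""

    def __init__(self, resources):
        self.resources = dict(resources)

    def serve(self, path):
        # Assumption: the hash segment in the path is only a cache-buster,
        # so the machine returns its own copy regardless of the hash.
        name = path.rsplit("/", 1)[-1]
        return self.resources[name]


class CDN:
    """A CDN that caches by full path and never revalidates."""

    def __init__(self):
        self.cache = {}

    def get(self, path, origin):
        # On a miss, fetch from whichever origin the request was routed to.
        if path not in self.cache:
            self.cache[path] = origin.serve(path)
        return self.cache[path]


pushed = WebMachine({"stuff.js": "new contents"})
unpushed = WebMachine({"stuff.js": "old contents"})
cdn = CDN()

# The pushed machine rendered a page pointing at the new URI, but the CDN's
# origin fetch lands on the unpushed machine.
path = "/res/abcdef/stuff.js"
first = cdn.get(path, unpushed)

# Finishing the push does not help: the CDN already holds the old bytes.
unpushed.resources["stuff.js"] = "new contents"
later = cdn.get(path, pushed)
```

Under these assumptions both `first` and `later` hold the old contents, which is the poisoned state: the new URI keeps serving the old file until the CDN entry expires or is purged.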
For now, we can work around this by (a) stopping the world for updates and (b) updating the whole web tier before the admin tier.
In the nearish term, we can push resources into the content instances' database cache before updating web machines, which should get us enough runway for a while.
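
A minimal sketch of that ordering, assuming a shared cache keyed by the full resource path that every web machine consults before falling back to its local files. The names here (serve, prewarm, shared_cache, local_resources) are hypothetical, and a plain dict stands in for the database cache.

```python
def resource_name(path):
    # "/res/abcdef/stuff.js" -> "stuff.js"
    return path.rsplit("/", 1)[-1]


def serve(path, shared_cache, local_resources):
    """Serve from the shared cache first, then from this machine's files."""
    cached = shared_cache.get(path)
    if cached is not None:
        return cached
    return local_resources[resource_name(path)]


def prewarm(shared_cache, new_resources):
    """Step 1 of a push: publish new resources before touching any machine."""
    shared_cache.update(new_resources)


shared_cache = {}
unpushed_local = {"stuff.js": "old contents"}

# Publish the new build's resources, then (not shown) update web machines.
prewarm(shared_cache, {"/res/abcdef/stuff.js": "new contents"})

# An unpushed machine now answers the new URI correctly.
served = serve("/res/abcdef/stuff.js", shared_cache, unpushed_local)
```

Because the cache is keyed by the hashed path, requests for older URIs still fall through to whatever the machine has locally, so prewarming should not disturb pages rendered by unpushed machines.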