This is a follow-up to T4209, which is an old task with a long history. The heart of that task is still relevant, but most of the details have since fallen out of date, so I'm wiping the slate clean.
The current state of the world is that there are two major development pathways to improve availability:
- Daemons + Repositories: Allow installs to run daemons and repositories on a bunch of hosts in different datacenters and transparently survive losses of most of them.
- Databases: Support replicas and manual promotion in a first-class way. I don't currently plan to survive the loss of the primary database completely transparently, but we can make Phabricator understand replicas and implement a degraded read-only mode.
Together, these pathways serve a "very little downtime" disaster recovery plan: after a datacenter explodes, operations personnel verify and promote a replica, you lose no data (or at most a few seconds of writes that hadn't replicated out of the blast radius yet), and Phabricator can run in a degraded mode until the promotion happens. We do not currently plan to solve any hard consensus problems or automatically fail over the primary without human intervention. We can consider these cases once the manual switch works.
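To make the intended behavior concrete, here is a rough Python sketch of the manual-promotion flow: writes fail while the primary is unreachable, reads fall back to a healthy replica, and nothing changes until an operator explicitly promotes a replica. This is only an illustration under assumed names (`DatabaseCluster`, `ReadOnlyModeError`, the `db-a`/`db-b` hosts are all hypothetical), not how Phabricator implements or will implement it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while no primary is available."""


@dataclass
class DatabaseCluster:
    # All names here are hypothetical and exist only for illustration.
    primary: Optional[str]
    replicas: List[str] = field(default_factory=list)
    healthy: Set[str] = field(default_factory=set)

    def is_read_only(self) -> bool:
        # Degraded mode: the primary is missing or unreachable.
        return self.primary is None or self.primary not in self.healthy

    def host_for_read(self) -> str:
        # Prefer the primary; otherwise fall back to any healthy replica.
        if not self.is_read_only():
            return self.primary
        for replica in self.replicas:
            if replica in self.healthy:
                return replica
        raise RuntimeError("no healthy database host is available")

    def host_for_write(self) -> str:
        # There is no automatic failover: writes fail until a human promotes.
        if self.is_read_only():
            raise ReadOnlyModeError("primary is unavailable; read-only mode")
        return self.primary

    def promote(self, replica: str) -> None:
        # Manual step, run by an operator after verifying the replica.
        if replica not in self.replicas:
            raise ValueError("%s is not a known replica" % replica)
        if replica not in self.healthy:
            raise ValueError("%s is not healthy" % replica)
        self.replicas.remove(replica)
        self.primary = replica
```

For example, with `primary="db-a"` and `replicas=["db-b"]`, marking `db-a` unhealthy makes `host_for_write()` raise `ReadOnlyModeError` while `host_for_read()` returns `db-b`; after `promote("db-b")`, writes go to `db-b`. The point of the sketch is that promotion is an explicit call rather than something the cluster decides on its own.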
The major tasks on the repositories pathway are T2783 (allow daemons to run anywhere) and T4292 (allow repositories to have multiple masters).
The major task on the databases pathway is T4571 (implement a read-only mode).
There are some other services (Drydock, Notifications) which may need additional availability plans in the long term, but losing them is usually not a big deal right now, and they aren't stateful, so no data is at risk. If your datacenter exploded, you probably don't care too much that notifications aren't realtime for a while.
The short-term plan for availability is to prototype both pathways and get a better sense of how involved they really are, then build them out once there's a clearer picture of which changes can have the greatest impact.