Discussion of multi-host / high-availability support. These features serve three major use cases:
- Large installs that want to improve availability (e.g., if a machine dies, failover should be as painless as possible and not involve full restore from backup).
- Phacility SAAS, where many installs are served by a homogeneous web tier.
- Scaling reads for huge/public/open source installs.
The major considerations are:
- letting many web frontends and daemon hosts access a small number of copies of a repository;
- having a database failover strategy (and possibly formalizing read/write databases);
- having a repository failover strategy; and
- routing SSH requests.
Web/Daemon Access to Repositories: Currently, webservers access repositories by running a PullLocal daemon in --no-discovery mode. This keeps up-to-date copies of repositories on all the web frontends. Facebook is likely the only install which uses this, and it does not currently support hosted repositories (their deployment is months behind the point where hosted repositories appeared in the upstream).
Looking forward, in the Phacility SAAS case and in the general case of large installs, this is not a very scalable strategy. We tend to incur costs on the order of O(WebFrontends * NumberAndSizeOfRepositories), because each repository needs to be kept on each frontend. This will hit scaling limits fairly quickly, and we should abandon it as soon as we're able to.
The intended strategy for replacing it is to move all repository access to Conduit, and let Conduit route requests to the right place. The web UI already does this, although the daemons do not yet, and not all of the infrastructure is in place. Once this works, only one copy of each repository needs to exist across the host pools, and it can satisfy all of the requests for that repository. This will also let us spread repository masters across as many machines as we want, and spread daemons across machines as well. Finally, we can remove the --no-discovery daemons on the web frontends and make them pure web boxes which run web processes only.
Implementation here is mostly straightforward and many of the building blocks are in place, although it will be time consuming to complete.
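To make the routing idea concrete, here is a minimal sketch (in Python, with hypothetical names like REPOSITORY_HOSTS, call_conduit, and serve_locally; this is not actual Phabricator code): a single authoritative map records which host holds each working copy, and any other host proxies the request over Conduit rather than keeping its own copy.

```
# Hypothetical sketch: route a repository request to whichever host
# holds the working copy, instead of keeping a copy on every frontend.
import socket

# Single source of truth for which host holds each working copy.
# In practice this would live in the database, not in code.
REPOSITORY_HOSTS = {
    "LIBPHUTIL": "repo001.example.com",
    "PHABRICATOR": "repo002.example.com",
}

def call_conduit(host, method, params):
    """Placeholder for an HTTP Conduit call to another host."""
    raise NotImplementedError

def serve_locally(method, params):
    """Placeholder for serving the request from a local working copy."""
    raise NotImplementedError

def route_repository_request(callsign, method, params):
    target = REPOSITORY_HOSTS[callsign]
    if target == socket.getfqdn():
        # This host holds the working copy; serve directly.
        return serve_locally(method, params)
    # Otherwise, proxy the request over Conduit to the host that does.
    return call_conduit(target, method, params)
```

The property we care about is that adding a web frontend no longer adds a working copy; a frontend only needs to be able to look up and reach the host that has one.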
Database Failover: Currently, there is no official plan for database contingencies. Likely, support comes in two forms:
- You set up a MySQL slave, and when the master fails you point Phabricator at the slave. Phabricator doesn't need to know about this at all.
- You set up one or more MySQL slaves, and when the master fails you point Phabricator at a slave. In the meantime, you tell Phabricator about the slaves and it routes read connections to them. There is some discussion of this in T1969, although that task is a sticky mire. The major difficulty with this is figuring out how to approach read-after-write (one possible approach is sketched below).
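As a rough sketch of how read routing might handle read-after-write, assuming a simple "stick to the master for a few seconds after a write" policy (the class, names, and threshold here are hypothetical, not a decided design):

```
# Hypothetical sketch of read/write connection routing with a crude
# read-after-write guard: after any write, the session keeps reading
# from the master for a short window so it always sees its own writes.
import random
import time

STICKY_WINDOW_SEC = 5  # assumed upper bound on typical replication lag

class ConnectionRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self.last_write_at = 0.0

    def get_write_connection(self):
        self.last_write_at = time.time()
        return self.master

    def get_read_connection(self):
        # Recent write in this session: read from the master to avoid
        # returning stale data from a lagging slave.
        if time.time() - self.last_write_at < STICKY_WINDOW_SEC:
            return self.master
        if not self.slaves:
            return self.master
        return random.choice(self.slaves)
```

Keeping a session's reads on the master briefly after it writes means the session always sees its own writes, at the cost of some read capacity; it does not help other sessions that expect to observe the write immediately, which is the harder part of the problem.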
Repository Failover: This probably looks like database failover, but we need to do more work on our side. Likely, we'll map each repository to a master and zero or more slaves, and mirror the slaves after commits (by pushing in Git and Mercurial, and with svnsync in SVN?). Since we'll know about the slaves, we can balance reads to them. This has fewer read-after-write problems, although they're still present. Apparently Gitolite does a passable job of this, so I can double-check what it's doing. This seems very easy if the readers can lag, and tractable if they aren't allowed to lag.
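A minimal sketch of the mirroring step for Git, assuming it runs after a push lands on the master (the function name, paths, and slave URLs are illustrative):

```
# Hypothetical sketch: after a push lands on the repository master,
# mirror it to each slave so reads can be balanced across them.
import subprocess

def mirror_git_repository(local_path, slave_urls):
    """Push the master's copy to every slave; report which ones failed."""
    failed = []
    for url in slave_urls:
        result = subprocess.run(
            ["git", "push", "--mirror", url],
            cwd=local_path,
            capture_output=True,
        )
        if result.returncode != 0:
            # A failed mirror means this slave is now stale; reads should
            # avoid it (or tolerate lag) until it catches up.
            failed.append(url)
    return failed

# Example usage from a post-receive hook on the master:
# stale = mirror_git_repository(
#     "/var/repo/PHABRICATOR",
#     ["ssh://repo002.example.com/var/repo/PHABRICATOR"])
```

Slaves that fail to mirror are effectively lagging, so the read balancer has to either skip them or accept stale reads from them until they catch up; that's where the "can the readers lag?" question above matters.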
Routing SSH: In the large-scale case, we need to be able to receive SSH on many hosts and route it correctly. We have much of what we need in place to do this (we decode protocol frames and can quickly detect which repository a request targets and whether it's a read or a write), but we don't yet have the interface layer where we examine the request and decide how to route it. This needs to get built; for small installs it will just be "route locally".
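Once the protocol frame has been decoded, the routing decision itself is small. A sketch under the assumption of a per-repository master/slave map (the host names and maps are illustrative; the frame decoding already exists and is not shown):

```
# Hypothetical sketch of the routing decision an SSH frontend would make
# after decoding the protocol frame and identifying the repository and
# whether the request is a read or a write.
REPOSITORY_MASTERS = {"PHABRICATOR": "repo002.example.com"}
REPOSITORY_SLAVES = {"PHABRICATOR": ["repo003.example.com"]}
LOCAL_HOST = "web001.example.com"

def route_ssh_request(callsign, is_write):
    master = REPOSITORY_MASTERS.get(callsign)
    if master is None or master == LOCAL_HOST:
        # Small installs: everything lives here, so just route locally.
        return LOCAL_HOST
    if is_write:
        # Writes must go to the master copy.
        return master
    # Reads can go to the master or any slave; a real router would
    # balance across candidates and skip hosts known to be lagging.
    candidates = [master] + REPOSITORY_SLAVES.get(callsign, [])
    return candidates[0]
```

For a single-host install the map is empty or points at the local host, so every request falls through to "route locally", which matches what small installs need today.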
There are some other considerations:
- what the management UI looks like;
- the Conduit protocol; and
- automated failover.
Management UI: Managing host clusters and roles may get complicated, especially in the Phacility case. I'm not sure if it's worthwhile to build a general-purpose tool for it -- basically, something a little bit like Facebook's SMC, where you have a central console for bringing masters down, toggling failover, etc. This might make sense for Phacility but might be an overreach for everyone else. Needs more consideration.
Conduit Protocol: We probably need to do Conduit SSH support (T550) and revisit the protocol as part of the proxying junk.
Automated Failover: I don't plan to support this for now, since I think it often causes more problems than it's worth. We can look at this once everything's stable, but for now I'm assuming an admin will actually flip the failover switch if a machine bites the dust, and any detection will focus on alerting rather than recovery.