Multiserver / High-Availability Configuration
Open · Public

Description

Discussion of multi-host / high-availability stuff. These features serve three major use cases:

  • Large installs that want to improve availability (e.g., if a machine dies, failover should be as painless as possible and not involve full restore from backup).
  • Phacility SAAS, where many installs are served by a homogeneous web tier.
  • Scaling reads for huge/public/open source installs.

The major considerations are:

  • letting many web frontends and daemon hosts access a small number of copies of a repository;
  • having a database failover strategy (and possibly formalizing read/write databases);
  • having a repository failover strategy; and
  • routing SSH requests.

Web/Daemon Access to Repositories: Currently, webservers access repositories by running a PullLocal daemon in --no-discovery mode. This keeps up-to-date copies of repositories on all the web frontends. Facebook is likely the only install which uses this, and its deployment does not currently support hosted repositories (it is months behind the appearance of that feature in the upstream).

Looking forward, in the Phacility SAAS case and in the general case of large installs, this is not a very scalable strategy. We tend to incur costs on the order of O(WebFrontends * NumberAndSizeOfRepositories), because each repository needs to be kept on each frontend. This will hit scaling limits fairly quickly, and we should abandon it as soon as we're able to.
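To make the scaling cost concrete, here is a hypothetical back-of-the-envelope comparison of the two strategies. The host and repository counts are purely illustrative, not measurements from any real install:

```python
# Rough cost model. Under the current strategy, every web frontend keeps a
# full working copy of every repository, so total storage (and sync traffic)
# grows as O(frontends * total repository size). Under the routed strategy,
# only a fixed number of copies of each repository exist.

def replicated_storage_gb(frontends, repo_count, avg_repo_gb):
    """Storage consumed when each frontend mirrors every repository."""
    return frontends * repo_count * avg_repo_gb

def routed_storage_gb(copies, repo_count, avg_repo_gb):
    """Storage consumed when only `copies` hosts hold each repository."""
    return copies * repo_count * avg_repo_gb

# Hypothetical install: 10 web frontends, 200 repositories averaging 2 GB.
print(replicated_storage_gb(10, 200, 2))  # 4000 GB across the web tier
print(routed_storage_gb(2, 200, 2))       # 800 GB with one master + one mirror
```

The point of the model is that the replicated-everywhere cost scales with the size of the web tier, while the routed cost does not.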

The intended strategy for fixing this is to move all repository access to Conduit, and let Conduit route requests to the right place. The web UI already does this; the daemons do not yet, and not all of the infrastructure is in place. Once this works, only one copy of each repository needs to exist across the host pools, and it can satisfy all of the requests to that repository. This will also let us spread repository masters across as many machines as we want, and likewise spread daemons across machines. Finally, we can remove the --no-discovery daemons on the web frontends and make them pure web boxes which run web processes only.

Implementation here is mostly straightforward and many of the building blocks are in place, although it will be time consuming to complete.

Database Failover: Currently, there is no official plan for setting up database contingencies. Likely, this comes in two forms:

  1. You set up a MySQL slave, and when the master fails you point Phabricator at the slave. Phabricator doesn't need to know about this at all.
  2. You set up one or more MySQL slaves, and when the master fails you point Phabricator at a slave. In the meantime, you tell Phabricator about the slaves and it routes read connections to them. There is some discussion of this in T1969, although that task is a sticky mire. The major difficulty with this is figuring out how to approach read-after-write.
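One common way to approach the read-after-write difficulty in option 2 is to pin a session to the master for a short window after any write, so a user never reads a lagging replica immediately after their own write. This is only a sketch of that idea; the class, the pin window, and the session model are all invented for illustration and are not Phabricator's API:

```python
import time

class ConnectionRouter:
    """Route reads to a replica unless the session wrote recently."""

    def __init__(self, master, replicas, pin_seconds=5.0):
        self.master = master
        self.replicas = replicas
        self.pin_seconds = pin_seconds
        self.last_write = {}  # session id -> timestamp of most recent write

    def record_write(self, session_id):
        # Called whenever a session issues a write to the master.
        self.last_write[session_id] = time.time()

    def connection_for_read(self, session_id):
        # A session that wrote within the pin window might not see its own
        # write on a lagging replica, so send it back to the master.
        wrote_at = self.last_write.get(session_id)
        if wrote_at is not None and time.time() - wrote_at < self.pin_seconds:
            return self.master
        # Otherwise any replica is acceptable (round-robin, etc. elided).
        return self.replicas[0] if self.replicas else self.master
```

The tradeoff is that the pin window must exceed worst-case replication lag, or the scheme degrades into option 1 for recently-active sessions.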

Repository Failover: This probably looks like database failover, but we need to do more work on our side. Likely, we'll map each repository to a master and zero or more slaves, and mirror the slaves after commits (by pushing in Git and Mercurial, and with svnsync in SVN?). Since we'll know about the slaves, we can balance reads to them. This has fewer read-after-write problems, although they're still present. Apparently Gitolite does a passable job of this, so I can double-check what it's doing. This seems very easy if the readers can lag, and tractable if they aren't allowed to lag.
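The push-after-commit mirroring described above might look roughly like the following. This is a sketch only: the function name is invented, the SVN/svnsync path is omitted, and real code would need locking, error handling, and alerting when a mirror is unreachable:

```python
import subprocess

def mirror_repository(vcs, local_path, mirror_urls):
    """Push a freshly-updated master working copy to each configured mirror.

    `vcs` is "git" or "hg"; `mirror_urls` are the slave remotes. Intended
    to run as a hook or queued task after each accepted push.
    """
    for url in mirror_urls:
        if vcs == "git":
            # --mirror propagates all refs, including deletions.
            cmd = ["git", "push", "--mirror", url]
        elif vcs == "hg":
            cmd = ["hg", "push", "--force", url]
        else:
            raise ValueError("unsupported VCS: %s" % vcs)
        subprocess.check_call(cmd, cwd=local_path)
```

Running the mirror push synchronously in the hook keeps slaves from lagging, at the cost of slower pushes; queueing it in a daemon accepts lag in exchange for faster writes, which is exactly the lag tradeoff noted above.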

Routing SSH: In the large-scale case, we need to be able to receive SSH on many hosts and route it correctly. We have much of what we need in place to do this (we decode protocol frames and can detect which repository a request targets and whether it's a read or a write very quickly), but don't actually have the interface layer in place where we examine the request and decide how to route it. This needs to get built; for small installs it will just be "route locally".
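The missing interface layer described above amounts to a routing decision once the protocol frame has been decoded. A minimal sketch, with the parsing elided and all names hypothetical (the directory structure here is invented, not an existing Phabricator data structure):

```python
def route_ssh_request(repo, is_write, directory):
    """Decide which host should service a decoded SSH repository request.

    `directory` maps repository identifiers to
    {"master": host, "replicas": [hosts]}. Small installs simply have an
    empty directory and route everything locally.
    """
    entry = directory.get(repo)
    if entry is None:
        # Single-host install or unknown repository: handle it here.
        return "localhost"
    if is_write:
        # Writes must always go to the master.
        return entry["master"]
    # Reads can be served by any replica, falling back to the master.
    replicas = entry.get("replicas") or []
    return replicas[0] if replicas else entry["master"]
```

Because the read/write determination already happens quickly during frame decoding, this dispatch can run before any repository command is executed on the receiving host.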


There are some other considerations:

  • what the management UI looks like;
  • the conduit protocol; and
  • automated failover.

Management UI: Managing host clusters and roles may get complicated, especially in the Phacility case. I'm not sure if it's worthwhile to build a general-purpose tool for it -- basically, something a little bit like Facebook's SMC, where you have a central console for bringing masters down, toggling failover, etc. This might make sense for Phacility but might be an overreach for everyone else. Needs more consideration.

Conduit Protocol: We probably need to do conduit SSH support (T550) and revisit the protocol as part of the proxying junk.

Automated Failover: I don't plan to support this for now, since I think it often causes more problems than it's worth. We can look at this once everything's stable, but for now I'm assuming an admin will actually flip the failover switch if a machine bites the dust, and any detection will focus on alerting rather than recovery.

epriestley created this task. · Via Web · Dec 6 2013, 10:27 PM
epriestley claimed this task.
epriestley added subscribers: epriestley, zeeg.
epriestley added a subscriber: tcook. · Via Web · Dec 13 2013, 1:01 AM
epriestley edited this Task. · Via Legacy · Dec 20 2013, 12:01 AM
epriestley edited this Task. · Via Legacy · Jan 8 2014, 4:36 PM
aarwine added a subscriber: aarwine. · Via Web · Jan 16 2014, 1:19 AM
jbrown added a subscriber: jbrown. · Via Web · Mar 7 2014, 5:20 PM
epriestley edited this Task. · Via Legacy · Mar 7 2014, 5:44 PM
joshuaspence added a subscriber: joshuaspence. · Via Web · Jun 13 2014, 4:57 AM
joshuaspence edited this Task. · Via Legacy · Jun 14 2014, 6:01 PM
bartus added a subscriber: bartus. · Via Web · Jun 15 2014, 6:06 PM
joshuaspence edited this Task. · Via Legacy · Jun 16 2014, 10:41 PM
joshuaspence edited this Task. · Via Legacy · Jun 17 2014, 1:10 AM
jevripio added a subscriber: jevripio. · Via Web · Jun 17 2014, 9:25 AM
joshuaspence edited this Task. · Via Legacy · Jun 17 2014, 9:57 PM
nharkins added a subscriber: nharkins. · Via Web · Jun 17 2014, 11:59 PM
joshuaspence edited this Task. · Via Legacy · Jun 18 2014, 1:44 AM
hach-que added a subscriber: hach-que. · Via Web · Aug 2 2014, 2:42 AM

I'm presuming that in a high availability configuration, there'd be machines running Phabricator that don't serve the web interface, but are instead just responsible for running daemons or for repository hosting / replication?

I have a diff that locks down Phabricator when configured as a daemon tier machine if that's desirable. It basically disables all non-daemon related applications and prevents all non-administrators from logging into the machine.

epriestley added a comment. · Via Web · Aug 2 2014, 6:02 AM

That's potentially useful, yes. I think a rough sketch of the plan of attack here is:

Finish T2783: Most of the remaining work on T2783 can happen at any time: convert remaining calls in Diffusion and the daemons (other than the PullLocal daemon) into Conduit calls. There should be only a handful of these left.

Build a Service Directory: This is a new application which lists all the hosts which provide services. For example: all of the machines running with some Phabricator responsibilities, all the MySQL databases, etc. It might make sense to generalize this. For example, maybe it would be reasonable to list Jenkins instances as services here for Harbormaster to use?
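A service directory entry might minimally record a service name, its hosts, and a few attributes. The data shown here is entirely invented for illustration (field names, hostnames, and service names are not from any real design):

```python
# Hypothetical service directory: "DNS with extensions" as a plain mapping.
SERVICE_DIRECTORY = {
    "mysql.master": {
        "hosts": ["db001.example.com"],
        "properties": {"read_write": True, "available": True},
    },
    "mysql.replica": {
        "hosts": ["db002.example.com", "db003.example.com"],
        "properties": {"read_write": False, "available": True},
    },
    "repo.cluster": {
        "hosts": ["repo001.example.com"],
        "properties": {"repositories": ["X", "Y"]},
    },
}

def lookup_service(name):
    """Resolve a service name to its hosts, skipping unavailable services."""
    entry = SERVICE_DIRECTORY.get(name)
    if entry is None or not entry["properties"].get("available", True):
        return []
    return entry["hosts"]
```

Because the attributes are open-ended key/value properties, the same directory could describe non-Phabricator services (a Jenkins instance for Harbormaster, say) without schema changes.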

At Facebook, there was an internal tool called the "Service Management Console", which basically acted like DNS-with-extensions for services. You could look up a service in a central database (essentially, issue a DNS query for "cdb023.facebook.com") and get a list of available servers, but with a bunch of extra attributes like "this is available", "this is read/write vs read-only", etc. DBAs could swap hosts from the web UI easily, and everyone could see service status. This tool was significantly useful and I suspect it's worthwhile to build something at least slightly general.

One possibility is that this is just part of Drydock, although I think it's likely to have enough meat to justify being a separate application.

Support Host Identity and Authentication: Hosts need to know who they are (so they know which services they should provide, like Conduit vs Web, and which repositories they should host, and which calls should be routed locally vs remotely). They also need to be able to identify themselves to one another. I think the most straightforward way to do this is through private keys, which are sufficient to accomplish both goals. We can maybe even use the machine private keys (/etc/ssh_host_rsa_key, e.g.), with an option to use a specific alternate private key.

If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.
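Signing an HTTP request with a key could work roughly as follows. This sketch uses a shared-secret HMAC for brevity; the design above suggests asymmetric host SSH keys, so treat the mechanism (and all header names) as placeholder assumptions:

```python
import hashlib
import hmac
import time

def sign_request(method, path, body, key):
    """Produce headers proving possession of `key` for this request."""
    timestamp = str(int(time.time()))
    message = "\n".join([method, path, timestamp, body]).encode("utf-8")
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {
        "X-Phabricator-Timestamp": timestamp,
        "X-Phabricator-Signature": signature,
    }

def verify_request(method, path, body, key, headers, max_skew=300):
    """Recompute the signature on the receiving host and check freshness."""
    timestamp = headers["X-Phabricator-Timestamp"]
    if abs(time.time() - int(timestamp)) > max_skew:
        return False  # reject replayed or badly clock-skewed requests
    message = "\n".join([method, path, timestamp, body]).encode("utf-8")
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking the signature byte-by-byte.
    return hmac.compare_digest(expected, headers["X-Phabricator-Signature"])
```

The receiving host would map the presented key back to a service directory entry to establish which internal service is calling, mirroring how SSH key lookup identifies hosts.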

So machines would look up their public key in the service directory. If they find it, they say "Oh, I am phabweb03, I should provide web services to users" or "Oh, I am phabdaemon09, I should provide SSH/Conduit services to other hosts only, and I host these 73 repositories: ...". To make a service call, they look up the correct host and connect to it using their private key to sign an SSH/HTTP request. The other end of the connection looks up the public key and identifies the internal service.

Probably sort out Host routing?: Discussion in T5702. We have some mess around how we route "Host" headers, which might need to get sorted out here. This doesn't really block anything, but the "lock down the web UI" diff probably touches this code and makes it more complicated. By refactoring it to be more general, we could have machines which aren't supposed to provide user web services just not serve the user web services "virtual host", which could be an easy, effective way to lock segments of functionality.

hach-que added a comment. · Via Web · Aug 2 2014, 7:06 AM

> That's potentially useful, yes. I think a rough sketch of the plan of attack here is:
>
> Finish T2783: Most of the remaining work on T2783 can happen at any time: convert remaining calls in Diffusion and the daemons (other than the PullLocal daemon) into Conduit calls. There should be only a handful of these left.
>
> Build a Service Directory: This is a new application which lists all the hosts which provide services. For example: all of the machines running with some Phabricator responsibilities, all the MySQL databases, etc. It might make sense to generalize this. For example, maybe it would be reasonable to list Jenkins instances as services here for Harbormaster to use?

I think having this generalized would be very useful; if something like this was available, I think there's a high probability we'd be using it at our workplace.

> At Facebook, there was an internal tool called the "Service Management Console", which basically acted like DNS-with-extensions for services. You could go look up a central database (approximately, issue a DNS query for "cdb023.facebook.com", essentially) and get a list of available servers, but with a bunch of extra attributes like "this is available", "this is read/write vs read-only", etc. DBAs could swap hosts from the web UI easily, and everyone could see service status. This tool was significantly useful and I suspect it's worthwhile to build something at least slightly general.

Again I think this has the potential to be highly useful; the AWS console doesn't provide the level of granularity needed when attempting to work out where services are running and what particular servers do.

> One possibility is that this is just part of Drydock, although I think it's likely to have enough meat to justify being a separate application.

I'd imagine that Drydock might use the "Service Management Console" to query as to whether hosts are still in a good condition maybe? If we're generalizing the latter I'd expect there to be some way of configuring a "this host is online" ping / HTTP request or something of that nature.

> Support Host Identity and Authentication: Hosts need to know who they are (so they know which services they should provide, like Conduit vs Web, and which repositories they should host, and which calls should be routed locally vs remotely). They also need to be able to identify themselves to one another. I think the most straightforward way to do this is through private keys, which are sufficient to accomplish both goals. We can maybe even use the machine private keys (/etc/ssh_host_rsa_key, e.g.), with an option to use a specific alternate private key.
>
> If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.
>
> So machines would look up their public key in the service directory. If they find it, they say "Oh, I am phabweb03, I should provide web services to users" or "Oh, I am phabdaemon09, I should provide SSH/Conduit services to other hosts only, and I host these 73 repositories: ...". To make a service call, they look up the correct host and connect to it using their private key to sign an SSH/HTTP request. The other end of the connection looks up the public key and identifies the internal service.

This all sounds like a great idea. In particular, Conduit over SSH probably means the daemon and storage tiers don't need to have a web server at all, since all of the API methods can be running over SSH (which is probably far more reliable anyway).

> Probably sort out Host routing?: Discussion in T5702. We have some mess around how we route "Host" headers, which might need to get sorted out here. This doesn't really block anything, but the "lock down the web UI" diff probably touches this code and makes it more complicated. By refactoring it to be more general, we could have machines which aren't supposed to provide user web services just not serve the user web services "virtual host", which could be an easy, effective way to lock segments of functionality.

I think Conduit over SSH should pretty much resolve any need to have a web server running on the daemon / storage tiers. Basically any initial host-specific configuration can be done through bin/config, and any non-locked, runtime host-specific configuration can probably be routed over Conduit / SSH and displayed in the service management console (for example, operations to migrate a Git repository from one host to another, or something like that?)

hach-que added a comment. · Via Web · Aug 2 2014, 7:08 AM

> If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.

I'd also pretty much argue that in a High Availability configuration, you should just be using SSH here and then we don't need to bother with signing HTTP requests or running web servers at the daemon / storage tier.

epriestley added a comment. · Via Web · Aug 2 2014, 7:16 AM

I think the motivation for HTTP is likely to be performance, since the overhead of spinning up an SSH connection to run git cat-file in order to show a user file content may be higher than we want to pay. Pure SSH is fine for the daemons. If we can get away with it, it would definitely be nice to use pure SSH everywhere.

epriestley added a comment. · Via Web · Aug 2 2014, 7:17 AM

Specifically, performance from the Diffusion browse views.

kofalt added a subscriber: kofalt. · Via Web · Aug 4 2014, 6:06 AM
webframp added a subscriber: webframp. · Via Web · Nov 24 2014, 7:42 PM
nickz added a subscriber: nickz. · Via Web · Jan 7 2015, 8:54 PM
joshuaspence added a project: Phacility. · Via Web · Jan 22 2015, 7:40 PM
epriestley moved this task to Do Eventually on the Phacility workboard. · Via Web · Jan 23 2015, 11:50 AM
epriestley mentioned this in Starmap. · Via Web · Wed, Apr 15, 11:34 AM