
Multiserver / High-Availability Configuration
Closed, Duplicate (Public)

Description

Discussion of multi-host / high-availability stuff. These features serve three major use cases:

  • Large installs that want to improve availability (e.g., if a machine dies, failover should be as painless as possible and not involve full restore from backup).
  • Phacility SAAS, where many installs are served by a homogenous web tier.
  • Scaling reads for huge/public/open source installs.

The major considerations are:

  • letting many web frontends and daemon hosts access a small number of copies of a repository;
  • having a database failover strategy (and possibly formalizing read/write databases);
  • having a repository failover strategy; and
  • routing SSH requests.

Web/Daemon Access to Repositories: Currently, webservers access repositories by running a PullLocal daemon in --no-discovery mode. This keeps up-to-date copies of repositories on all the web frontends. Facebook is likely the only install which uses this, and it does not currently support hosted repositories (their deployment is months behind that feature appearing in the upstream).

Looking forward, in the Phacility SAAS case and in the general case of large installs, this is not a very scalable strategy. We tend to incur costs on the order of O(WebFrontends * NumberAndSizeOfRepositories), because each repository needs to be kept on each frontend. This will hit scaling limits fairly quickly, and we should abandon it as soon as we're able to.
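To make that cost concrete, here's an illustrative back-of-the-envelope comparison of the mirror-everywhere model against the routed, one-copy model (all numbers are invented):

```python
# Illustrative arithmetic only: compare total repository storage under the
# current "every frontend mirrors everything" model against the routed
# model with one copy per repository. All numbers are invented.

web_frontends = 10
repositories = 200
avg_repo_size_gb = 2

# Current model: O(WebFrontends * NumberAndSizeOfRepositories).
mirrored_total_gb = web_frontends * repositories * avg_repo_size_gb

# Routed model: one copy of each repository somewhere in the pool.
routed_total_gb = repositories * avg_repo_size_gb

print(mirrored_total_gb)  # 4000
print(routed_total_gb)    # 400
```

Adding a frontend under the routed model costs nothing in repository storage, which is the property that makes the web tier horizontally scalable.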

The intended strategy for accomplishing this is to move all repository access to Conduit, and let Conduit route requests to the right place. The web UI already does this, although the daemons do not yet, and not all of the infrastructure is in place here. When this does work, it means that we only need one copy of each repository to exist across the host pools, and it can satisfy all of the requests to that repository. This will also let us spread repository masters across as many machines as we want, and also spread daemons across machines. Finally, we can remove the --no-discovery daemons on the web frontends and make them pure web boxes which run web processes only.

Implementation here is mostly straightforward and many of the building blocks are in place, although it will be time consuming to complete.

Database Failover: Currently, there is no official plan for setting up database contingencies. Likely, this comes in two forms:

  1. You set up a MySQL slave, and when the master fails you point Phabricator at the slave. Phabricator doesn't need to know about this at all.
  2. You set up one or more MySQL slaves, and when the master fails you point Phabricator at a slave. In the meantime, you tell Phabricator about the slaves and it routes read connections to them. There is some discussion of this in T1969, although that task is a sticky mire. The major difficulty with this is figuring out how to approach read-after-write.
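As a sketch of form (2), here is one way to route reads to replicas while handling read-after-write; this is not Phabricator's actual implementation, and the hostnames and sticky-window policy are assumptions:

```python
import random
import time

class ConnectionRouter:
    """Route writes to the master and reads to replicas, except during a
    short "sticky" window after a session writes, when its reads also go
    to the master so the session always sees its own writes."""

    def __init__(self, master, replicas, sticky_seconds=5.0):
        self.master = master
        self.replicas = list(replicas)
        self.sticky_seconds = sticky_seconds
        self._last_write = {}  # session id -> timestamp of last write

    def host_for(self, session_id, is_write):
        now = time.time()
        if is_write:
            self._last_write[session_id] = now
            return self.master
        wrote_at = self._last_write.get(session_id)
        if wrote_at is not None and now - wrote_at < self.sticky_seconds:
            return self.master  # a replica may not have replayed the write yet
        if not self.replicas:
            return self.master  # no replicas configured: this is form (1)
        return random.choice(self.replicas)

router = ConnectionRouter("db-master", ["db-replica1", "db-replica2"])
print(router.host_for("alice", is_write=True))   # db-master
print(router.host_for("alice", is_write=False))  # db-master (sticky window)
```

A time-based window is only a heuristic; a stricter approach would compare replica replication positions against the master, at the cost of extra bookkeeping per request.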

Repository Failover: This probably looks like database failover, but we need to do more work on our side. Likely, we'll map each repository to a master and zero or more slave(s), and mirror the slaves after commits (by pushing in Git and Mercurial, and with svnsync in SVN?). Since we'll know about the slaves, we can balance reads to them. This has fewer read-after-write problems, although they're still present. Apparently Gitolite does a passable job of this, so I can double check what it's doing. This seems very easy if the readers can lag, and tractable if they aren't allowed to lag.
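A hedged sketch of the mirroring step described above (the paths and mirror URIs are hypothetical, and this builds commands rather than claiming to be Phabricator's actual code):

```python
import shlex

def mirror_commands(repo_path, mirrors, vcs="git"):
    """Build the shell commands that would re-mirror a repository to its
    replicas after a write lands. Paths and URIs are hypothetical."""
    if vcs == "git":
        return ["git -C %s push --mirror %s" % (shlex.quote(repo_path), shlex.quote(m))
                for m in mirrors]
    if vcs == "hg":
        return ["hg -R %s push %s" % (shlex.quote(repo_path), shlex.quote(m))
                for m in mirrors]
    if vcs == "svn":
        # svnsync works the other way around: each replica pulls from the
        # master, so the sync command runs against the replica's URI.
        return ["svnsync sync %s" % shlex.quote(m) for m in mirrors]
    raise ValueError("unknown vcs: %r" % vcs)

for cmd in mirror_commands("/var/repo/libphutil",
                           ["ssh://repo2.example.com/var/repo/libphutil"]):
    print(cmd)
```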

Routing SSH: In the large-scale case, we need to be able to receive SSH on many hosts and route it correctly. We have much of what we need in place to do this (we decode protocol frames and can detect which repository a request targets and whether it's a read or a write very quickly), but don't actually have the interface layer in place where we examine the request and decide how to route it. This needs to get built; for small installs it will just be "route locally".
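As a toy illustration of that routing decision (this is not the actual protocol-frame decoder; the service map, hostnames, and command parsing are invented for the example):

```python
import re

# Hypothetical map of repository -> master host and read replicas.
SERVICES = {
    "libphutil": {"master": "repo1.example.com", "replicas": ["repo2.example.com"]},
}

# Commands Git sends over SSH: upload-pack serves reads (fetch/clone),
# receive-pack serves writes (push).
READ_COMMANDS = {"git-upload-pack"}
WRITE_COMMANDS = {"git-receive-pack"}

def route(ssh_original_command):
    """Inspect the requested command, identify the target repository and
    whether it's a read or a write, and pick a host to forward to."""
    match = re.match(r"(git-[a-z-]+) '/?([^/']+?)(?:\.git)?'", ssh_original_command)
    if not match:
        raise ValueError("unrecognized command")
    command, repo = match.groups()
    service = SERVICES[repo]
    if command in WRITE_COMMANDS:
        return service["master"]       # writes must hit the master
    if command in READ_COMMANDS and service["replicas"]:
        return service["replicas"][0]  # reads can go to a replica
    return service["master"]           # small installs: everything is local

print(route("git-upload-pack '/libphutil.git'"))   # repo2.example.com
print(route("git-receive-pack '/libphutil.git'"))  # repo1.example.com
```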


There are some other considerations:

  • what the management UI looks like;
  • the conduit protocol; and
  • automated failover.

Management UI: Managing host clusters and roles may get complicated, especially in the Phacility case. I'm not sure if it's worthwhile to build a general-purpose tool for it -- basically, something a little bit like Facebook's SMC, where you have a central console for bringing masters down, toggling failover, etc. This might make sense for Phacility but might be an overreach for everyone else. Needs more consideration.

Conduit Protocol: We probably need to do conduit SSH support (T550) and revisit the protocol as part of the proxying junk.

Automated Failover: I don't plan to support this for now, since I think it often causes more problems than it's worth. We can look at this once everything's stable, but for now I'm assuming an admin will actually flip the failover switch if a machine bites the dust, and any detection will focus on alerting rather than recovery.


Event Timeline

epriestley claimed this task.
epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added subscribers: epriestley, zeeg.

I'm presuming that in a high-availability configuration, there'd be machines running Phabricator without a web interface, which are just responsible for running daemons or repository hosting / replication?

I have a diff that locks down Phabricator when configured as a daemon tier machine if that's desirable. It basically disables all non-daemon related applications and prevents all non-administrators from logging into the machine.

That's potentially useful, yes. I think a rough sketch of the plan of attack here is:

Finish T2783: Most of the remaining work on T2783 can happen at any time: convert remaining calls in Diffusion and the daemons (other than the PullLocal daemon) into Conduit calls. There should be only a handful of these left.

Build a Service Directory: This is a new application which lists all the hosts which provide services. For example: all of the machines running with some Phabricator responsibilities, all the MySQL databases, etc. It might make sense to generalize this. For example, maybe it would be reasonable to list Jenkins instances as services here for Harbormaster to use?

At Facebook, there was an internal tool called the "Service Management Console", which basically acted like DNS-with-extensions for services. You could go look up a central database (approximately, issue a DNS query for "cdb023.facebook.com", essentially) and get a list of available servers, but with a bunch of extra attributes like "this is available", "this is read/write vs read-only", etc. DBAs could swap hosts from the web UI easily, and everyone could see service status. This tool was significantly useful and I suspect it's worthwhile to build something at least slightly general.

One possibility is that this is just part of Drydock, although I think it's likely to have enough meat to justify being a separate application.

Support Host Identity and Authentication: Hosts need to know who they are (so they know which services they should provide, like Conduit vs Web, and which repositories they should host, and which calls should be routed locally vs remotely). They also need to be able to identify themselves to one another. I think the most straightforward way to do this is through private keys, which are sufficient to accomplish both goals. We can maybe even use the machine private keys (/etc/ssh_host_rsa_key, e.g.), with an option to use a specific alternate private key.

If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.

So machines would look up their public key in the service directory. If they find it, they say "Oh, I am phabweb03, I should provide web services to users" or "Oh, I am phabdaemon09, I should provide SSH/Conduit services to other hosts only, and I host these 73 repositories: ...". To make a service call, they look up the correct host and connect to it using their private key to sign an SSH/HTTP request. The other end of the connection looks up the public key and identifies the internal service.
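A minimal sketch of that lookup, assuming a directory keyed by public key fingerprint (every fingerprint, hostname, and repository name below is invented for illustration):

```python
# The service directory maps each host's public key fingerprint to its
# identity: which roles it serves and which repositories it hosts.
# All fingerprints, hostnames, and repository names are invented.

SERVICE_DIRECTORY = {
    "SHA256:aaaa": {"host": "phabweb03", "roles": ["web"], "repositories": []},
    "SHA256:bbbb": {"host": "phabdaemon09", "roles": ["ssh", "conduit"],
                    "repositories": ["libphutil", "arcanist"]},
}

def identify(fingerprint):
    """Resolve a connecting host's key to its directory entry. Unknown
    keys are refused, which is what authenticates internal calls."""
    entry = SERVICE_DIRECTORY.get(fingerprint)
    if entry is None:
        raise PermissionError("unknown host key; refusing service call")
    return entry

me = identify("SHA256:bbbb")
print(me["host"], me["roles"])  # phabdaemon09 ['ssh', 'conduit']
```

The same lookup serves both directions: a host identifies itself on boot to learn its responsibilities, and a receiving host identifies a caller to decide whether to serve it.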

Probably sort out Host routing?: Discussion in T5702. We have some mess around how we route "Host" headers, which might need to get sorted out here. This doesn't really block anything, but the "lock down the web UI" diff probably touches this code and makes it more complicated. By refactoring it to be more general, we could have machines which aren't supposed to provide user web services just not serve the user web services "virtual host", which could be an easy, effective way to lock segments of functionality.

> That's potentially useful, yes. I think a rough sketch of the plan of attack here is:
>
> Finish T2783: Most of the remaining work on T2783 can happen at any time: convert remaining calls in Diffusion and the daemons (other than the PullLocal daemon) into Conduit calls. There should be only a handful of these left.
>
> Build a Service Directory: This is a new application which lists all the hosts which provide services. For example: all of the machines running with some Phabricator responsibilities, all the MySQL databases, etc. It might make sense to generalize this. For example, maybe it would be reasonable to list Jenkins instances as services here for Harbormaster to use?

I think having this generalized would be very useful; if something like this was available, I think there's a high probability we'd be using it at our workplace.

> At Facebook, there was an internal tool called the "Service Management Console", which basically acted like DNS-with-extensions for services. You could go look up a central database (approximately, issue a DNS query for "cdb023.facebook.com", essentially) and get a list of available servers, but with a bunch of extra attributes like "this is available", "this is read/write vs read-only", etc. DBAs could swap hosts from the web UI easily, and everyone could see service status. This tool was significantly useful and I suspect it's worthwhile to build something at least slightly general.

Again I think this has the potential to be highly useful; the AWS console doesn't provide the level of granularity needed when attempting to work out where services are running and what particular servers do.

> One possibility is that this is just part of Drydock, although I think it's likely to have enough meat to justify being a separate application.

I'd imagine that Drydock might use the "Service Management Console" to query whether hosts are still in good condition? If we're generalizing the latter, I'd expect there to be some way of configuring a "this host is online" ping / HTTP request or something of that nature.

> Support Host Identity and Authentication: Hosts need to know who they are (so they know which services they should provide, like Conduit vs Web, and which repositories they should host, and which calls should be routed locally vs remotely). They also need to be able to identify themselves to one another. I think the most straightforward way to do this is through private keys, which are sufficient to accomplish both goals. We can maybe even use the machine private keys (/etc/ssh_host_rsa_key, e.g.), with an option to use a specific alternate private key.
>
> If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.
>
> So machines would look up their public key in the service directory. If they find it, they say "Oh, I am phabweb03, I should provide web services to users" or "Oh, I am phabdaemon09, I should provide SSH/Conduit services to other hosts only, and I host these 73 repositories: ...". To make a service call, they look up the correct host and connect to it using their private key to sign an SSH/HTTP request. The other end of the connection looks up the public key and identifies the internal service.

This all sounds like a great idea. In particular, Conduit over SSH probably means the daemon and storage tiers don't need to have a web server at all, since all of the API methods can be running over SSH (which is probably far more reliable anyway).

> Probably sort out Host routing?: Discussion in T5702. We have some mess around how we route "Host" headers, which might need to get sorted out here. This doesn't really block anything, but the "lock down the web UI" diff probably touches this code and makes it more complicated. By refactoring it to be more general, we could have machines which aren't supposed to provide user web services just not serve the user web services "virtual host", which could be an easy, effective way to lock segments of functionality.

I think Conduit over SSH should pretty much resolve any need to have a web server running on the daemon / storage tiers. Basically, any initial host-specific configuration can be done through bin/config, and any non-locked, runtime host-specific configuration can probably be routed over Conduit / SSH and displayed in the service management console (for example, operations to migrate a Git repository from one host to another, or something like that?)

> If SSH is enabled, machines can make service calls over SSH directly (ssh ... conduit method.name). If SSH is not enabled, we can sign HTTP requests using keys to achieve the same effect.

I'd also pretty much argue that in a High Availability configuration, you should just be using SSH here and then we don't need to bother with signing HTTP requests or running web servers at the daemon / storage tier.

I think the motivation for HTTP is likely to be performance, since the overhead of spinning up an SSH connection to run git cat-file in order to show a user file content may be higher than we want to pay. Pure SSH is fine for the daemons. If we can get away with it, it would definitely be nice to use pure SSH everywhere.

Specifically, performance from the Diffusion browse views.

> Specifically, performance from the Diffusion browse views.

What about using persistent SSH control connections to avoid spinning up a new connection for each request? I can't think of any reason that wouldn't work.
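For reference, OpenSSH connection multiplexing -- the mechanism the question refers to -- is configured roughly like this (the host pattern and socket path are illustrative):

```
# ~/.ssh/config -- reuse one authenticated TCP connection for many
# ssh invocations. Host pattern and ControlPath are illustrative.
Host phabrepo*
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```

Subsequent `ssh` commands matching the pattern attach to the existing master connection instead of performing a fresh TCP and key exchange handshake.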

I think the first SSH connection is still more expensive than several HTTP connections, and we can't easily just pool all traffic from a host over a single connection because a host may run multiple instances of Phabricator -- and we already built all the HTTP stuff anyway.

Specifically, here's the progress of the stuff I outlined above:

  • Finish T2783: Nearly complete for Git; D11874 might be the last callsite. Needs more work for Mercurial/Subversion.
  • Build a Service Directory: This is the Almanac application, which has been in production in this role in the Phacility cluster since February.
  • Host Identity: Conduit has supported public/private key authentication over HTTP since early this year; this has also been in production in the Phacility cluster since launch.
  • Host Headers: Host handling got split out fairly nicely and is visible in ConfigSites.

On the top-level goals:

  • Web/Daemon Access to Repositories: Web access is effectively complete and has been in production since February, with one currently known bug (T9319). Daemon access is substantially complete but needs a bit more work (the "Finish T2783" stuff).
  • Database Failover: No progress on this.
  • Repository Failover: Some indirect progress, but this mostly depends on moving T4292 forward.
  • Routing SSH: Complete and in production since February.
  • Management UI: Substantially complete (Almanac).

Overall, if you have an exceptionally detailed understanding of technically-functional-but-mostly-undocumented Phabricator features, here's roughly what you can deploy in a cluster today and soon:

                   | Today                                  | After T2783                                                         | After T4292
Web Hosts          | Unlimited                              | Unlimited                                                           | Unlimited
Daemon Hosts       | 1 (must also run repos on this host)   | Unlimited                                                           | Unlimited
Repository Hosts   | 1 (must also run daemons on this host) | Unlimited, but losing a host impacts service for some repositories  | Unlimited
Database Hosts     | 1                                      | 1                                                                   | 1
Notification Hosts | 1                                      | 1                                                                   | 1

In cases I've noted as "Unlimited" without qualification, losing hosts does not impact service availability (except that you'll have less capacity).

In all cases, a single host can serve multiple roles (you can put a total of 2 hosts in production, put repo + daemon + web on each, and get HA on those services after T4292).

The amount of work in T2783 is not very large, but not trivial either.

The amount of work in T4292 is a bit more substantial, but I think it's well-defined and surmountable.

We haven't made any progress on databases and the pathway forward there isn't very concrete, although I don't think it's hugely complex overall.

The current HA plan for the notification server is "suffer without it until it gets fixed". It would probably be relatively easy to make this HA (or "more HA"), but it doesn't seem terribly important.

Although some of this is in production, there's essentially zero documentation on any of it, and I don't expect installs to be able to figure it out on their own. Today you can only configure "half-HA-of-unimportant-nodes", which is great if you're running the Phacility cluster and primarily care about serving a large number of Phabricator instances on a single hardware pool, but I assume it's not hugely useful for anyone else. I expect to complete at least T2783 before we have a real user-facing narrative for configuring this stuff -- ideally both T2783 and T4292, and really ideally also get databases sorted.

eadler added a project: Restricted Project.Jan 9 2016, 12:34 AM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 9 2016, 12:37 AM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Feb 24 2016, 12:07 AM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 7 2016, 6:05 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 7 2016, 6:07 PM

I'm merging this into T10751, which is a cleaner followup without two years of outdated history. The goals remain the same.