
Make Phabricator Highly Available
Closed, Resolved · Public

Authored By epriestley, Apr 8 2016, 6:23 PM

Description

This is a followup to T4209, which is an old task with a long history. The heart of that task is still relevant, but most of the details have since fallen out of date so I'm wiping the slate clean.

The current state of the world is that there are two major development pathways to improve availability:

  • Daemons + Repositories: Allow installs to run daemons and repositories on a bunch of hosts in different datacenters and transparently survive the loss of most of them.
  • Databases: Support replicas and manual promotion in a first-class way. I don't currently plan to survive the loss of the primary database completely transparently, but we can make Phabricator understand replicas and implement a degraded read-only mode.

These pathways serve a "very little downtime" disaster recovery plan: after a datacenter explodes, operations personnel verify and promote a replica, you lose essentially no data (maybe a few seconds' worth that hadn't replicated out of the blast radius yet), and Phabricator runs in a degraded mode until the promotion happens. We do not currently plan to solve any hard consensus problems or automatically fail over the master without human intervention; we can consider those cases once the manual switch works.

The major tasks on the repositories pathway are T2783 (allow daemons to run anywhere) and T4292 (allow repositories to have multiple masters).

The major task on the databases pathway is T4571 (implement a read-only mode).

There are some other services (Drydock, Notifications) which may need additional availability plans in the long term, but losing these is currently not usually a big deal and they aren't stateful so no data is at risk. If your datacenter exploded, you probably don't care too much that notifications aren't realtime for a while.

The short-term plan for availability is to prototype both pathways and get a better sense of how involved they really are, then build them out once there's a clearer picture of which changes can have the greatest impact.

Related Objects

28 related tasks: 20 resolved, 8 open (mostly assigned to epriestley).

Event Timeline

There are a very large number of changes, so older changes are hidden.

I'm probably done for the night, here's a quick update:

  • Daemons got some planning work (collected in T10756). The immediate issue is that the PullLocal daemon needs to be way smarter about figuring out which repositories are expected to be on the current host, which runs into the repository stuff.
  • Haven't touched the repository stuff yet.
  • Databases seem like they may be the more straightforward of the two tracks and progress has been easy so far, so I'm focusing on that for now.

In practice, nothing very useful has come of this yet. You can:

  • kind of inspect replication status; and
  • manually put an install in read-only mode and sort of limp along.

I think there is no real-world reason to ever do either of these things yet.

Screen Shot 2016-04-09 at 3.06.12 PM.png (656×1 px, 121 KB)

Screen Shot 2016-04-09 at 3.23.06 PM.png (763×1 px, 120 KB)
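
For anyone following along, the manual read-only switch is just a configuration flag. A minimal sketch, assuming the cluster.read-only option these changes add (check the current documentation for the exact key name):

$ ./bin/config set cluster.read-only true
# The install now refuses writes and limps along serving read-only traffic.
$ ./bin/config set cluster.read-only false
# Back to normal read/write operation.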

Goal | Ref | Time | Notes
Daemons | D11874 | N/A | Fix last local git operation.
Daemons | T10756 | 1 Hour | Organize outstanding daemon issues.
Databases | D15661, D15662 | 0.5 Hours | Very basic manual read-only mode.
Databases | D15663 | 1 Hour | Plan and add cluster.databases and write fantasy documentation.
Databases | db001/002.epriestley.com | 0.5 Hours | Learn how to configure MySQL replication.
Databases | D15667 | 1 Hour | Cluster database status console.
Databases | D15668 | 1 Hour | Read database masters from cluster config.
Subtotal | | 5 Hours |
Cumulative Total | | 5 Hours |
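
For context, cluster.databases is a list of database hosts with assigned roles, set in local configuration on each host. A rough sketch of its shape, assuming the host/role/port/user/pass keys described in the (fantasy, for now) documentation; hostnames are the db001/db002 test hosts from the table above:

$ ./bin/config set cluster.databases '[
  {
    "host": "db001.epriestley.com",
    "role": "master",
    "port": 3306,
    "user": "phabricator",
    "pass": "secret"
  },
  {
    "host": "db002.epriestley.com",
    "role": "replica",
    "port": 3306,
    "user": "phabricator",
    "pass": "secret"
  }
]'

With a master and a replica defined like this, the web tier can read masters from cluster config and fall back to the replica when the master is unreachable.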

Progress so far:

  • Phabricator now survives the loss of the primary database by degrading into read-only mode and serving traffic using a replica (at least, in a controlled sandbox environment).
  • Phabricator can now automatically degrade to read-only mode briefly if it can't connect to the master (temporary interruption).
  • Phabricator can now automatically health-check databases and degrade into a more persistent read-only mode if they repeatedly fail checks (major interruption), and return to read/write mode if/when they recover.

severed.png (869×1 px, 102 KB)
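
The replica itself is plain MySQL replication; Phabricator doesn't set that up for you. A rough sketch of the standard MySQL 5.x setup, not Phabricator-specific (the repl user is an example, and the log file/position come from SHOW MASTER STATUS):

# On the master (db001), in my.cnf: give the server an ID and enable binary logging.
#   server-id = 1
#   log_bin   = mysql-bin

$ mysql -h db001.epriestley.com -u root -p
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'replica-password';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
mysql> SHOW MASTER STATUS;

# On the replica (db002), in my.cnf: server-id = 2. Then point it at the master:
$ mysql -h db002.epriestley.com -u root -p
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='db001.epriestley.com',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='replica-password',
    ->   MASTER_LOG_FILE='mysql-bin.000001',
    ->   MASTER_LOG_POS=4;
mysql> START SLAVE;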

Database availability is on fairly solid footing now (for a feature with zero production hours, at least -- I'm definitely not encouraging anyone to actually deploy this yet). I want to put this install (secure.phabricator.com) into real HA cluster mode before pursuing more database changes, so I can get more confidence that the core works. I believe most remaining features build on top of this core and are primarily quality-of-life work (although some, like the ability to establish new sessions while in read-only mode, are very important quality-of-life improvements).

Before doing that, I'm planning to pursue the other pathway (repositories) so I can get that basically functional as well, then consolidate all the operations work needed to swap us into HA mode. There are two major pieces there:

  • Changing repositories to understand that multiple copies may exist on different hosts (and whether a copy should exist on the current host or not, so PullLocal can get smarter).
  • Implementing the logical clocks / synchronization / pull-before-clone stuff to make multiple masters work, described in T4292.

This may run into T10748 at some point but I need to plan the details out a bit first. We probably don't need any editable controls in the UI for a while and may be fine until then.

Goal | Ref | Time | Notes
Databases | D15669, D15670, D15671, D15672 | 1 Hour | Degrade with no masters.
Databases | D15673, D15674 | 1 Hour | Degrade with unreachable masters.
Databases | D15677 | 1 Hour | Degrade more after degrading a little bit.
Databases | D15679 | 0.5 Hours | Tweaks to degradation, documentation.
Subtotal | | 3.5 Hours |
Cumulative Total | | 8.5 Hours |

I've started moving these changes into production (T10784) so I can begin gathering confidence that they actually work. The repository stuff definitely still doesn't work since several core pieces are currently implemented as // TODO comments, but it has the synchronization bookkeeping in place for all the locking/versioning: if that's stable, it will give me confidence that we're on relatively firm ground and can move forward.

This has hit a few minor issues and will likely hit a few more before I'm done, but seems to be going alright so far. Two larger issues:


Way Too Much Header Magic: Most of the issues I've hit are related to header/configuration magic. Currently, we deploy a SiteSource in the Phacility cluster and on this host which does a lot of header fiddling to make the production clustering (which is focused on multi-tenancy, not availability) work properly.

One example is that security.require-https needs to be disabled for cluster-originated requests if you plan to terminate SSL at the load balancer and run intracluster requests over plain HTTP, as we currently do.

Another example is that X-Forwarded-For is not trusted by default, but should be trusted when provided by a load balancer. Likewise, X-Forwarded-Proto should be trusted if the load balancer provides it.

I suspect Phabricator should try to get smarter about dealing with more of this stuff automatically, so we can reduce the need for weird custom header magic. T10784 discusses raising a better error about security.require-https, which is a start, but I think it will be more fruitful to be more ambitious about this. Particularly, we should be able to trust hosts in cluster.addresses by default (or, if necessary, add a separate cluster.nice-trustworthy-load-balancer-address-blocks option), and mostly get things right without additional work once that's configured.
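
Concretely, the Phabricator side of this mostly reduces to a couple of settings today; a sketch, assuming the existing cluster.addresses and security.require-https options (the address block is an example value):

# Describe which address blocks contain cluster hosts and load balancers.
$ ./bin/config set cluster.addresses '["10.0.0.0/16"]'

# Per the example above: required when SSL terminates at the load balancer and
# intracluster requests travel over plain HTTP.
$ ./bin/config set security.require-https false

Trusting X-Forwarded-For / X-Forwarded-Proto still requires the custom header fiddling described above; the proposal is to make that automatic for hosts in cluster.addresses.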


Notification Server: The biggest blocker for putting multiple hosts behind the secure.phabricator.com ELB is that the ELB also routes notification traffic, and the notification server does not currently support any kind of clustering.

I could resolve this by running it through a different ELB (as we do in the production cluster), or by just forwarding port 22280 on secure002 directly to port 22280 on secure001 (as at least one other install does in T6915), but I suspect a more ambitious pathway through T6915 / T10697 may not be particularly long, and this is work we need to do eventually anyway.
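
For reference, the naive port-forwarding stopgap is a one-liner on secure002; a sketch using socat (any TCP relay works equally well):

# Relay notification traffic to the single notification server on secure001.
$ socat TCP-LISTEN:22280,fork,reuseaddr TCP:secure001.phacility.net:22280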

I currently plan to give this some more thought since I'm not sure I have a clustering plan I'm completely happy with yet, but I'm leaning toward pursuing this in the short term.


Goal | Ref | Time | Notes
Daemons | D15682 | 0.5 Hours | Let PullLocal daemon launch anywhere.
Repositories | D15683, D15685 | 0.5 Hours | Surface cluster status in UI.
Repositories | D15688 | 1 Hour | Groundwork for repository synchronization.
Daemons | D15689 | 0.5 Hours | Documentation consolidation.
Production | T10784 | 1 Hour | Begin putting changes into production.
Subtotal | | 3.5 Hours |
Cumulative Total | | 12 Hours |

Given current progress and velocity, I'm aiming for a timeline roughly like this:

  • This week (April 16): make all core services on this host (secure) highly available.
  • Next week (April 23): maybe ready for third-party use some time around here? This may be a little ambitious.

I slightly reduced the amount of header magic, and the notification stuff is now cluster-ready (in theory, at least). I'm planning to get notifications, web nodes, and databases redundant in production today, and maybe repositories/daemons if nothing else comes up. No issues with repository bookkeeping so far, but even if it's smooth sailing it still needs a fair bit of UI work, and the daemons do too.
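
In theory, pointing an install at clustered notification servers is now just configuration. A rough sketch of the shape of notification.servers; the exact keys are whatever the notification documentation ends up specifying, and the hosts/ports here are examples:

$ ./bin/config set notification.servers '[
  {
    "type": "client",
    "host": "secure.phabricator.com",
    "port": 22280,
    "protocol": "https"
  },
  {
    "type": "admin",
    "host": "127.0.0.1",
    "port": 22281,
    "protocol": "http"
  }
]'

Here, the "client" entry is what browsers connect to, and the "admin" entry is what the web tier uses to publish notifications.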


Goal | Ref | Time | Notes
Config | D15696 | 0.5 Hours | Make headers less magical.
Notifications | D15698, D15700, D15701, D15702, D15703, D15705, D15708, D15709 | (Outside Scope) | Notifications quality of life changes.
Notifications | D15711 | 1.5 Hours | Probably support notification server clustering?
Subtotal | | 2 Hours |
Cumulative Total | | 14 Hours |

A big chunk of this is finally in production and seems to be working, which I'm thrilled about.


Now In Production

Web Servers: This service (secure.phabricator.com) is now backed by two redundant web nodes in different AWS availability zones. You can visit the load balancer status page and reload it a few times to hit the different hosts.

Databases: We now run a primary on secure001 and a replica on secure002:

Screen Shot 2016-04-14 at 4.58.20 PM.png (264×630 px, 28 KB)

I tested failover by shutting down the master; Phabricator failed over to read-only mode correctly, and recovered cleanly when I restored service.
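
For these failover tests, the MySQL-side view of replica health is also worth checking before and after pulling the plug; a quick sketch:

$ mysql -h secure002.phacility.net -u root -p -e 'SHOW SLAVE STATUS\G' | \
    grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'

The Phabricator-side view of the same information is the cluster database status console shown above.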

Daemons: Both hosts are running daemons. The UI needs some updates to make this more clear and there's probably some remaining work here, but I killed the daemons on one host and the obvious things still worked properly.

Notifications: Both hosts are running read/write notification servers:

Screen Shot 2016-04-14 at 5.00.57 PM.png (230×725 px, 35 KB)

This appears to be working properly.

SSH: We got this one for free since I built it in Phacility a long time ago, but SSH is being proxied by both boxes to the underlying repository service.


Remaining Work

Repositories: These aren't redundant yet and we'd currently lose them if secure001 exploded.

Updated Daemon UI: The current UI has a bunch of issues and does not make it clear where daemons are running.

SSH: This works in the Phacility cluster, but is wholly unrealistic for third parties to configure today and has zero documentation. We also run a modified sshd, but shouldn't require installs to.


Goal | Ref | Time | Notes
Databases | D15714, D15716, D15717 | 0.5 Hours | Improve bin/storage awareness of replicas.
Production | T10784 | 3 Hours | Make web, database, daemons and notifications redundant on this host.
Subtotal | | 3.5 Hours |
Cumulative Total | | 15.5 Hours |

Just some cleanup today. New daemon console now shows which hosts daemons are running on:

Screen Shot 2016-04-15 at 4.36.25 PM.png (299×805 px, 68 KB)

The setup for secure has been stable so far.


Goal | Ref | Time | Notes
Daemons | D15724 | 0.5 Hours | Improve daemon console UI for multi-host setups.
Subtotal | | 0.5 Hours |
Cumulative Total | | 16 Hours |

secure is now running repositories in fully redundant multi-master mode. This appears to be working, although it has only been live for about 20 minutes and survived about three pushes so far, so I'm not yet brimming with confidence about it.

If this holds, I think that's about the end of the "hard" stuff. Still plenty of usability / documentation / cleanup / support / performance work remaining (for example, multi-master repositories are currently almost impossible to configure).


Goal | Ref | Time | Notes
Repositories | D15747, D15748, D15759 | 1 Hour | Repository bookkeeping improvements.
Repositories | D15751, D15752, D15754, D15755, D15757, D15758, D15761 | 4 Hours | Support multi-master repositories.
Subtotal | | 5 Hours |
Cumulative Total | | 21 Hours |

Mixed progress. Repository clustering has been stable in production since the last update, and appears to actually work, which is good. We've hit a few minor snags (like a missing sync-before-read in diffusion.querycommits) but all pretty much expected stuff and nothing troubling.

However, progress on making repository clustering plausible to deploy, configure, and administrate has been slow. There's a lot of ground to cover: Diffusion is already hard to configure, and clustering makes everything harder. I've made some progress that I'm happy with, and I've also tried to reduce and simplify the existing complexity (particularly the sudo/ssh situation, which is especially convoluted), but without much success so far.

Beyond configuration, there's an issue of how to choose which cluster service to host a repository on when creating it. This overlaps heavily with T10748 since EditEngine + custom forms seem like they're probably the best solution, although the current rule ("if there's exactly one, choose that") is probably fine in the short term since I don't expect installs to be deploying multiple clusters for some time.

We had an unrelated production incident this morning that I'm still cleaning up today, and I want to focus on tying up loose ends tomorrow before the release cut, so this probably won't make too much more headway this week.

Here's the general state of the world I expect for this release:

  • Database Replication: Deploy freely. There are rough edges, but deploying this should be reasonable today and should strictly put installs in a more available state. The documentation is up to date (see Clustering: Databases).
  • Aphlict: Deploy cautiously. I think this still has a couple of wrinkles that need to be worked out but it is straightforward to configure and seems to be working properly for the most part.
  • Repositories: Do not deploy. They work but are impossible to configure.
  • Daemons: Must wait for repositories.
  • Web: Must wait for repositories.
  • SSH: Must wait for repositories.

Goal | Ref | Time | Notes
Repositories | D15766, D15768 | 1 Hour | Provide recovery documentation and tooling.
Repositories | D15778, D15772, D15765, D15764, D15763 | 1 Hour | Better monitoring, more documentation.
Web | D15775 | 0.5 Hours | Prevent CDN cache poisoning during deployments.
Subtotal | | 2.5 Hours |
Cumulative Total | | 23.5 Hours |

I'm still not really thrilled with the state of the repository documentation, but it's getting a little more manageable.

Handling of potential link disruption during writes (discussed in T10860) should be better now, too. I think it should now be fairly hard to freeze repositories outside of an actual disaster which legitimately puts data at risk.
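
If a repository does freeze after a genuinely ambiguous write, recovery is an explicit administrative action; a sketch, assuming the bin/repository thaw tooling from the recovery work above (the repository monogram is a placeholder):

# Declare that the copy on secure001 is authoritative; other devices will
# synchronize from it.
$ ./bin/repository thaw R123 --promote secure001.phacility.net

# Or, discard the copy on a device which may have missed a write.
$ ./bin/repository thaw R123 --demote secure002.phacility.net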

The ssh link got a lot more chatty:

$ git push
...
# Push received by "secure001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "secure002.phacility.net"...
# Acquired read lock immediately.
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "secure002.phacility.net".
Counting objects: 13, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (13/13), 4.77 KiB | 0 bytes/s, done.
Total 13 (delta 11), reused 0 (delta 0)
# Released cluster write lock.
...

This is probably over-tuned a bit, but should make the operation more obvious and hopefully help debug any issues which do arise. We can quiet it down later.


Goal | Ref | Time | Notes
Repositories | D15783, D15786 | 0.5 Hours | Minor repository sync fixes.
Documentation | D15787, D15778, D15788, D15794, D15795, D15796, D15798 | 2 Hours | Correct and improve documentation.
Repositories | D15789, D15790, D15791, D15792 | 2 Hours | Make repositories chatty and more freeze-resistant when network links are interrupted during a push.
Subtotal | | 4.5 Hours |
Cumulative Total | | 28 Hours |

T10748, which provides the new cluster management UI for repositories, is nearing completion. This makes cluster setup more manageable and at least provides a pathway forward for Almanac stuff in the future, although it does not yet make Almanac / service management explicitly selectable when creating a repository (for now, new repositories allocate on a random open service).

Most of what remains there is UI/UX and documentation; I expect to complete that, then return here and finish the repository/web/SSH documentation, then consolidate where things are today and plan where it makes the most sense to go next.

The last round of repository changes have been stable in production here on secure, and deployed to the Phacility cluster last week. The cluster is not running HA repository services yet (repository services are still single-node clusters) but it is hitting about 95% of the code (all the locking/versioning/proxying/etc).

T10748 is wrapping up, and repositories now show cluster status in the "Storage" panel of the new UI:

Screen Shot 2016-05-11 at 4.39.13 PM.png (989×1 px, 136 KB)

(The "Last Writer" column is a bug which I'll fix shortly.)

Per above, I'm expecting to complete the remaining documentation which was waiting on this next, then do some PM'ish stuff to consolidate the state of the world so there's a clearer picture of exactly where things stand and what's expected to work now vs not work yet.

I think this is the current state of the world. I'm planning to tackle the three major issues noted below (observed repository versioning, repository lock granularity, device enrollment behavior) and then try to wind down this phase, move all this followup work to a new "make clustering more powerful" sort of task, and then plan where to go from there based on which goals are most important (scalability vs multi-region vs read-only vs hg/svn vs drydock vs whatever crops up as installs actually deploy this stuff).

From this point, future work has few interconnections and can be prioritized and pursued mostly independently.


General

Today, cluster behavior has a greater emphasis on multi-tenancy and availability/loss-resistance than it does on scalability and multi-region deployments. A mixture of general changes and service-specific changes could improve its suitability for these tasks in the future.

Databases

The database read-only failover mode still has a number of limitations, some of which may be fairly severe. Improving this mode could reduce the disruption associated with losing masters.

Database clusters only provide redundancy and availability today: no traffic is sent to the replica unless the master is unreachable. In the future, we could send some reads to the replica during normal operation. This likely has the greatest impact for open source installs with a large public userbase.

Database replicas must be promoted manually. I have no current plans to attempt automatic promotion because of the mortal danger that mistakes represent.

All service types other than databases fail over automatically today.

Repositories

Hosted Git repositories are in good shape, but other types of repositories still have limited or nonexistent support.

In the future, better administrative tools and smarter connection management could improve the behavior of multi-region clusters.

  • (Major) Observed repositories (vs hosted) do not version correctly.
  • (Future) No explicit management of which cluster service repositories allocate on.
  • (Future) No Mercurial support.
  • (Future) No Subversion support.
  • (Future) T10883: Allow repository cluster nodes to be read-only

Daemons and CLI

Some daemon and administrative behaviors still have rough edges.

Notifications

Notification HTTP / HTTPS could be made easier to configure.

Drydock

Drydock could be significantly better at detecting and recovering from broken resources and lost hosts than it is today. In particular, if you lose connectivity to a datacenter you're probably looking at some degree of manually pulling bad hosts out of the pool.

Two of the three issues discussed above (lock granularity, severity of device enrollment) got tackled. I think I have an attack on versioning observed repositories, but I'm not going to try to squeeze it in before the release cut today.

Observed repositories should now version in a reasonable way.

I've created followups to track the other points above that didn't previously have dedicated tracking tasks:

This also wasn't explicitly covered above, but is effectively a followup:

I'm going to pause this for feedback since we have no more planned direct action here.

(We are pursuing application partitioning (T11044) and improved drydock recovery (T8153) outside of this, but I'll track those separately since they're only tangentially related to this.)


Goal | Ref | Time | Notes
Repositories | D15986 | 1 Hour | Version observed repositories in a reasonable way.
Subtotal | | 1 Hour |
Cumulative Total | | 29 Hours |
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Jun 6 2016, 4:27 PM

This spawned a healthy set of followups, but I believe the core work described here is now resolved and accounted for and all future work is covered in narrower followup tasks.

urzds added a subscriber: urzds.