
Make Phabricator Highly Available
Closed, Resolved · Public

Authored By epriestley, Apr 8 2016, 6:23 PM

Description

This is a followup to T4209, which is an old task with a long history. The heart of that task is still relevant, but most of the details have since fallen out of date so I'm wiping the slate clean.

The current state of the world is that there are two major development pathways to improve availability:

  • Daemons + Repositories: Allow installs to run daemons and repositories on a bunch of hosts in different datacenters and transparently survive the loss of most of them.
  • Databases: Support replicas and manual promotion in a first-class way. I don't currently plan to survive the loss of the primary database completely transparently, but we can make Phabricator understand replicas and implement a degraded read-only mode.

These pathways serve a "very little downtime" disaster recovery plan: after a datacenter explodes, operations personnel verify and promote a replica, you lose essentially no data (maybe a few seconds' worth that hadn't replicated out of the blast radius yet), and Phabricator runs in a degraded mode until the promotion happens. We do not currently plan to solve any hard consensus problems or automatically fail over the master without human intervention; we can consider those cases once the manual switch works.

The major tasks on the repositories pathway are T2783 (allow daemons to run anywhere) and T4292 (allow repositories to have multiple masters).

The major task on the databases pathway is T4571 (implement a read-only mode).

There are some other services (Drydock, Notifications) which may need additional availability plans in the long term, but losing these is currently not usually a big deal and they aren't stateful so no data is at risk. If your datacenter exploded, you probably don't care too much that notifications aren't realtime for a while.

The short-term plan for availability is to prototype both pathways and get a better sense of how involved they really are, then build them out once there's a clearer picture of which changes can have the greatest impact.

Related Objects

28 related tasks: 20 resolved, 8 open (mostly assigned to epriestley).

Event Timeline

There are a very large number of changes, so older changes are hidden.

I'm probably done for the night, here's a quick update:

  • Daemons got some planning work (collected in T10756). The immediate issue is that the PullLocal daemon needs to be way smarter about figuring out which repositories are expected to be on the current host, which runs into the repository stuff.
  • Haven't touched the repository stuff yet.
  • Databases seem like they may be the more straightforward of the two tracks and progress has been easy so far, so I'm focusing on that for now.

In practice, nothing very useful has come of this yet. You can:

  • kind of inspect replication status; and
  • manually put an install in read-only mode and sort of limp along.

I think there is no real-world reason to ever do either of these things yet.

Screen Shot 2016-04-09 at 3.06.12 PM.png (656×1 px, 121 KB)

Screen Shot 2016-04-09 at 3.23.06 PM.png (763×1 px, 120 KB)
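
For anyone following along, the manual read-only switch is just a configuration flag. A minimal sketch, assuming the cluster.read-only option these changes add (check the current documentation for the exact key name):

$ ./bin/config set cluster.read-only true
# The install now refuses writes and limps along serving read-only traffic.
$ ./bin/config set cluster.read-only false
# Back to normal read/write operation.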

Goal | Ref | Time | Notes
Daemons | D11874 | N/A | Fix last local git operation.
Daemons | T10756 | 1 Hour | Organize outstanding daemon issues.
Databases | D15661, D15662 | 0.5 Hours | Very basic manual read-only mode.
Databases | D15663 | 1 Hour | Plan and add cluster.databases and write fantasy documentation.
Databases | db001/002.epriestley.com | 0.5 Hours | Learn how to configure MySQL replication.
Databases | D15667 | 1 Hour | Cluster database status console.
Databases | D15668 | 1 Hour | Read database masters from cluster config.
Subtotal | | 5 Hours |
Cumulative Total | | 5 Hours |
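
For context, cluster.databases is a list of database hosts with assigned roles, set in local configuration on each host. A rough sketch of its shape, assuming the host/role/port/user/pass keys described in the (fantasy, for now) documentation; hostnames are the db001/db002 test hosts from the table above:

$ ./bin/config set cluster.databases '[
  {
    "host": "db001.epriestley.com",
    "role": "master",
    "port": 3306,
    "user": "phabricator",
    "pass": "secret"
  },
  {
    "host": "db002.epriestley.com",
    "role": "replica",
    "port": 3306,
    "user": "phabricator",
    "pass": "secret"
  }
]'

With a master and a replica defined like this, the web tier can read masters from cluster config and fall back to the replica when the master is unreachable.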

Progress so far:

  • Phabricator now survives the loss of the primary database by degrading into read-only mode and serving traffic using a replica (at least, in a controlled sandbox environment).
  • Phabricator can now automatically degrade to read-only mode briefly if it can't connect to the master (temporary interruption).
  • Phabricator can now automatically health-check databases and degrade into a more persistent read-only mode if they repeatedly fail checks (major interruption), and return to read/write mode if/when they recover.

severed.png (869×1 px, 102 KB)
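
The replica itself is plain MySQL replication; Phabricator doesn't set that up for you. A rough sketch of the standard MySQL 5.x setup, not Phabricator-specific (the repl user is an example, and the log file/position come from SHOW MASTER STATUS):

# On the master (db001), in my.cnf: give the server an ID and enable binary logging.
#   server-id = 1
#   log_bin   = mysql-bin

$ mysql -h db001.epriestley.com -u root -p
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'replica-password';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
mysql> SHOW MASTER STATUS;

# On the replica (db002), in my.cnf: server-id = 2. Then point it at the master:
$ mysql -h db002.epriestley.com -u root -p
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='db001.epriestley.com',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='replica-password',
    ->   MASTER_LOG_FILE='mysql-bin.000001',
    ->   MASTER_LOG_POS=4;
mysql> START SLAVE;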

Database availability is on fairly solid footing now (for a feature with zero production hours, at least -- I'm definitely not encouraging anyone to actually deploy this yet). I want to put this install (secure.phabricator.com) into real HA cluster mode before pursuing more database changes, so I can get more confidence that the core works. I believe most remaining features build on top of this core and are primarily quality-of-life work (although some, like the ability to establish new sessions while in read-only mode, are very important quality-of-life improvements).

Before doing that, I'm planning to pursue the other pathway (repositories) so I can get that basically functional as well, then consolidate all the operations work needed to swap us into HA mode. There are two major pieces there:

  • Changing repositories to understand that multiple copies may exist on different hosts (and whether a copy should exist on the current host or not, so PullLocal can get smarter).
  • Implementing the logical clocks / synchronization / pull-before-clone stuff to make multiple masters work, described in T4292.

This may run into T10748 at some point but I need to plan the details out a bit first. We probably don't need any editable controls in the UI for a while and may be fine until then.

Goal | Ref | Time | Notes
Databases | D15669, D15670, D15671, D15672 | 1 Hour | Degrade with no masters.
Databases | D15673, D15674 | 1 Hour | Degrade with unreachable masters.
Databases | D15677 | 1 Hour | Degrade more after degrading a little bit.
Databases | D15679 | 0.5 Hours | Tweaks to degradation, documentation.
Subtotal | | 3.5 Hours |
Cumulative Total | | 8.5 Hours |

I've started moving these changes into production (T10784) so I can begin gathering confidence that they actually work. The repository stuff definitely still doesn't work since several core pieces are currently implemented as // TODO comments, but it has the synchronization bookkeeping in place for all the locking/versioning: if that's stable, it will give me confidence that we're on relatively firm ground and can move forward.

This has hit a few minor issues and will likely hit a few more before I'm done, but seems to be going alright so far. Two larger issues:


Way Too Much Header Magic: Most of the issues I've hit are related to header/configuration magic. Currently, we deploy a SiteSource in the Phacility cluster and on this host which does a lot of header fiddling to make the production clustering (which is focused on multi-tenancy, not availability) work properly.

One example is that security.require-https needs to be disabled for cluster-originated requests if you plan to terminate SSL at the load balancer and run intracluster requests over plain HTTP, as we currently do.

Another example is that X-Forwarded-For is not trusted by default, but should be trusted when provided by a load balancer. Likewise, X-Forwarded-Proto should be trusted if the load balancer provides it.

I suspect Phabricator should try to get smarter about dealing with more of this stuff automatically, so we can reduce the need for weird custom header magic. T10784 discusses raising a better error about security.require-https, which is a start, but I think it will be more fruitful to be more ambitious about this. Particularly, we should be able to trust hosts in cluster.addresses by default (or, if necessary, add a separate cluster.nice-trustworthy-load-balancer-address-blocks option), and mostly get things right without additional work once that's configured.
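
Concretely, the Phabricator side of this mostly reduces to a couple of settings today; a sketch, assuming the existing cluster.addresses and security.require-https options (the address block is an example value):

# Describe which address blocks contain cluster hosts and load balancers.
$ ./bin/config set cluster.addresses '["10.0.0.0/16"]'

# Per the example above: required when SSL terminates at the load balancer and
# intracluster requests travel over plain HTTP.
$ ./bin/config set security.require-https false

Trusting X-Forwarded-For / X-Forwarded-Proto still requires the custom header fiddling described above; the proposal is to make that automatic for hosts in cluster.addresses.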


Notification Server: The biggest blocker for putting multiple hosts behind the secure.phabricator.com ELB is that the ELB also routes notification traffic, and the notification server does not currently support any kind of clustering.

I could resolve this by running it through a different ELB (as we do in the production cluster), or by just forwarding port 22280 on secure002 directly to port 22280 on secure001 (as at least one other install does in T6915), but I suspect a more ambitious pathway through T6915 / T10697 may not be particularly long, and this is work we need to do eventually anyway.
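
For reference, the naive port-forwarding stopgap is a one-liner on secure002; a sketch using socat (any TCP relay works equally well):

# Relay notification traffic to the single notification server on secure001.
$ socat TCP-LISTEN:22280,fork,reuseaddr TCP:secure001.phacility.net:22280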

I currently plan to give this some more thought since I'm not sure I have a clustering plan I'm completely happy with yet, but I'm leaning toward pursuing this in the short term.


Goal | Ref | Time | Notes
Daemons | D15682 | 0.5 Hours | Let PullLocal daemon launch anywhere.
Repositories | D15683, D15685 | 0.5 Hours | Surface cluster status in UI.
Repositories | D15688 | 1 Hour | Groundwork for repository synchronization.
Daemons | D15689 | 0.5 Hours | Documentation consolidation.
Production | T10784 | 1 Hour | Begin putting changes into production.
Subtotal | | 3.5 Hours |
Cumulative Total | | 12 Hours |

Given current progress and velocity, I'm aiming for a timeline roughly like this:

  • This week (April 16): make all core services on this host (secure) highly available.
  • Next week (April 23): maybe ready for third-party use some time around here? This may be a little ambitious.

I slightly reduced the amount of header magic, and the notification stuff is now cluster-ready (in theory, at least). I'm planning to get notifications, web nodes, and databases redundant in production today, and maybe repositories/daemons if nothing else comes up. No issues with repository bookkeeping so far, but even if it's smooth sailing it still needs a fair bit of UI work, and the daemons do too.
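
In theory, pointing an install at clustered notification servers is now just configuration. A rough sketch of the shape of notification.servers; the exact keys are whatever the notification documentation ends up specifying, and the hosts/ports here are examples:

$ ./bin/config set notification.servers '[
  {
    "type": "client",
    "host": "secure.phabricator.com",
    "port": 22280,
    "protocol": "https"
  },
  {
    "type": "admin",
    "host": "127.0.0.1",
    "port": 22281,
    "protocol": "http"
  }
]'

Here, the "client" entry is what browsers connect to, and the "admin" entry is what the web tier uses to publish notifications.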


Goal | Ref | Time | Notes
Config | D15696 | 0.5 Hours | Make headers less magical.
Notifications | D15698, D15700, D15701, D15702, D15703, D15705, D15708, D15709 | (Outside Scope) | Notifications quality of life changes.
Notifications | D15711 | 1.5 Hours | Probably support notification server clustering?
Subtotal | | 2 Hours |
Cumulative Total | | 14 Hours |

A big chunk of this is finally in production and seems to be working, which I'm thrilled about.


Now In Production

Web Servers: This service (secure.phabricator.com) is now backed by two redundant web nodes in different AWS availability zones. You can visit the load balancer status page and reload it a few times to hit the different hosts.

Databases: We now run a primary on secure001 and a replica on secure002:

Screen Shot 2016-04-14 at 4.58.20 PM.png (264×630 px, 28 KB)

I tested failover by shutting down the master; Phabricator failed over to read-only mode correctly, and recovered cleanly when I restored service.
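
For these failover tests, the MySQL-side view of replica health is also worth checking before and after pulling the plug; a quick sketch:

$ mysql -h secure002.phacility.net -u root -p -e 'SHOW SLAVE STATUS\G' | \
    grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'

The Phabricator-side view of the same information is the cluster database status console shown above.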

Daemons: Both hosts are running daemons. The UI needs some updates to make this more clear and there's probably some remaining work here, but I killed the daemons on one host and the obvious things still worked properly.

Notifications: Both hosts are running read/write notification servers:

Screen Shot 2016-04-14 at 5.00.57 PM.png (230×725 px, 35 KB)

This appears to be working properly.

SSH: We got this one for free since I built it in Phacility a long time ago, but SSH is being proxied by both boxes to the underlying repository service.


Remaining Work

Repositories: These aren't redundant yet and we'd currently lose them if secure001 exploded.

Updated Daemon UI: The current UI has a bunch of issues and does not make it clear where daemons are running.

SSH: This works in the Phacility cluster, but is wholly unrealistic for third parties to configure today and has zero documentation. We also run a modified sshd, but shouldn't require installs to.


Goal | Ref | Time | Notes
Databases | D15714, D15716, D15717 | 0.5 Hours | Improve bin/storage awareness of replicas.
Production | T10784 | 3 Hours | Make web, database, daemons and notifications redundant on this host.
Subtotal | | 3.5 Hours |
Cumulative Total | | 15.5 Hours |

Just some cleanup today. New daemon console now shows which hosts daemons are running on:

Screen Shot 2016-04-15 at 4.36.25 PM.png (299×805 px, 68 KB)

The setup for secure has been stable so far.


Goal | Ref | Time | Notes
Daemons | D15724 | 0.5 Hours | Improve daemon console UI for multi-host setups.
Subtotal | | 0.5 Hours |
Cumulative Total | | 16 Hours |

secure is now running repositories in fully redundant multi-master mode. This appears to be working, although it has only been live for about 20 minutes and survived about three pushes so far, so I'm not yet brimming with confidence about it.

If this holds, I think that's about the end of the "hard" stuff. Still plenty of usability / documentation / cleanup / support / performance work remaining (for example, multi-master repositories are currently almost impossible to configure).


Goal | Ref | Time | Notes
Repositories | D15747, D15748, D15759 | 1 Hour | Repository bookkeeping improvements.
Repositories | D15751, D15752, D15754, D15755, D15757, D15758, D15761 | 4 Hours | Support multi-master repositories.
Subtotal | | 5 Hours |
Cumulative Total | | 21 Hours |

Mixed progress. Repository clustering has been stable in production since the last update, and appears to actually work, which is good. We've hit a few minor snags (like a missing sync-before-read in diffusion.querycommits) but all pretty much expected stuff and nothing troubling.

However, progress on making repository clustering plausible to deploy, configure, and administrate has been slow. There's a lot of ground to cover: Diffusion is already hard to configure, and clustering makes everything harder. I've made some progress that I'm happy with, and I've also tried to reduce and simplify the existing complexity (particularly the sudo/ssh situation, which is especially convoluted), but without much success so far.

Beyond configuration, there's an issue of how to choose which cluster service to host a repository on when creating it. This overlaps heavily with T10748 since EditEngine + custom forms seem like they're probably the best solution, although the current rule ("if there's exactly one, choose that") is probably fine in the short term since I don't expect installs to be deploying multiple clusters for some time.

We had an unrelated production incident this morning that I'm still cleaning up today, and I want to focus on tying up loose ends tomorrow before the release cut, so this probably won't make too much more headway this week.

Here's the general state of the world I expect for this release:

  • Database Replication: Deploy freely. There are rough edges, but deploying this should be reasonable today and should strictly put installs in a more available state. The documentation is up to date (see Clustering: Databases).
  • Aphlict: Deploy cautiously. I think this still has a couple of wrinkles that need to be worked out but it is straightforward to configure and seems to be working properly for the most part.
  • Repositories: Do not deploy. They work but are impossible to configure.
  • Daemons: Must wait for repositories.
  • Web: Must wait for repositories.
  • SSH: Must wait for repositories.

Goal | Ref | Time | Notes
Repositories | D15766, D15768 | 1 Hour | Provide recovery documentation and tooling.
Repositories | D15778, D15772, D15765, D15764, D15763 | 1 Hour | Better monitoring, more documentation.
Web | D15775 | 0.5 Hours | Prevent CDN cache poisoning during deployments.
Subtotal | | 2.5 Hours |
Cumulative Total | | 23.5 Hours |

I'm still not really thrilled with the state of the repository documentation, but it's getting a little more manageable.

Handling of potential link disruption during writes (discussed in T10860) should be better now, too. I think it should now be fairly hard to freeze repositories outside of an actual disaster which legitimately puts data at risk.
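
If a repository does freeze after a genuinely ambiguous write, recovery is an explicit administrative action; a sketch, assuming the bin/repository thaw tooling from the recovery work above (the repository monogram is a placeholder):

# Declare that the copy on secure001 is authoritative; other devices will
# synchronize from it.
$ ./bin/repository thaw R123 --promote secure001.phacility.net

# Or, discard the copy on a device which may have missed a write.
$ ./bin/repository thaw R123 --demote secure002.phacility.net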

The ssh link got a lot more chatty:

$ git push
...
# Push received by "secure001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "secure002.phacility.net"...
# Acquired read lock immediately.
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "secure002.phacility.net".
Counting objects: 13, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (13/13), done.
Writing objects: 100% (13/13), 4.77 KiB | 0 bytes/s, done.
Total 13 (delta 11), reused 0 (delta 0)
# Released cluster write lock.
...

This is probably over-tuned a bit, but should make the operation more obvious and hopefully help debug any issues which do arise. We can quiet it down later.


Goal | Ref | Time | Notes
Repositories | D15783, D15786 | 0.5 Hours | Minor repository sync fixes.
Documentation | D15787, D15778, D15788, D15794, D15795, D15796, D15798 | 2 Hours | Correct and improve documentation.
Repositories | D15789, D15790, D15791, D15792 | 2 Hours | Make repositories chatty and more freeze-resistant when network links are interrupted during a push.
Subtotal | | 4.5 Hours |
Cumulative Total | | 28 Hours |

T10748, which provides the new cluster management UI for repositories, is nearing completion. This makes cluster setup more manageable and at least provides a pathway forward for Almanac stuff in the future, although it does not yet make Almanac / service management explicitly selectable when creating a repository (for now, new repositories allocate on a random open service).

Most of what remains there is UI/UX and documentation; I expect to complete that, then return here and finish the repository/web/SSH documentation, then consolidate where things are today and plan where it makes the most sense to go next.

The last round of repository changes have been stable in production here on secure, and deployed to the Phacility cluster last week. The cluster is not running HA repository services yet (repository services are still single-node clusters) but it is hitting about 95% of the code (all the locking/versioning/proxying/etc).

T10748 is wrapping up, and repositories now show cluster status in the "Storage" panel of the new UI:

Screen Shot 2016-05-11 at 4.39.13 PM.png (989×1 px, 136 KB)

(The "Last Writer" column is a bug which I'll fix shortly.)

Per above, I'm expecting to complete the remaining documentation which was waiting on this next, then do some PM'ish stuff to consolidate the state of the world so there's a clearer picture of exactly where things stand and what's expected to work now vs not work yet.

I think this is the current state of the world. I'm planning to tackle the three major issues noted below (observed repository versioning, repository lock granularity, device enrollment behavior) and then try to wind down this phase, move all this followup work to a new "make clustering more powerful" sort of task, and then plan where to go from there based on which goals are most important (scalability vs multi-region vs read-only vs hg/svn vs drydock vs whatever crops up as installs actually deploy this stuff).

From this point, future work has few interconnections and can be prioritized and pursued mostly independently.


General

Today, cluster behavior has a greater emphasis on multi-tenancy and availability/loss-resistance than it does on scalability and multi-region deployments. A mixture of general changes and service-specific changes could improve its suitability for these tasks in the future.

Databases

The database read-only failover mode still has a number of limitations, some of which may be fairly severe. Improving this mode could reduce the disruption associated with losing masters.

Database clusters only provide redundancy and availability today: no traffic is sent to the replica unless the master is unreachable. In the future, we could send some reads to the replica during normal operation. This likely has the greatest impact for open source installs with a large public userbase.

Database replicas must be promoted manually. I have no current plans to attempt automatic promotion because of the mortal danger that mistakes represent.

All service types other than databases fail over automatically today.

Repositories

Hosted Git repositories are in good shape, but other types of repositories still have limited or nonexistent support.

In the future, better administrative tools and smarter connection management could improve the behavior of multi-region clusters.

  • (Major) Observed repositories (vs hosted) do not version correctly.
  • (Future) No explicit management of which cluster service repositories allocate on.
  • (Future) No Mercurial support.
  • (Future) No Subversion support.
  • (Future) T10883: Allow repository cluster nodes to be read-only

Daemons and CLI

Some daemon and administrative behaviors still have rough edges.

Notifications

Notification HTTP / HTTPS could be made easier to configure.

Drydock

Drydock could be significantly better at detecting and recovering from broken resources and lost hosts than it is today. In particular, if you lose connectivity to a datacenter you're probably looking at some degree of manually pulling bad hosts out of the pool.

Two of the three issues discussed above (lock granularity, severity of device enrollment) got tackled. I think I have an attack on versioning observed repositories, but I'm not going to try to squeeze it in before the release cut today.

Observed repositories should now version in a reasonable way.

I've created followups to track the other points above that didn't previously have dedicated tracking tasks:

This also wasn't explicitly covered above, but is effectively a followup:

I'm going to pause this for feedback since we have no more planned direct action here.

(We are pursuing application partitioning (T11044) and improved drydock recovery (T8153) outside of this, but I'll track those separately since they're only tangentially related to this.)


Goal | Ref | Time | Notes
Repositories | D15986 | 1 Hour | Version observed repositories in a reasonable way.
Subtotal | | 1 Hour |
Cumulative Total | | 29 Hours |
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Jun 6 2016, 4:27 PM

This spawned a healthy set of followups, but I believe the core work described here is now resolved and accounted for and all future work is covered in narrower followup tasks.

urzds added a subscriber: urzds.