Support partitioning application databases across multiple database hosts
Closed, Resolved · Public

Description

Phabricator conceptually supports putting application databases on different hosts: for example, separating applications with high load (maniphest, worker) or unusual access/usage/workload patterns (file) onto different machines to improve scalability or make administration easier.

In particular, we have planned for this since ancient times and already establish separate connections to each database and do not perform JOIN operations between databases, so the majority of database behaviors should already handle this change seamlessly. The cluster.databases configuration can also support selecting which databases are available on hosts.
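
To illustrate the existing pattern (a simplified sketch using Phabricator's standard query helpers, not code from any particular patch): data that spans applications is loaded with separate queries on separate per-database connections, never with a cross-database JOIN, so a logical database can move to another host without changing query structure. Roughly:

// Sketch: load open tasks from the Maniphest database, then load their
// authors from the People database over its own connection; no JOIN ever
// crosses the two logical databases. (Assumes at least one open task.)
$task_table = new ManiphestTask();
$tasks = queryfx_all(
  $task_table->establishConnection('r'),
  'SELECT * FROM %T WHERE status = %s',
  $task_table->getTableName(),
  'open');

$author_phids = array_unique(ipull($tasks, 'authorPHID'));

$user_table = new PhabricatorUser();
$authors = queryfx_all(
  $user_table->establishConnection('r'),
  'SELECT * FROM %T WHERE phid IN (%Ls)',
  $user_table->getTableName(),
  $author_phids);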


What will need changes are administrative tools. In particular:

  • .php patches in autopatches/ will just work, but .sql autopatches do not currently have the metadata they need to identify which database to connect to. In the simplest approach, we need to start adding comment headers to them so each patch can declare which database it targets.
  • quickstart.sql is a complete mess.
  • The behavior of bin/storage dump is somewhat unclear/ambiguous (probably: dump a specific master).
  • Perhaps bin/storage destroy should also destroy a specific master? This could reasonably go either way.
  • bin/storage adjust and "Database Status" need to be database/host aware.
  • Database setup checks likely need to be database/host aware.
  • This is a good opportunity to remove mysql.implementation and mysql.configuration-provider. The former we can infer and the latter will be obsolete.

A possible solution to the quickstart and migration problems is that we just maintain every schema on every database, and start tracking patches per-master. At first glance this is a little goofy/weird, but it would possibly make patch management much easier, particularly when you decide to move databases.

Event Timeline

After a little thought, I'm leaning strongly toward putting every schema on every database. I think this makes a lot of the administration, maintenance, and implementation easier.

It also gives us a little more flexibility to do things like shard caches so they're writable on multiple masters and let the Eurasia web nodes interact with a local cache, and leaves the door open to more types of real sharding later. Although we have many application design constraints which assume we will never shard, it is not outside the realm of imagination that certain specific types of well-behaved, high-volume, mostly-accessed-by-ID data (like Nuance items, file metadata, and file raw data) might eventually be sharded. We can shard patch management data today (which shards naturally, since all of the data is relevant only to the local shard) to simplify administration, consider sharding cache data in the future (which shards easily), and move from there to other sharding as needs arise.

Thus, I think implementation looks like this:

  • Allow cluster.databases to define multiple masters. Initially, only the first master is ever used. (A configuration sketch follows this list.)
  • Make bin/storage and other administrative tools and UIs multi-master aware. In particular:
    • bin/storage needs to apply patches to every master.
    • "Database Status" and "Database Issues" need to check all masters and identify problems on specific masters.
    • "Database Servers" needs to examine all masters (likely trivial).

Once this is done, you'll be able to configure multiple masters and bin/storage will maintain the schemata on all of them properly, but the web UI will never send any normal reads or writes to the 2nd..Nth masters so they'll always just have an empty copy of every table.

Then we can introduce an applications property to let you specify "applications": {"files": true} or similar and move the "files" database to the second master (a configuration sketch follows the list below). I think dealing with this at runtime should be pretty easy, but we likely need some new administration/support tools:

  • Identify when tables that should not have data on a given master have data?
  • Provide flags to bin/storage dump to make it easier to partition an application?
  • Write documentation about partitioning.
  • Maaaaaaybe let you put an individual application in read-only mode while performing a partition? At first glance, this feels like an extreme version of T10769 that no one would ever test properly. Dump + load downtime isn't generally that bad and partitioning should be exceedingly rare, so I'm not immediately inclined to try to pursue this. That said, it's not particularly difficult to pursue in theory.
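
As a sketch of that configuration shape (the key name is tentative here; the configuration that eventually shipped, shown at the end of this task, uses a partition key), the second master's entry might gain something like this, with the first master picking up a "default" partition for everything else:

array(
  'role' => 'master',
  'host' => 'db002.example.com',
  'port' => 3306,
  'user' => 'phabricator',
  // Hypothetical: this master owns only the Files application's data.
  'partition' => array(
    'file',
  ),
),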

D16115 + D16847 + D16848 make these changes:

  • cluster.databases may now have multiple "master" entries.
  • Connections are now established per-application, not globally.
  • The bin/storage tools are all aware of all configured database hosts.
  • The "Database Status" and "Database Issues" sections of Config are now aware of all configured masters.

This stuff remains to be done:

  • Currently, only the first master is ever selected for any read or write connection.
  • (That is, it is impossible to logically locate application databases on a different master.)

I'm going to make sure all the "multiple masters is legal" stuff works for a bit, then pursue moving logical databases to different hosts.

General state of the world here:

  • D16876, which supports partitioning, is landing momentarily.
  • Don't use this in a production environment yet unless you are extremely ambitious or your production is so utterly on fire that replacing a large fire with a series of small fires would be a net improvement.
  • If you do use it, maybe start with partitioning a low-value application like Slowvote.
  • I think this is very unlikely to do anything legitimately bad, but may have a few remaining bugs in the vein of T11893 where stuff goes over the wrong connection in unusual situations.
    • The major bad case here is probably writes somehow happening to multiple masters, which would leave you with two T123 tasks or whatever and no easy way to merge them. Although this sort of "data surplus" issue is probably less bad than actual data loss, it will probably cause just as many headaches.

I'm planning to do these things next:

  • Make the setup checks run against all configured database hosts (T10759).
  • Add a bin/storage okay-i-am-ready-you-can-use-these-partitions sort of command, which writes a copy of your current partition configuration to each database, storing what it is supposed to be doing in a configuration table on the database host.
  • When Phabricator starts up, I'll make it fatal if the local partition configuration and database partition configuration differ for any database (e.g., you have told it to write maniphest data to a database which does not think it should be receiving maniphest data). This should prevent hard-to-merge writes if someone accidentally brings up a web node with weird/out-of-date config.
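
As a very rough sketch of that startup check (every name below is a hypothetical placeholder, not the actual implementation), the idea is:

// Hypothetical sketch: compare the partition map in local configuration
// against the copy previously written to each master, and refuse to start
// on any mismatch so a stale web node can't send writes to the wrong host.
foreach ($masters as $master) {
  $expected = $master->getPartitionMapFromLocalConfig();
  $actual = $master->loadPartitionMapFromDatabase();
  if ($expected !== $actual) {
    throw new Exception(
      pht(
        'Partition configuration for database host "%s" disagrees with the '.
        'configuration stored on the host itself; refusing to start.',
        $master->getName()));
  }
}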

Then I'm going to partition something on secure, probably files. This will give us a 4-node cluster doing a tiny amount of actual work, but whatever. After that's in production for a week or so without anything catching on fire I'll be more confident that this is generally usable.

Other/future stuff:

  • I'm going to add a pool flag or similar to each cluster.databases entry to support connection pooling (introduced in T11672, but disabled until this change completes); a sketch follows this list. You can enable this to reduce outbound port pressure on active web nodes and slightly improve performance, at the cost of needing to raise max_connections and not being able to run on a Raspberry Pi.
  • I'm going to plan what we need to do to support an overlay flag, which overlays logical connections to different databases which happen to go to the same physical database host onto the same physical connection. The major technical blocker for this is that all references to tables in all queries in the codebase need to explicitly incorporate references to the containing database. In the future, you'll be able to enable this to reduce the total number of connections and slightly improve performance. I may default it to "on" in the future; if I do, I will probably also default pooling to "on". The major reason to have this "off" was so that we didn't write any bad cross-application joins in the last 5 years which would have made this partitioning change a huge problem.
  • I'm going to remove the mysql.implementation option. The new behavior will be: MySQLi if installed; MySQL otherwise; fatal if neither work. If anyone has real use cases for something else, we can easily modularize this fully.
  • I'm going to remove the mysql.configuration-provider option. I believe this option hasn't done anything useful for years. Yell if you still rely on it and I can phase it out more gradually. It doesn't need to get removed to unblock anything here, I just believe this set of changes completely obsoletes everything it was ever intended to support.
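
Sketch of the kind of per-host flag described in the first item above (the key name "pool" is hypothetical and tentative):

array(
  'role' => 'master',
  'host' => 'db001.example.com',
  'port' => 3306,
  'user' => 'phabricator',
  // Hypothetical flag: opt this host into pooled/persistent connections.
  'pool' => true,
),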

We have at least one tricky issue remaining: when applying storage upgrades, we currently apply them like this:

// Old approach: bring each master fully up to date before moving on to the
// next master.
foreach ($masters as $master) {
  foreach ($migrations as $migration) {
    apply_migration($master, $migration);
  }
}

This won't work any more, because migration N may be a PHP migration which requires schema migration N-1 to be in force on all masters already. The list of migrations may be arbitrarily long (installs may wait years between upgrades) so we can't reasonably avoid this by just writing migrations carefully.

This needs to be rewritten to work like this:

// New approach: apply each migration to every master before moving on to the
// next migration.
foreach ($migrations as $migration) {
  foreach ($masters as $master) {
    apply_migration($master, $migration);
  }
}

...which is not that bad, but not a trivial sort of documentation/setup thing like all the other stuff above.

This does sort of let us improve one thing: currently, PHP migrations need to be idempotent because we'll apply them to each master. That's fine (they should be idempotent for other reasons anyway, and we try to write them that way), but we can make this loop apply each PHP migration exactly once and then record the patch as applied on all masters. Then, if we ever have a non-idempotent migration for some reason, we'll survive it without damage.
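
In the same pseudocode style as above, that refinement might look roughly like this (is_php_migration(), apply_php_migration(), and mark_migration_applied() are placeholders):

foreach ($migrations as $migration) {
  if (is_php_migration($migration)) {
    // Execute the PHP migration exactly once; its writes go through normal
    // application connections, so they route to the correct hosts.
    apply_php_migration($migration);
    // Then record the patch as applied on every master.
    foreach ($masters as $master) {
      mark_migration_applied($master, $migration);
    }
  } else {
    // SQL migrations still run against every master, since every master
    // maintains a full copy of every schema.
    foreach ($masters as $master) {
      apply_migration($master, $migration);
    }
  }
}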

  • T11908 is a followup for executing queries for multiple applications on a single connection. I believe the pathway for that is straightforward and fairly short, but that no install would really reap substantial benefits from it today, so I don't expect to pursue it for some time.
  • I believe everything else is now complete, so I'll put this in production as soon as everything here lands and we can see what catches on fire.

I'm partitioning secure.phabricator.com now. Things will drop into read-only mode for a bit.

Here's what I've done so far:

Old state was: 001 is master, 002 is replica.

  • Upgraded existing hosts to HEAD.
  • Deployed a new configuration, making these changes:
    • Added a "default" partition to 001.
    • Added new "master" config to 002, pointing at 001.
    • Added 003, with an empty partition.
  • Deployed changes.
  • Ran bin/storage upgrade (hit some issues here).
  • Ran bin/storage partition.
  • Restarted webserver.

I then verified that everything still worked normally. Phabricator now knows about secure003, and secure003 has the schema, but no traffic is ever sent there.

Next, to actually move the application:

  • Put secure001 and secure002 in cluster.read-only mode.
  • Stopped the daemons.
  • Dumped the file database (here, secure_file; by default, phabricator_file):
$ mysqldump --hex-blob --single-transaction --databases secure_file | gzip > files.sql.gz
  • Loaded the database onto secure003.
  • Added file to the "partition" configuration for secure003.
  • Ran bin/storage partition.
  • Restarted webserver.

We're still in read-only mode. Verified that Files connections go to secure003 and the server is receiving traffic.

  • Removed cluster.read-only.
  • Restarted webserver.

Now we're operating normally with the Files application partitioned.

  • Wrote new files, verified they went to secure003.

Next, I'll configure 003 to replicate to 004, but I'm going to fix the upgrade bug I hit first.

Okay, we're headed back into readonly mode shortly to set up replication. I'm going to verify D16916 along the way so there may be some "partitions disagree about life" errors.

I configured 003 to replicate to 004:

  • Back in readonly mode.
  • bin/storage dump --for-replica --compress --output --host ... on 003.
  • Load that onto 004.
  • Configure 004 so it's replicating properly.
  • Add 004 to replication config in cluster.databases.
  • bin/storage partition
  • Out of readonly mode.
  • Restart webserver + daemons.

Other errata:

  • New hosts needed Almanac entries before PullLocal was happy about starting.
  • Set up backups on 003, but our backup tooling runs bin/storage dump and needs to become aware that it should only dump the current host. This is specific to our deployment and doesn't apply to installs in general.

This seems to be working now. I'm going to let it sit in production for a while and see if any issues crop up before considering it resolved, but it seems like everything is working smoothly.

Here's the final production configuration:

'cluster.databases' => array(
  array(
    'role' => 'master',
    'host' => 'secure001.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'partition' => array(
      'default',
    ),
  ),
  array(
    'role' => 'replica',
    'host' => 'secure002.phacility.net',
    'master' => 'secure001.phacility.net:3306',
    'port' => 3306,
    'user' => 'secure',
  ),
  array(
    'role' => 'master',
    'host' => 'secure003.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'partition' => array(
      'file',
    ),
  ),
  array(
    'role' => 'replica',
    'host' => 'secure004.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'master' => 'secure003.phacility.net:3306',
  ),
),

epriestley claimed this task.

This has been running cleanly in production for roughly two weeks, and appears stable. secure001 stopped writing Files data at F1943597 and we're now at F2078289 on secure003. We saw a couple of minor setup issues (mostly: exception messages not being tailored enough) but no fundamental issues.

T11908 is a followup for an "overlay" connection mode which is possible now that this change has been made.