Support partitioning application databases across multiple database hosts
Closed, Resolved · Public

Description

Phabricator conceptually supports putting application databases on different hosts: for example, separating applications with high load (maniphest, worker) or unusual access/usage/workload patterns (file) onto different machines to improve scalability or make administration easier.

In particular, we have planned for this since ancient times and already establish separate connections to each database and do not perform JOIN operations between databases, so the majority of database behaviors should already handle this change seamlessly. The cluster.databases configuration can also support selecting which databases are available on hosts.
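
To illustrate the existing pattern (a simplified sketch using Phabricator's standard query helpers, not code from any particular patch): data that spans applications is loaded with separate queries on separate per-database connections, never with a cross-database JOIN, so a logical database can move to another host without changing query structure. Roughly:

// Sketch: load open tasks from the Maniphest database, then load their
// authors from the People database over its own connection; no JOIN ever
// crosses the two logical databases. (Assumes at least one open task.)
$task_table = new ManiphestTask();
$tasks = queryfx_all(
  $task_table->establishConnection('r'),
  'SELECT * FROM %T WHERE status = %s',
  $task_table->getTableName(),
  'open');

$author_phids = array_unique(ipull($tasks, 'authorPHID'));

$user_table = new PhabricatorUser();
$authors = queryfx_all(
  $user_table->establishConnection('r'),
  'SELECT * FROM %T WHERE phid IN (%Ls)',
  $user_table->getTableName(),
  $author_phids);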


What will need changes are administrative tools. In particular:

  • .php patches in autopatches/ will just work, but .sql autopatches do not currently have the metadata they need to identify which database to connect to. In the simplest approach, we need to start adding comment headers to them so each patch can declare which database it targets.
  • quickstart.sql is a complete mess.
  • The behavior of bin/storage dump is somewhat unclear/ambiguous (probably: dump a specific master).
  • Perhaps bin/storage destroy should also destroy a specific master? This could reasonably go either way.
  • bin/storage adjust and "Database Status" need to be database/host aware.
  • Database setup checks likely need to be database/host aware.
  • This is a good opportunity to remove mysql.implementation and mysql.configuration-provider. The former we can infer and the latter will be obsolete.

A possible solution to the quickstart and migration problems is that we just maintain every schema on every database, and start tracking patches per-master. At first glance this is a little goofy/weird, but it would possibly make patch management much easier, particularly when you decide to move databases.

Event Timeline

After a little thought, I'm leaning strongly toward putting every schema on every database. I think this makes a lot of the administration, maintenance, and implementation easier.

It also gives us a little more flexibility to do things like shard caches so they're writable on multiple masters and let the Eurasia web nodes interact with a local cache, and leaves the door open to more types of real sharding later. Although we have many application design constraints which assume we will never shard, it is not outside the realm of imagination that certain specific types of well-behaved, high-volume, mostly-accessed-by-ID data (like Nuance items, file metadata, and file raw data) might eventually be sharded. We can shard patch management data today (which shards naturally, since all of the data is relevant only to the local shard) to simplify administration, consider sharding cache data in the future (which shards easily), and move from there to other sharding as needs arise.

Thus, I think implementation looks like this:

  • Allow cluster.databases to define multiple masters. Initially, only the first master is ever used. (A configuration sketch follows this list.)
  • Make bin/storage and other administrative tools and UIs multi-master aware. In particular:
    • bin/storage needs to apply patches to every master.
    • "Database Status" and "Database Issues" need to check all masters and identify problems on specific masters.
    • "Database Servers" needs to examine all masters (likely trivial).

Once this is done, you'll be able to configure multiple masters and bin/storage will maintain the schemata on all of them properly, but the web UI will never send any normal reads or writes to the 2nd..Nth masters so they'll always just have an empty copy of every table.

Then we can introduce an applications property to let you specify "applications": {"files": true} or similar and move the "files" database to the second master (a configuration sketch follows the list below). I think dealing with this at runtime should be pretty easy, but we likely need some new administration/support tools:

  • Identify when tables that should not have data on a given master have data?
  • Provide flags to bin/storage dump to make it easier to partition an application?
  • Write documentation about partitioning.
  • Maaaaaaybe let you put an individual application in read-only mode while performing a partition? At first glance, this feels like an extreme version of T10769 that no one would ever test properly. Dump + load downtime isn't generally that bad and partitioning should be exceedingly rare, so I'm not immediately inclined to try to pursue this. That said, it's not particularly difficult to pursue in theory.
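
As a sketch of that configuration shape (the key name is tentative here; the configuration that eventually shipped, shown at the end of this task, uses a partition key), the second master's entry might gain something like this, with the first master picking up a "default" partition for everything else:

array(
  'role' => 'master',
  'host' => 'db002.example.com',
  'port' => 3306,
  'user' => 'phabricator',
  // Hypothetical: this master owns only the Files application's data.
  'partition' => array(
    'file',
  ),
),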

D16115 + D16847 + D16848 make these changes:

  • cluster.databases may now have multiple "master" entries.
  • Connections are now established per-application, not globally.
  • The bin/storage tools are all aware of all configured database hosts.
  • The "Database Status" and "Database Issues" sections of Config are now aware of all configured masters.

This stuff remains to be done:

  • Currently, only the first master is ever selected for any read or write connection.
  • (That is, it is impossible to logically locate application databases on a different master.)

I'm going to make sure all the "multiple masters is legal" stuff works for a bit, then pursue moving logical databases to different hosts.

General state of the world here:

  • D16876, which supports partitioning, is landing momentarily.
  • Don't use this in a production environment yet unless you are extremely ambitious or your production is so utterly on fire that replacing a large fire with a series of small fires would be a net improvement.
  • If you do use it, maybe start with partitioning a low-value application like Slowvote.
  • I think this is very unlikely to do anything legitimately bad, but may have a few remaining bugs in the vein of T11893 where stuff goes over the wrong connection in unusual situations.
    • The major bad case here is probably writes somehow happening to multiple masters, which would leave you with two T123 tasks or whatever and no easy way to merge them. Although this sort of "data surplus" issue is probably less bad than actual data loss, it will probably cause just as many headaches.

I'm planning to do these things next:

  • Make the setup checks run against all configured database hosts (T10759).
  • Add a bin/storage okay-i-am-ready-you-can-use-these-partitions sort of command, which writes a copy of your current partition configuration to each database, storing what it is supposed to be doing in a configuration table on the database host.
  • When Phabricator starts up, I'll make it fatal if the local partition configuration and database partition configuration differ for any database (e.g., you have told it to write maniphest data to a database which does not think it should be receiving maniphest data). This should prevent hard-to-merge writes if someone accidentally brings up a web node with weird/out-of-date config.
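
As a very rough sketch of that startup check (every name below is a hypothetical placeholder, not the actual implementation), the idea is:

// Hypothetical sketch: compare the partition map in local configuration
// against the copy previously written to each master, and refuse to start
// on any mismatch so a stale web node can't send writes to the wrong host.
foreach ($masters as $master) {
  $expected = $master->getPartitionMapFromLocalConfig();
  $actual = $master->loadPartitionMapFromDatabase();
  if ($expected !== $actual) {
    throw new Exception(
      pht(
        'Partition configuration for database host "%s" disagrees with the '.
        'configuration stored on the host itself; refusing to start.',
        $master->getName()));
  }
}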

Then I'm going to partition something on secure, probably files. This will give us a 4-node cluster doing a tiny amount of actual work, but whatever. After that's in production for a week or so without anything catching on fire I'll be more confident that this is generally usable.

Other/future stuff:

  • I'm going to add a pool flag or similar to each cluster.databases entry to support connection pooling (introduced in T11672, but disabled until this change completes); a sketch follows this list. You can enable this to reduce outbound port pressure on active web nodes and slightly improve performance, at the cost of needing to raise max_connections and not being able to run on a Raspberry Pi.
  • I'm going to plan what we need to do to support an overlay flag, which overlays logical connections to different databases which happen to go to the same physical database host onto the same physical connection. The major technical blocker for this is that all references to tables in all queries in the codebase need to explicitly incorporate references to the containing database. In the future, you'll be able to enable this to reduce the total number of connections and slightly improve performance. I may default it to "on" in the future; if I do, I will probably also default pooling to "on". The major reason to have this "off" was so that we didn't write any bad cross-application joins in the last 5 years which would have made this partitioning change a huge problem.
  • I'm going to remove the mysql.implementation option. The new behavior will be: MySQLi if installed; MySQL otherwise; fatal if neither work. If anyone has real use cases for something else, we can easily modularize this fully.
  • I'm going to remove the mysql.configuration-provider option. I believe this option hasn't done anything useful for years. Yell if you still rely on it and I can phase it out more gradually. It doesn't need to get removed to unblock anything here, I just believe this set of changes completely obsoletes everything it was ever intended to support.
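
Sketch of the kind of per-host flag described in the first item above (the key name "pool" is hypothetical and tentative):

array(
  'role' => 'master',
  'host' => 'db001.example.com',
  'port' => 3306,
  'user' => 'phabricator',
  // Hypothetical flag: opt this host into pooled/persistent connections.
  'pool' => true,
),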

We have at least one tricky issue remaining: when applying storage upgrades, we currently apply them like this:

// Old approach: bring each master fully up to date before moving on to the
// next master.
foreach ($masters as $master) {
  foreach ($migrations as $migration) {
    apply_migration($master, $migration);
  }
}

This won't work any more, because migration N may be a PHP migration which requires schema migration N-1 to be in force on all masters already. The list of migrations may be arbitrarily long (installs may wait years between upgrades) so we can't reasonably avoid this by just writing migrations carefully.

This needs to be rewritten to work like this:

// New approach: apply each migration to every master before moving on to the
// next migration.
foreach ($migrations as $migration) {
  foreach ($masters as $master) {
    apply_migration($master, $migration);
  }
}

...which is not that bad, but not a trivial sort of documentation/setup thing like all the other stuff above.

This does sort of let us improve one thing: currently, PHP migrations need to be idempotent because we'll apply them to each master. That's fine (they should be idempotent for other reasons anyway, and we try to write them that way), but we can make this loop apply each PHP migration exactly once and then record the patch as applied on all masters. Then, if we ever have a non-idempotent migration for some reason, we'll survive it without damage.
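
In the same pseudocode style as above, that refinement might look roughly like this (is_php_migration(), apply_php_migration(), and mark_migration_applied() are placeholders):

foreach ($migrations as $migration) {
  if (is_php_migration($migration)) {
    // Execute the PHP migration exactly once; its writes go through normal
    // application connections, so they route to the correct hosts.
    apply_php_migration($migration);
    // Then record the patch as applied on every master.
    foreach ($masters as $master) {
      mark_migration_applied($master, $migration);
    }
  } else {
    // SQL migrations still run against every master, since every master
    // maintains a full copy of every schema.
    foreach ($masters as $master) {
      apply_migration($master, $migration);
    }
  }
}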

  • T11908 is a followup for executing queries for multiple applications on a single connection. I believe the pathway for that is straightforward and fairly short, but that no install would really reap substantial benefits from it today, so I don't expect to pursue it for some time.
  • I believe everything else is now complete, so I'll put this in production as soon as everything here lands and we can see what catches on fire.

I'm partitioning secure.phabricator.com now. Things will drop into read-only mode for a bit.

Here's what I've done so far:

Old state was: 001 is master, 002 is replica.

  • Upgraded existing hosts to HEAD.
  • Deployed a new configuration, making these changes:
    • Added a "default" partition to 001.
    • Added new "master" config to 002, pointing at 001.
    • Added 003, with an empty partition.
  • Deployed changes.
  • Ran bin/storage upgrade (hit some issues here).
  • Ran bin/storage partition.
  • Restarted webserver.

I then verified that everything still worked normally. Phabricator now knows about secure003, and secure003 has the schema, but no traffic is ever sent there.

Next, to actually move the application:

  • Put secure001 and secure002 in cluster.read-only mode.
  • Stopped the daemons.
  • Dumped the file database (here, secure_file; by default, phabricator_file):
$ mysqldump --hex-blob --single-transaction --databases secure_file | gzip > files.sql.gz
  • Loaded the database onto secure003.
  • Added file to the "partition" configuration for secure003.
  • Ran bin/storage partition.
  • Restarted webserver.

We're still in read-only mode. Verified that Files connections go to secure003 and the server is receiving traffic.

  • Removed cluster.read-only.
  • Restarted webserver.

Now we're operating normally with the Files application partitioned.

  • Wrote new files, verified they went to secure003.

Next, I'll configure 003 to replicate to 004, but I'm going to fix the upgrade bug I hit first.

Okay, we're headed back into readonly mode shortly to set up replication. I'm going to verify D16916 along the way so there may be some "partitions disagree about life" errors.

I configured 003 to replicate to 004:

  • Back in readonly mode.
  • bin/storage dump --for-replica --compress --output --host ... on 003.
  • Load that onto 004.
  • Configure 004 so it's replicating properly.
  • Add 004 to replication config in cluster.databases.
  • bin/storage partition
  • Out of readonly mode.
  • Restart webserver + daemons.

Other errata:

  • New hosts needed Almanac entries before PullLocal was happy about starting.
  • Set up backups on 003, but our backup tooling runs bin/storage dump and needs to become aware that it should only dump the current host. This is specific to our deployment and doesn't apply to installs in general.

This seems to be working now. I'm going to let it sit in production for a while and see if any issues crop up before considering it resolved, but it seems like everything is working smoothly.

Here's the final production configuration:

'cluster.databases' => array(
  array(
    'role' => 'master',
    'host' => 'secure001.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'partition' => array(
      'default',
    ),
  ),
  array(
    'role' => 'replica',
    'host' => 'secure002.phacility.net',
    'master' => 'secure001.phacility.net:3306',
    'port' => 3306,
    'user' => 'secure',
  ),
  array(
    'role' => 'master',
    'host' => 'secure003.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'partition' => array(
      'file',
    ),
  ),
  array(
    'role' => 'replica',
    'host' => 'secure004.phacility.net',
    'port' => 3306,
    'user' => 'secure',
    'master' => 'secure003.phacility.net:3306',
  ),
),

epriestley claimed this task.

This has been running cleanly in production for roughly two weeks, and appears stable. secure001 stopped writing Files data at F1943597 and we're now at F2078289 on secure003. We saw a couple of minor setup issues (mostly: exception messages not being tailored enough) but no fundamental issues.

T11908 is a followup for an "overlay" connection mode which is possible now that this change has been made.