Evaluate performance impact of performing MySQL dump/restore in parallel
Open, Wishlist, Public

Description

I've been doing a whole lot of dump/restore recently. These operations take forever and are very obviously not limited by disk I/O, at least on the external side.

The dump process is already table-by-table, so it would be interesting to see whether parallelizing these dumps improves performance: instead of serially dumping each table, dump N tables simultaneously.

Loading could possibly work the same way, where we retain file-per-table dumps and load them in parallel.

There's some complexity involved in stitching the pieces together, but figuring out if this helps or not should be straightforward.
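A minimal sketch of the "dump N tables simultaneously" idea, for experimentation only. It assumes file-per-table output and reuses the mysqldump flags quoted later in this task. The function names, the output file naming scheme, and the `run` parameter (injectable so the scheduling can be exercised without a database) are hypothetical and are not part of `bin/storage`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_dump_command(database, table):
    """Build a mysqldump argv for one table (flags copied from this task)."""
    return [
        "mysqldump",
        "--hex-blob",
        "--single-transaction",
        "--default-character-set", "utf8mb4",
        "-u", "root",
        "-h", "127.0.0.1",
        "--max-allowed-packet", "1G",
        database,
        table,
    ]


def dump_table(database, table, out_dir, run=subprocess.run):
    """Dump one table to <out_dir>/<database>.<table>.sql (hypothetical naming)."""
    out_path = Path(out_dir) / f"{database}.{table}.sql"
    with open(out_path, "wb") as out:
        run(build_dump_command(database, table), stdout=out, check=True)
    return out_path


def dump_tables_parallel(tables, out_dir, workers=4, run=subprocess.run):
    """Dump (database, table) pairs with at most `workers` dumps in flight."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(dump_table, db, tbl, out_dir, run) for db, tbl in tables
        ]
        # Results come back in input order; a failed dump raises here.
        return [future.result() for future in futures]
```

One caveat with this shape: each mysqldump process opens its own transaction, so `--single-transaction` only gives a consistent snapshot per file, not across the whole set of files.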

Event Timeline

epriestley triaged this task as Wishlist priority. Aug 31 2019, 4:04 PM
epriestley created this task.

Anecdata: locally, using 2 subprocesses went twice as fast (~85s -> ~42s). 4 subprocesses chopped another ~20% of the time off (~42s -> ~35s). It stopped getting faster at 4. However, the largest table took 24s, so even if this were completely parallelizable we wouldn't expect it to drop lower than that.
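The floor described above can be made concrete with a rough bound: with N workers, wall time can't be lower than the longest single table, and can't be lower than the total serial work split evenly N ways. The sketch below plugs in the numbers from this comment (~85s serial, 24s largest table). It assumes ideal scheduling and no contention on the MySQL server, so it is an optimistic estimate rather than a prediction, and the function name is made up for illustration.

```python
def wall_time_lower_bound(total_seconds, longest_seconds, workers):
    """Idealized floor on wall time for `workers` parallel dumps.

    Assumes perfect load balancing and no server-side contention.
    """
    return max(longest_seconds, total_seconds / workers)


# Numbers from this task: ~85s serial total, 24s for the largest table.
for workers in (1, 2, 4):
    bound = wall_time_lower_bound(85.0, 24.0, workers)
    print(f"{workers} worker(s): at least {bound:.1f}s")
```

Under these assumptions the bound for 2 workers is 42.5s, which matches the observed ~42s, and the bound for 4 workers is the 24s table, so the observed ~35s still leaves some room above the floor.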

One catch here is that mysqldump takes about 70ms to start up, and it's not surprising that this part is parallelizable. We currently have 574 tables so this accounts for ~40s of runtime, which is about the size of the initial improvement. So this may be more of a flat 40-second performance improvement than a large relative improvement.
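The arithmetic behind the ~40s figure, as a small sketch. The table count and startup time are the numbers above. The assumption that startup cost divides evenly across workers is mine and is probably optimistic.

```python
TABLE_COUNT = 574        # tables, from this task
STARTUP_SECONDS = 0.070  # approximate mysqldump startup time, from this task


def startup_overhead(table_count, startup_seconds, workers=1):
    """Wall time spent starting mysqldump, assuming it splits evenly across workers."""
    return table_count * startup_seconds / workers


serial = startup_overhead(TABLE_COUNT, STARTUP_SECONDS)
two_workers = startup_overhead(TABLE_COUNT, STARTUP_SECONDS, workers=2)
print(f"serial startup overhead:   {serial:.2f}s")
print(f"with 2 workers (idealized): {two_workers:.2f}s")
```

Serially this is 574 × 0.070s = 40.18s of pure process startup, which is roughly half of the ~85s serial run.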

This:

./bin/storage databases | xargs time mysqldump '--hex-blob' '--single-transaction' '--default-character-set' 'utf8mb4' '-u' 'root' '-h' '127.0.0.1' '--max-allowed-packet' '1G' --databases -- > /dev/null

...comes out at 40s, even with a purged cache.

This is still sort of inconclusive because the dataset is dominated by one unusually large table (daemon.log_event). It looks like parallelizing things is at least an absolute 40s faster, and possibly also a relative 10-20% faster, but the relative difference isn't dramatic on this dataset.