I've been doing a whole lot of dump/restore recently. These operations take forever and are clearly not limited by disk I/O, at least on the external side.
The dump process is already table-by-table, so it would be interesting to see whether parallelizing these dumps improves performance: instead of serially dumping each table, dump N tables simultaneously.
Loading could work the same way: retain the file-per-table dumps and load them in parallel.
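A minimal sketch of the idea, using a worker pool to run N per-table operations at once. Everything here is hypothetical: `dump_table` and `load_table` stand in for whatever per-table dump/restore invocation the actual tooling provides (e.g. a `subprocess.run` of the dump command), and `workers` is the parallelism knob to experiment with.

```python
from concurrent.futures import ThreadPoolExecutor

def dump_table(table):
    # Hypothetical stand-in: in practice this would shell out to the
    # dump tool and write a file-per-table dump to disk, e.g.
    #   subprocess.run(["dumptool", "-t", table, "-f", f"{table}.dump"], check=True)
    return f"{table}.dump"

def load_table(path):
    # Hypothetical stand-in for loading one table's dump file, e.g.
    #   subprocess.run(["loadtool", "-f", path], check=True)
    return path

def run_parallel(fn, items, workers=4):
    # Apply fn to up to `workers` items at once instead of serially.
    # Threads suffice here because the real work happens in subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))
```

Usage would be `run_parallel(dump_table, tables)` on the dump side and `run_parallel(load_table, dump_files)` on the load side; comparing wall-clock time at `workers=1` versus higher values answers the "does this help" question.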
There's some complexity involved in stitching the pieces together, but figuring out whether this helps should be straightforward.