See T11665 for context: several above-average-sized active instances have ended up co-hosted on the same shard, which is unevenly loading that shard.
A good tool here would be to be able to pick the biggest instance and move it to a new, empty shard.
In general, moving instances between shards is something we need in the long run, and need in a more general sense for private clustering (T11230).
In theory, we would do this seamlessly by configuring double writes, making the new shard authoritative, and turning off the configuration pointing at the old shard. In practice, I think this won't work well. It would mean that the database server was replicating only some tables (and possibly a master for some databases and replica for other databases, at least briefly), which seems fairly fragile. This also requires MySQL service restarts, which impact all instances instead of only the moving instance.
I think a "stop the world" approach is likely better: suspend the instance, copy the data, swap the config, restore the instance. This results in some downtime for the instance, but should generally be simpler and cleaner.
I'm planning to:
- bring up a new shard;
- migrate a test instance;
- document the process;
- migrate a real upstream instance;
- migrate a real user instance.
I'll automate this if possible, but I suspect we may need to refine the process a bit first.