When out-of-date nodes receive a read or write, they must first synchronize (bring themselves up to date so they have the most recent version of the repository).
Before D19735, they do this in an underspecified way that probably tends to slam one node with all the traffic. After D19735, they synchronize from a random up-to-date node, which works well as long as several nodes are up to date.
However, when only one node is up to date, we may still slam that node with a large amount of read traffic: if the candidate list has only one host, we can't spread traffic no matter how hard we shuffle it.
We could improve this approach as follows:
(1) When a node receives a write, it marks itself as the root node for that write (the first node to receive that write).
(2) When we synchronize, we build a deterministic spanning tree across the cluster based on the cluster members and the root node. For example: sort all the nodes by device PHID; then, starting at the root node for the write, insert the nodes into a tree which always fills breadth-first in a predictable way. When we reach the end of the sorted list, wrap around to the beginning until we've inserted every node. This tree could be binary or use some other branching factor, but when we're done building it we have a map of how we should synchronize, which every node can construct independently in a predictable way. Then, each node synchronizes from its parent in the tree.
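A minimal sketch of one such deterministic construction, assuming nodes are identified by device PHID strings (the function name and the heap-style breadth-first layout here are illustrative; any branching scheme works as long as every node computes the same tree):

```python
def build_sync_tree(phids, root):
    """Return a parent map: child PHID -> parent PHID (root maps to None).

    Every node can rebuild the identical map given only the cluster
    membership and the root, because the ordering is deterministic.
    """
    ordered = sorted(phids)
    # Rotate the sorted list so the write's root node comes first, then
    # wrap around to the beginning so every node is included once.
    start = ordered.index(root)
    ordered = ordered[start:] + ordered[:start]
    # Fill a binary tree breadth-first: the node at index i has its
    # parent at index (i - 1) // 2, as in a binary heap.
    parents = {root: None}
    for i, phid in enumerate(ordered[1:], start=1):
        parents[phid] = ordered[(i - 1) // 2]
    return parents
```

Any other deterministic fill order would do equally well; this particular one may assign siblings slightly differently than a hand-drawn example, which is fine as long as all nodes agree.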
Suppose we have an 8-node cluster and node "E" received the write. We might build this tree (or any other tree, as long as "E" is the root and every node can build this tree in a predictable way given the knowledge that there are 8 nodes in the cluster and "E" must be the root):
        E
       / \
      F   G
     / \ / \
    H  B A  C
   /
  D
Then every node synchronizes against its parent. For example, if "H" receives a request, it synchronizes against "F". Either "F" is already up to date, or this triggers "F" to synchronize against "E".
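A sketch of that sync-from-parent walk against the example tree, with the parent map written out by hand and an integer version standing in for actual repository state (all names here are invented for illustration):

```python
# Parent map for the example tree: "E" is the root that took the write.
PARENTS = {
    "E": None,
    "F": "E", "G": "E",
    "H": "F", "B": "F",
    "A": "G", "C": "G",
    "D": "H",
}

def synchronize(node, versions, target):
    """Pull `node` up to `target` by first pulling its ancestors."""
    if versions[node] >= target:
        return  # already up to date
    parent = PARENTS[node]
    if parent is not None:
        # This is the cascade: a request landing on "H" pulls "F",
        # which pulls "E" first if "E" is ahead.
        synchronize(parent, versions, target)
        versions[node] = versions[parent]

# A request arrives at "H" while only "E" has the new version:
versions = {node: 0 for node in PARENTS}
versions["E"] = 997
synchronize("H", versions, 997)
```

Only the ancestor chain of the queried node does any work; the rest of the cluster stays lazy until a request actually lands on it.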
This is probably close to optimal and doesn't seem very complex. The CLI can actually make it fairly observable, too, I think.
I think the only trick is that if we query "F" and another write comes in, "E" may no longer be the root, so "F" may need to ask some other node in the new tree (if we asked "F" based on version 997 of the tree and we then update to version 998, "F" may now need to ask "G" to sync). However, I think this can't deadlock: if "F" never updated, no node below "F" could possibly have taken a write for version 998, since no node below "F" could have reached version 997. And, after D19734, the next write will "almost always" go to "E" anyway, so we'll keep the same tree. This could get a little goofy, but it should rapidly converge toward something sane.
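The re-ask behavior above could be a simple retry loop: pull from the parent for the current tree version, and if the version bumped while we were pulling, re-resolve the parent against the new tree and try again. A sketch, where `get_tree_version`, `get_parent`, and `pull` are hypothetical stand-ins for "read the current repository version", "rebuild the tree for that version's root and look up our parent", and "synchronize from that parent":

```python
def sync_with_retries(node, get_tree_version, get_parent, pull):
    """Pull from our parent, re-resolving the parent whenever the tree
    version bumps underneath us (a new write moves the root and so
    reshapes the tree)."""
    while True:
        version = get_tree_version()
        parent = get_parent(node, version)  # parent of `node` in this version's tree
        pull(node, parent)
        if get_tree_version() == version:
            return  # no write raced us; we're up to date
        # The tree changed while we were pulling; loop and ask the
        # new parent (e.g. "F" pulled "E" for 997, now asks "G" for 998).
```

In the 997-to-998 scenario, "F" would pull "E" once, notice the bump, recompute its parent in the new tree, and pull "G"; in the common case the version is stable and the loop runs exactly once.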
You can perhaps force it to deadlock by removing "F" from the cluster between when we route the sync for 997 and when the write for 998 comes in. Then perhaps "D" receives the write and synchronizes directly from "E"; then the request arrives at "F" and the new tree tells it to synchronize from "H". But the read lock on "H" is held by the request to "F", so maybe those nodes deadlock. This seems hard to hit, can be made observable in the CLI, and whoever is pushing can probably ^C to resolve it; we could also just make the tree-read abort if it sees that the version bumped, so it automatically converges to sanity.
I'd like D19735 to sit in production for a bit before pursuing this but I think it gets us pretty close to optimal on sync routing, concerns like network distance notwithstanding (and we can build network distance concerns into the tree algorithm).