
Sort repository, database and notification services better (by network distance)
Open, Normal, Public

Description

Currently, when choosing a repository, database, or notification server to connect to, we primarily shuffle() all valid servers and choose randomly.

This isn't usually a bad way to rank them, and is probably in the ballpark of optimal when all servers are identical. However, if you deploy a cluster across multiple regions, some of the servers may be much closer to you on the network than others are, and it would often be better to try those first and send relatively little traffic to out-of-region datacenters.

If the only additional input to the ranking algorithm that we want to use is network latency, it might be easiest to detect this automatically rather than requiring users to configure it. We can keep a record of how long it took to connect to servers from the current host, and favor nearby servers over distant servers based on actual measured latency.
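For the sake of discussion, here is a minimal sketch of what "record connect times and favor nearby servers" could look like. Nothing here is real Phabricator code; the class, method, and field names are invented purely to illustrate the idea in the paragraph above.

```python
import random

# Illustrative sketch only; not anything Phabricator actually ships.
# Idea: keep a smoothed record of how long connects took from this host,
# then sort candidate servers by that record instead of a pure shuffle.
class LatencyTracker:
    def __init__(self, smoothing=0.2):
        self.smoothing = smoothing
        self.avg_ms = {}  # server name -> smoothed connect time (ms)

    def record(self, server, connect_ms):
        prev = self.avg_ms.get(server)
        if prev is None:
            self.avg_ms[server] = connect_ms
        else:
            self.avg_ms[server] = (1 - self.smoothing) * prev + self.smoothing * connect_ms

    def rank(self, servers):
        servers = list(servers)
        random.shuffle(servers)  # keep the current random tiebreak
        # Servers we have never measured sort first (0ms) so they get probed.
        return sorted(servers, key=lambda s: self.avg_ms.get(s, 0.0))
```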

However, this is a bit magical and there may be some other use cases for selecting a different server? Maybe? I don't have a great one, really:

  • Implicit magic is often bad; this is pretty magical.
  • Gut feeling is that network latency almost certainly isn't the only reasonable concern here.
  • You have a primary office in California, and a satellite office in Antarctica. You want Antarctica to write to California so that the penguin engineers bear the full cost of their high latency and minimally disrupt the California office, where 99% of your engineering headcount is. (If there are primarily-Antarctic repositories, they could be on a second cluster configured with Antarctic masters.)

It would be nice to get a better sense of use cases where "use measured latency" gets the wrong result before designing a solution here.

Event Timeline

See also T10883 for related work: making some nodes read-only.

I don't currently expect anyone to be deploying multi-region clusters in the first iteration of real clustering, so this isn't high on my list of things to fix in this round.

Another possible use case is that, e.g., when selecting admin notification servers, you may not have a network path to some of them at all (because they're on a different subnet in a different datacenter). You can currently deal with this by deploying different configuration in each datacenter (some pathway between the two datacenters must exist so that you can receive notifications from both), but it's possible that ranking tools could help with use cases in this vein.

> I don't currently expect anyone to be deploying multi-region clusters in the first iteration of real clustering

Hah, we were actually talking about this today: we want to deploy build agents in regions closer to where deployments are occurring (so we don't have to transfer large binary blobs across regions during deployment preparations), and we were discussing how we could use multi-master repositories to geo-replicate the repository data close to the build agents.

> Gut feeling is that network latency almost certainly isn't the only reasonable concern here.

For large pieces of data like repositories, network latency isn't as important as available bandwidth. The issue is that an IP address can have different available bandwidth from different data centres, because even though it might appear to be on the same local network (or you have routes configured), the VPN might actually be jumping over the internet.

Realistically, the only thing we care about geo-replicating is the repositories, because of their size (and you always want to be pulling them from a copy in the local availability zone instead of through a VPN over the internet); we'd never geo-replicate the web tier or anything like that. So I think the main thing for us would be to have the SSH proxy box pick the right repository storage (the one that's on the same LAN as the SSH proxy box)? I'm not sure; these are just things I'm thinking about right now.

Is "use measured latency" probably always the correct rule in your (hypothetical) cluster setup? Or do you have any cases offhand where it isn't the desired result, and you really want to send traffic to a more distant node on the network?

Ah, link speed vs latency is interesting.

We would always measure latency as <src, dst>, so if an IP has worse bandwidth from a given host it should (usually? hopefully?) also show higher latency from that host, although I could sort-of-plausibly imagine that you might connect your datacenters partially by satellite and partially by dial-up or something. Maybe.

There's almost certainly some correlation where link speed increases as latency decreases, because of proximity? But having an SSH proxy accidentally pick a link that's, say, in Oregon instead of N. California (because some local network issues at the time of sampling made the local copy appear worse) would be a reasonably bad situation for a few reasons:

  • If a link is 10Mbps over the internet vs 100Mbps over the local same-AZ network, then your build agent is going to take 10 times longer to clone. When you have a repository that takes 2-3 minutes to clone over 100Mbps, accidentally adding another 18 minutes onto a build because the system picked the wrong box is really less than ideal (some quick arithmetic after this list makes this concrete).
  • Amazon charges you money for transferring data across regions, so it's always better to just synchronise things once using geo-replication and then never do any other cross-region clones (especially because geo-replication sync is only the delta, whereas a cross-region clone to a build machine would be the full repository size).
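To make the first bullet concrete, here's the back-of-the-envelope arithmetic; the repository size is an assumption chosen to line up with the 2-3 minute figure above, not a measurement from any real cluster.

```python
# Illustrative arithmetic only; the repository size is assumed.
repo_gb = 1.5                      # assumed repository size
local_mbps, remote_mbps = 100, 10  # same-AZ link vs. cross-region link

def transfer_minutes(size_gb, mbps):
    return size_gb * 8 * 1000 / mbps / 60  # GB -> megabits, then minutes

print(transfer_minutes(repo_gb, local_mbps))   # 2.0 minutes on the local link
print(transfer_minutes(repo_gb, remote_mbps))  # 20.0 minutes cross-region (+18)
```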

I mean none of these things are show-stoppers in the example above, but you can imagine how bad things would be over a link between the Sydney data centre and N. California if it was happening all the time.

Yeah, measuring latency and using it as a signal is tricky. If there's a long disruption and Oregon measures faster than N. California for "a while", we probably legitimately want to use it. If N. California completely goes down, we almost certainly want to keep things alive and send writes to Mars, even if Mars is at a huge network distance, since "slow write" is probably almost always better than "complete failure" (although maybe Mars bandwidth is so much more expensive that we don't want to fail over to it, or the size of the link is small enough that it simply can't accommodate the failover traffic).

But if there's a temporary disruption, we don't want to stick the network in some crazy send-traffic-across-the-planet mode for a long time just because a couple of measurements were bad.

We also don't want to measure two identical nodes as "1.1ms" and "1.2ms" away and then send all of the traffic to the "closer" one instead of distributing it evenly. We should recognize that these nodes are "pretty much the same" and select randomly.

But the right values of "a while" and "pretty much the same" and the various other heuristics this would need to use are always going to be a bit fuzzy.
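As a sketch of how those fuzzy knobs might look in practice: the constants and names below are made up, and a real implementation would need more careful heuristics than this, but it shows the "treat near-equal latencies as ties and don't trust a couple of bad samples" idea.

```python
import random

# Illustrative only: the two fuzzy knobs mentioned above, as constants.
SAME_BUCKET_MS = 5.0           # latencies within 5ms are "pretty much the same"
MIN_SAMPLES_BEFORE_TRUST = 20  # don't re-rank on a couple of bad measurements

def choose(servers, avg_latency_ms, sample_count):
    """Pick a server, treating near-equal latencies as ties."""
    measured = [s for s in servers
                if sample_count.get(s, 0) >= MIN_SAMPLES_BEFORE_TRUST]
    if not measured:
        return random.choice(servers)
    best = min(avg_latency_ms[s] for s in measured)
    # Every server within the bucket of the best one is "pretty much the same"...
    candidates = [s for s in measured if avg_latency_ms[s] - best <= SAME_BUCKET_MS]
    # ...so distribute traffic across them randomly instead of piling onto
    # whichever node happened to measure 0.1ms faster.
    return random.choice(candidates)
```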

See PHI860 and T13111. In the future, repository nodes may automatically gc/prune/repack. If they do, it may make sense to sort them to the bottom of the list so traffic is sent to them only if no other nodes are available, in order to minimize the impact that gc/prune/repack have on other activity.
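If that happens, the ranking could end up being a composite sort key along these lines. This is purely illustrative; the field names are invented and nothing here reflects how PHI860 or T13111 will actually be implemented.

```python
import random

# Hypothetical sort key combining the ideas in this task: nodes busy with
# gc/prune/repack sort to the bottom, nearer latency buckets sort first,
# and a random tiebreak spreads traffic across equally-good nodes.
def sort_key(node):
    return (
        1 if node.get("in_maintenance") else 0,  # maintenance nodes last
        node.get("latency_bucket", 0),           # nearer buckets first
        random.random(),                         # shuffle within a bucket
    )

nodes = [
    {"name": "repo001", "in_maintenance": False, "latency_bucket": 0},
    {"name": "repo002", "in_maintenance": True,  "latency_bucket": 0},
    {"name": "repo003", "in_maintenance": False, "latency_bucket": 2},
]
ranked = sorted(nodes, key=sort_key)  # repo001, repo003, repo002
```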