See PHI1206.
Suppose we have a 2-node repository cluster, with nodes A and B. Both nodes are fully synchronized, so they can both accept reads and writes.
We suffer a loss of node B.
For simplicity, assume initial routing of external traffic is not affected by this node failure: either the LB is doing some kind of health check on its own and pulls node B out of service as a termination node, or SSH traffic is terminated elsewhere (for example, on a separate web tier, as in the Phacility cluster). So inbound SSH read traffic terminates somewhere on a healthy node, then makes a routing decision to find a destination repository service node.
Today, reads will still be routed to node B (the node we've lost) half the time: we don't healthcheck nodes, so we don't know that we should route traffic away from B.
When traffic is routed to node B, the request will fail because the node is down.
From a user point of view, reads to a cluster where N% of nodes are dead will fail N% of the time until an operator intervenes and disables the downed nodes. This doesn't make the service unavailable, but does disrupt access to it.
We would prefer to:
- detect that a request to a repository destination has failed utterly and should be retried, then retry the request against another node; and
- maintain a stateful view of node health so we can route traffic away from downed nodes.
Deciding to Retry a Request
The primary difficulty with doing (1) is that when we run ssh destination-node -- git-upload-pack ..., it is difficult for us to distinguish between failures which should be retried and failures which should not be retried. The SSH subprocess may exit with an error code to indicate many different error conditions, some of which should be retried (like connection failures or a full disk condition) and some of which should not (like a bad repository URI, or an access permission issue). I believe we can not reliably distinguish between retry conditions and non-retry conditions only by examining the exit code. See also Command Line Exit Codes for general complaining about exit codes being useless for building reliable software.
Idempotency
Regardless of the reason for failure, we should never retry requests if the request may already have had a non-idempotent effect (generally, performed some kind of write).
When the remote command is git-upload-pack, the request is an idempotent read (if all is right with the world), so we can always safely retry it if we make a retry decision. The caveats are:
- if we incorrectly make a retry decision (for example, retry on bad repository URI), every node will fail and clients will sit through a series of retries. We can impose a retry limit to reduce the badness here.
- if we connect to node X and send bytes, then make a retry decision for node Y, we must replay the bytes we sent.
When the remote command is git-receive-pack or similar, the request is not idempotent and we can not safely retry it.
(Aside: In a pure technical sense, many Git repository writes are idempotent, but in the presence of commit hooks we should treat them as non-idempotent in all cases. Even if we were willing to ignore this, git push origin :x is not idempotent: the first call deletes x and succeeds; the second call fails and exits with an error code.)
In all cases, git-upload-pack writes before it reads, so we can cheat our way through this in Git with a rule like this:
- if we wrote bytes, never retry.
This may not be a good rule in all cases in the long term, but is safe for Git reads, which are the immediate concern.
Combined with a retry limit, this is a small, safe ruleset with reasonable behavior:
- if we wrote bytes, do not retry;
- if we have already retried 3 times, do not retry;
- otherwise, retry.
This will produce some very silly false-positive retries, e.g. 3x retries on bad URI or bad access, but is likely a reasonable starting point.
Refining False Positives
We have two motivations to refine false positives, where we incorrectly decide to perform a safe-but-pointless retry after a permanent failure like a request to a bad URI:
- we'll reduce silly client-visible behavior where you request /tourtle.git instead of /turtle.git and the server seems confused because it retires the request against several destination nodes; and
- we can use the retry decision as an input to healthcheck state.
We'd prefer not to use false positives to influence healthcheck behavior: if we do, requesting /tourtle.git over and over again will eventually mark nodes as down. This is very funny but not especially robust.
To refine false positives, we can:
- write a side-channel row into the database with a unique ID;
- make the request with --side-channel-id=XYZ git-upload-pack ...;
- have phabricator-ssh-exec mark the request as received once it starts.
If a request is marked as received, we assume any failure is an application failure and do not retry it (or, some day, let the destination node make a retry decision if destinations are able to detect retryable conditions like "full disk" or "NFS failure" in the future).
This means that authentication failures will be retried, but since this intracluster routing should never encounter an authentication failure if it's properly configured, this behavior is reasonable. Other unreceived requests should reliably indicate real node failures.