
When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover
Open, Normal, Public

Description

See PHI1206.

Suppose we have a 2-node repository cluster, with nodes A and B. Both nodes are fully synchronized, so they can both accept reads and writes.

We suffer a loss of node B.

For simplicity, assume initial routing of external traffic is not affected by this node failure: either the LB is doing some kind of health check on its own and pulls node B out of service as a termination node, or SSH traffic is terminated elsewhere (for example, on a separate web tier, as in the Phacility cluster). So inbound SSH read traffic terminates somewhere on a healthy node, then makes a routing decision to find a destination repository service node.

Today, reads will still be routed to node B (the node we've lost) half the time: we don't healthcheck nodes, so we don't know that we should route traffic away from B.

When traffic is routed to node B, the request will fail because the node is down.

From a user's point of view, reads to a cluster where N% of nodes are dead will fail N% of the time until an operator intervenes and disables the downed nodes. This doesn't make the service unavailable, but it does disrupt access to it.
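
As a rough sketch of the current behavior (in Python, with hypothetical node names and equal weights): node selection ignores health entirely, so the failure rate tracks the dead node's share of the total weight.

import random

# Hypothetical cluster: node name -> routing weight. "repo-node-b" is down,
# but the router keeps no health state, so it stays in the selection pool.
NODES = {"repo-node-a": 1, "repo-node-b": 1}
DOWN = {"repo-node-b"}

def pick_node():
    # Weight-only random choice over all configured nodes, healthy or not.
    names = list(NODES)
    weights = [NODES[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

failures = sum(1 for _ in range(10_000) if pick_node() in DOWN)
print(f"failed reads: {failures / 10_000:.1%}")  # ~50% with two equal-weight nodes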

We would prefer to:

  1. detect that a request to a repository destination has failed utterly and should be retried, then retry the request against another node; and
  2. maintain a stateful view of node health so we can route traffic away from downed nodes.

Deciding to Retry a Request

The primary difficulty with doing (1) is that when we run ssh destination-node -- git-upload-pack ..., it is difficult for us to distinguish between failures which should be retried and failures which should not be retried. The SSH subprocess may exit with an error code to indicate many different error conditions, some of which should be retried (like connection failures or a full disk condition) and some of which should not (like a bad repository URI, or an access permission issue). I believe we can not reliably distinguish between retry conditions and non-retry conditions only by examining the exit code. See also Command Line Exit Codes for general complaining about exit codes being useless for building reliable software.
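
As an illustration of the ambiguity, here is a minimal sketch (in Python, with hypothetical node and repository names) of the only signal such a subprocess hands back:

import subprocess

def run_upload_pack(node, repo_path):
    # Proxy the read by running git-upload-pack on the destination node.
    # `node` and `repo_path` are hypothetical placeholders.
    proc = subprocess.run(
        ["ssh", node, "--", "git-upload-pack", repo_path],
        capture_output=True,
    )
    return proc.returncode

code = run_upload_pack("repo-node-b.example.com", "/path/to/turtle.git")

# OpenSSH reports its own transport and authentication failures with exit
# code 255 and otherwise passes through the remote command's exit status --
# but the remote command may itself exit with any code, and "repository not
# found", "permission denied", and "disk full" can all surface as the same
# nonzero status. The exit code alone does not tell us whether retrying
# against another node is safe or useful.
if code != 0:
    print(f"request failed with exit code {code}; retry decision is ambiguous")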

Idempotency

Regardless of the reason for failure, we should never retry a request if it may already have had a non-idempotent effect (generally, performed some kind of write).

When the remote command is git-upload-pack, the request is an idempotent read (if all is right with the world), so we can always safely retry it if we make a retry decision. The caveats are:

  • if we incorrectly make a retry decision (for example, retry on bad repository URI), every node will fail and clients will sit through a series of retries. We can impose a retry limit to reduce the badness here.
  • if we connect to node X and send bytes, then make a retry decision for node Y, we must replay the bytes we sent.

When the remote command is git-receive-pack or similar, the request is not idempotent and we can not safely retry it.

(Aside: In a pure technical sense, many Git repository writes are idempotent, but in the presence of commit hooks we should treat them as non-idempotent in all cases. Even if we were willing to ignore this, git push origin :x is not idempotent: the first call deletes x and succeeds; the second call fails and exits with an error code.)

In all cases, git-upload-pack writes before it reads, so we can cheat our way through this in Git with a rule like this:

  • if we wrote bytes, never retry.

This may not be a good rule in all cases in the long term, but is safe for Git reads, which are the immediate concern.

Combined with a retry limit, this is a small, safe ruleset with reasonable behavior:

  • if we wrote bytes, do not retry;
  • if we have already retried 3 times, do not retry;
  • otherwise, retry.

This will produce some very silly false-positive retries, e.g. 3x retries on bad URI or bad access, but is likely a reasonable starting point.
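
A minimal sketch of that ruleset (illustrative only; the names and structure are not the actual implementation):

MAX_RETRIES = 3  # the retry limit from the ruleset above

def should_retry(wrote_bytes, attempts):
    # Decide whether a failed git-upload-pack request may be retried against
    # another destination node, under the rules described above.
    if wrote_bytes:
        # We would have to replay bytes we already sent, and the request is
        # no longer known to be a pure read: never retry.
        return False
    if attempts >= MAX_RETRIES:
        # Bound the damage from safe-but-pointless retries (bad URI, bad access).
        return False
    return True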

Refining False Positives

We have two motivations to refine false positives, where we incorrectly decide to perform a safe-but-pointless retry after a permanent failure like a request to a bad URI:

  • we'll reduce silly client-visible behavior where you request /tourtle.git instead of /turtle.git and the server seems confused because it retries the request against several destination nodes; and
  • we can use the retry decision as an input to healthcheck state.

We'd prefer not to use false positives to influence healthcheck behavior: if we do, requesting /tourtle.git over and over again will eventually mark nodes as down. This is very funny but not especially robust.

To refine false positives, we can:

  • write a side-channel row into the database with a unique ID;
  • make the request with --side-channel-id=XYZ git-upload-pack ...;
  • have phabricator-ssh-exec mark the request as received once it starts.

If a request is marked as received, we assume any failure is an application failure and do not retry it (or, some day, let the destination node make the retry decision itself, once destinations are able to detect retryable conditions like "full disk" or "NFS failure").

This means that authentication failures will be retried, but since this intracluster routing should never encounter an authentication failure if it's properly configured, this behavior is reasonable. Other unreceived requests should reliably indicate real node failures.
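
A sketch of how that confirmation flow might be wired together (hypothetical function names, with an in-memory dictionary standing in for the database table):

import uuid

# Hypothetical in-memory stand-in for the side-channel table.
side_channel = {}

def start_request():
    # Step 1: write a side-channel row with a unique ID before dispatching.
    request_id = uuid.uuid4().hex
    side_channel[request_id] = {"received": False}
    return request_id

def mark_received(request_id):
    # Step 3: phabricator-ssh-exec marks the request as received once it starts.
    side_channel[request_id]["received"] = True

def is_node_failure(request_id):
    # A failed request that was never marked as received means the node never
    # picked up the work: treat it as a real node failure and retry elsewhere.
    # A received request that failed is an application failure: do not retry.
    return not side_channel[request_id]["received"]

# Step 2: the routing node makes the request with the ID attached, roughly:
#   ssh destination-node -- --side-channel-id=<request_id> git-upload-pack ...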

Event Timeline

epriestley triaged this task as Normal priority. May 10 2019, 5:24 PM
epriestley created this task.

we'll reduce silly client-visible behavior where you request /tourtle.git instead of /turtle.git and the server seems confused...

This specific type of failure doesn't make it to the retry code, since whichever outermost node takes the read figures out which repository you're reading (and must, because it needs this information to determine which leaf nodes can actually serve the request).

$ git fetch ssh://epriestley@local.phacility.com/source/tourtle.git
phabricator-ssh-exec: No repository "tourtle" exists!

Since your request has to be in pretty good shape to make it as far as the retry code, the amount of silly behavior that's actually client-visible looks like it's very small in practice. There's likely still some, but I think none of it is common or easy to reach.


Earlier, I proposed these basic retry rules:

  • if we wrote bytes, do not retry;
  • if we have already retried 3 times, do not retry;
  • otherwise, retry.

For now, I am adding this rule:

  • if we read bytes, do not retry;

This helps make sure we're really selecting a strict subset of safe-to-retry requests as retry candidates.

  • if we have already retried 3 times, do not retry;

I've removed this requirement in D20777. Instead, we'll just try every candidate service once, so we can support an assertion like "every request which can be served by at least one node will be served".
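
Putting the revised rules together, a sketch of the resulting dispatch loop (illustrative only, not the actual D20777 change) might look like this:

class NodeFailure(Exception):
    # Raised when a request to a destination node fails before completing.
    def __init__(self, wrote_bytes, read_bytes):
        super().__init__("node failure")
        self.wrote_bytes = wrote_bytes
        self.read_bytes = read_bytes

def serve_read(candidate_nodes, try_node):
    # Attempt the read against each candidate node at most once. `try_node`
    # is a hypothetical callable which proxies the request and raises
    # NodeFailure if the node can not serve it.
    for node in candidate_nodes:
        try:
            return try_node(node)
        except NodeFailure as failure:
            if failure.wrote_bytes or failure.read_bytes:
                # Once any bytes have moved in either direction, the request
                # is no longer known to be safely retryable: give up.
                raise
            # Otherwise, fall through to the next candidate. There is no
            # separate retry limit, so every request which can be served by
            # at least one node will be served.
    raise NodeFailure(wrote_bytes=False, read_bytes=False)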


Remaining pieces here are:

  • refine false positives by labeling and confirming connections (discussed in the task description above);
  • feed retry and connection information into health checks (see T13287) so operators have a clearer view of cluster health;
  • use health check information to improve the ordering of the readable node list, so we find a readable node faster on average.

The primary impacts here are operational (better observability of cluster failures) and quality-of-life (slightly faster pulls in down-node scenarios). These changes shouldn't impact whether requests succeed or fail.
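
As a sketch of that last item, assuming some per-node health state is available (a hypothetical map), the ordering might look like this:

def order_readable_nodes(nodes, health):
    # Sort readable nodes so nodes believed healthy are tried first. `health`
    # is a hypothetical map of node -> True / False / None (None = unknown).
    # Unhealthy nodes are not removed -- a stale health view must not make a
    # servable request fail -- they are just tried last.
    rank = {True: 0, None: 1, False: 2}
    return sorted(nodes, key=lambda node: rank[health.get(node)])

# With node B marked down, node A is tried first and pulls succeed faster.
print(order_readable_nodes(
    ["repo-node-b", "repo-node-a"],
    {"repo-node-b": False, "repo-node-a": True}))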


There is a small set of requests which we will currently decline to retry, but which we could safely retry. I suspect these requests (at least, under git fetch) are so rare that refining behavior here may not be worthwhile.

For example, if this happens:

  • The client connects and is routed to host A.
  • Host A sends half of the ref-list to the client.
  • Lightning strikes Host A, obliterating it.

...we could recover like this:

  • Connect to host B.
  • Replay any bytes the client sent to host A.
  • Read and discard bytes from host B until we've read a protocol string which is exactly equivalent to the bytes we've already sent to the client. (If host B responds with a different response, we must abort.)
  • Reconnect the client to the subprocess and continue.

This is very complicated and very fragile, and probably solves a problem that essentially no one ever experiences, so I suspect it isn't worth pursuing for git fetch. In other cases, where some parts of the protocol are more static/routine, it may be reasonable to support resume-after-handshake or similar.
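
For illustration only, a sketch of that replay-and-discard recovery (with all stream handling abstracted into hypothetical helpers):

def resume_on_replacement_node(new_node, client_sent, already_sent_to_client):
    # `new_node` is a hypothetical connection object with read()/write();
    # `client_sent` is everything the client sent to the dead host;
    # `already_sent_to_client` is everything the dead host sent to the client.

    # Replay any bytes the client sent to the dead host.
    new_node.write(client_sent)

    # Read and discard bytes from the replacement host until we have consumed
    # exactly the bytes the client has already seen; abort on any divergence.
    seen = b""
    while len(seen) < len(already_sent_to_client):
        chunk = new_node.read(len(already_sent_to_client) - len(seen))
        if not chunk:
            raise IOError("replacement node closed the stream during replay")
        seen += chunk
        if not already_sent_to_client.startswith(seen):
            # The replacement node produced a different response; we can not
            # splice the streams together.
            raise IOError("replacement node diverged; aborting resume")

    # The replacement node's output is now aligned with what the client has
    # already seen; reconnect the client and continue proxying as normal.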

Also remaining is to extend this behavior to the HTTP pathway (and to Mercurial/SVN, eventually).