On Git cluster read failure, retry safe requests

Depends on D20775. Ref T13286. When a Git read request fails against a cluster and there are other nodes we could safely try, try more nodes.

We DO NOT retry the request if:

  • the client read anything;
  • the client wrote anything;
  • or we've already retried several times.

Some requests where bytes went over the wire in either direction may still be safe to retry, but they're rare in practice under Git, and handling them would mean puzzling out what state we can safely emit.

Most types of failure result in an outright connection failure, and this rule catches all of those, so it's likely to be sufficient in nearly every practical case.
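
The retry rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this change: `Attempt`, `should_retry`, `run_with_retry`, and the `MAX_ATTEMPTS` value are all hypothetical names and numbers chosen for the example (the change itself only says "several" retries).

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical limit; the actual number of retries is not stated here.
MAX_ATTEMPTS = 3


@dataclass
class Attempt:
    """Outcome of proxying one request to one cluster node (hypothetical)."""
    err: int                 # exit code of the remote command, 0 on success
    client_read_bytes: int   # bytes we already sent back to the client
    client_wrote_bytes: int  # bytes we already consumed from the client


def should_retry(attempt: Attempt, attempts_made: int, nodes_left: int,
                 max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True only when retrying on another node is known to be safe."""
    if attempt.err == 0:
        return False  # nothing failed, nothing to retry
    if attempt.client_read_bytes or attempt.client_wrote_bytes:
        return False  # bytes crossed the wire; we can't safely replay
    if attempts_made >= max_attempts:
        return False  # already tried several times
    return nodes_left > 0  # need another node to try


def run_with_retry(nodes: Sequence[str],
                   execute: Callable[[str], Attempt],
                   max_attempts: int = MAX_ATTEMPTS) -> Attempt:
    """Try nodes in order until one succeeds or retrying stops being safe."""
    if not nodes:
        raise ValueError("no nodes to try")
    attempt = None
    for index, node in enumerate(nodes):
        attempt = execute(node)
        nodes_left = len(nodes) - index - 1
        if not should_retry(attempt, index + 1, nodes_left, max_attempts):
            break
    return attempt
```

Under these assumptions, a connection failure against a down node (no bytes in either direction) falls through to the next node, while a failure reported after bytes were written is returned to the caller as-is.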

Test Plan:

  • Started a cluster with one up node and one down node, pulled it.
  • Half the time, hit the up node and got a clean pull.
  • Half the time, hit the down node and got a connection failure followed by a retry and a clean pull.
  • Forced $err = 1 so even successful attempts would retry.
  • On hitting the up node, got a "failure" and a decline to retry (bytes already written).
  • On hitting the down node, got a failure and a real retry.
  • (Note that, in both cases, "git pull" exits "0" after the valid wire transaction takes place, even though the remote exited non-zero. If the server gave Git everything it asked for, it doesn't seem to care if the server then exited with an error code.)

Maniphest Tasks: T13286

Differential Revision: https://secure.phabricator.com/D20776