
Instance reporting that synchronizeWorkingCopyBeforeRead() effectively fails
Open, Needs Triage, Public

Description

An instance is reporting errors like this in a clustered repository:

Command failed with error #128!
COMMAND
git grep --extended-regexp --null -n --no-color -e '' 'X' -- 'Y'

STDOUT
(empty)

STDERR
fatal: bad object X at [<phutil>/src/future/exec/ExecFuture.php:369]

...where X is the hash of a valid commit which should be reachable (and Y is a path, although this probably isn't particularly relevant).

This is consistent with synchronization before a read simply failing.

Event Timeline

I think T12393 and this probably have a similar set of root causes.

Clustering was built for hosted repositories first, and the model is cleaner there because updates are simpler. Updates are always via git push (or similar), and we acquire and hold a lock around the whole operation. So it's a traditional "acquire locks, make edits, release locks" sort of situation. Calculating versions is also easier: the version is always the previous version plus 1.
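A minimal sketch of what that hosted write path boils down to (Python, with placeholder names rather than Phabricator's actual API): the whole push happens inside one lock, so the new version is trivially the previous version plus one.

import threading

# Hypothetical stand-ins: write_lock for the cluster write lock, and
# repository_versions for the device-visible version table.
write_lock = threading.Lock()
repository_versions = {"rXYZ": 93}

def handle_hosted_push(repository, apply_push):
    with write_lock:                             # acquire locks
        apply_push()                             # make edits (the `git push` itself)
        repository_versions[repository] += 1     # new version = previous version + 1
                                                 # lock released on exit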

For observed repositories, updates are via git fetch from the remote, and calculating versions is currently relatively complicated. There are two broad lock issues:

  • The daemon git fetch does not hold the same locks that the ClusterEngine git fetch does, so two can incorrectly run simultaneously (likely the cause of T12393).
  • There is a fairly wide window between the git fetch and the repository version getting bumped (possibly multiple seconds, I think), and we do not hold a lock across it. During this window, one node can see X as HEAD while another node still sees Y as HEAD, and reads against either node will pass a synchronizeWorkingCopyBeforeRead() check (likely the cause of this).

I believe these problems are unique to observed, clustered repositories.

The first issue is not particularly concerning. It's annoying, but doesn't violate any guarantees made by clustering, and is likely easy to resolve.

The second issue violates guarantees, since it makes this sequence possible:

  • Query the API, see commit X (you hit a node which has done a git fetch but hasn't completed discovery yet, so X is on disk but not "known").
  • Query the API again (after the first query finishes), don't see commit X (that first node still hasn't completed discovery and bumped the version).

To make matters worse, update (which does the git fetch) and discover (which does the version bump) are currently in separate processes so holding a lock across them is inconvenient.

I think the pathway forward here is tied to some of the issues in T12296 -- basically, get git fetch and version bumps under the same cluster locks that cluster operations use. The sequence is probably something like this (sketched in code after the list):

  • Do git ls-remote (which, curiously, seems to be significantly faster than git fetch even when git fetch is a no-op).
  • Do git for-each-ref (we could eventually cache this).
  • Compare the lists.
  • If the lists differ, grab locks.
  • git fetch
  • Probably bump the version unconditionally now: it is possible that the fetch was a no-op if:
    • something else did the fetch between the time we did git for-each-ref and grabbed the lock; or
    • someone force-pushed the remote back into a previous state between the git ls-remote and the fetch (e.g., pushed, then immediately deleted, a branch).
    • ...but both of these should be rare enough that running another git for-each-ref to check isn't worthwhile on average. (We could also try to detect the first one on the cheap.)
    • ...but might be fine if we're filling a cache?
  • Release locks.
  • Also, remove the versioning code from discover -- it only lives there because the version calculation currently depends on the outcome of the discovery process.
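As a rough Python sketch of that sequence. The acquire_cluster_lock(), release_cluster_lock(), and bump_version() helpers are hypothetical placeholders, not existing Phabricator APIs, and the ref comparison is simplified (real local and remote ref namespaces would need some normalization):

import subprocess

def list_remote_refs(remote_uri):
    # `git ls-remote --refs` prints "<hash>\t<refname>" lines (no HEAD, no
    # peeled tags); map refname -> hash.
    out = subprocess.check_output(
        ["git", "ls-remote", "--refs", remote_uri], text=True)
    refs = {}
    for line in out.splitlines():
        commit_hash, ref = line.split("\t", 1)
        refs[ref] = commit_hash
    return refs

def list_local_refs(working_copy):
    # `git for-each-ref` gives the same shape for the local working copy;
    # this is the part we could eventually cache.
    out = subprocess.check_output(
        ["git", "for-each-ref", "--format=%(objectname)\t%(refname)"],
        cwd=working_copy, text=True)
    refs = {}
    for line in out.splitlines():
        commit_hash, ref = line.split("\t", 1)
        refs[ref] = commit_hash
    return refs

def update_observed_repository(working_copy, remote_uri):
    if list_remote_refs(remote_uri) == list_local_refs(working_copy):
        return  # Nothing changed; skip the fetch and the version bump entirely.

    acquire_cluster_lock(working_copy)           # hypothetical locking helper
    try:
        subprocess.check_call(["git", "fetch"], cwd=working_copy)
        # Bump unconditionally: a redundant bump is cheap, and the races where
        # the fetch turns out to be a no-op should be rare.
        bump_version(working_copy)               # hypothetical version helper
    finally:
        release_cluster_lock(working_copy)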

This should make the cluster fetch process more efficient (see T12296), since we can do less work when the fetch is a no-op (and less work to identify that it's a no-op), and it should make the pathway from git fetch to version synchronization shorter and simpler. We should end up with slightly less lock code overall, too.

Do git ls-remote (which, curiously, seems to be significantly faster than git fetch even when git fetch is a no-op).

This is by design in git. It's optimized to minimize network traffic at the expense of local IO. For fetches, it only wants to download the objects it needs, but to do that, it has to compare "What do I have" vs. "What do you have". This becomes even more obvious once you consider that git fetch requires a bare/working copy and git ls-remote does not.
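If you want to see the gap for yourself, a quick timing comparison like this makes it visible (the URL and path are placeholders for a real remote and an already up-to-date working copy):

import subprocess
import time

def timed(cmd, cwd=None):
    # Run a command and return how long it took, discarding its output.
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)
    return time.perf_counter() - start

remote_uri = "https://example.com/project.git"   # placeholder
working_copy = "/var/repo/project"               # placeholder

print("ls-remote:   %.3fs" % timed(["git", "ls-remote", remote_uri]))
print("no-op fetch: %.3fs" % timed(["git", "fetch"], cwd=working_copy))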

it has to compare "What do I have" vs. "What do you have"

Maybe you misread or I'm misunderstanding?

My expectation is that a no-op git fetch does one of these things:

  • Locally, runs git for-each-ref, then sends the list to the remote and says "here's what I have, what do you have?". When it realizes the difference is empty, it exits.
  • Runs git ls-remote, then asks the local working copy "here's what the remote says it has, what do I need?". When it realizes the difference is empty, it exits.

I don't see a way to build this list without sending the list of refs over the wire in one direction or the other. Am I just not thinking very creatively? Or are you saying git aggressively tries to minimize a few bytes of network traffic and sends only parts of this list, building the intersection piecemeal?

(For example, if the goal was to very aggressively optimize for minimizing network traffic, we could read the entire repository history to figure out which refs were ancestors of other refs first? Then we could drop those and only ask for descendant refs. But this seems crazy, since it's saying that ~20 bytes of network traffic is more costly than like one hundred million disk I/O operations?)

Probably bump the version unconditionally now

If we do this, we get some churn when this happens:

  • A user pushes a commit (X) to the writable remote.
  • Device A checks the remote, sees new stuff (X), fetches the commit and bumps the version, to v94.
  • Device B checks the remote, sees new stuff (X), fetches the commit and bumps the version again, to v95. Both v94 and v95 refer to the same repository state.
  • Device A may now do a pointless no-op fetch the next time it serves a read, in order to synchronize up to Device B's version, even though Device A and Device B have the same repository state.

But I think this is fairly easy to fix by just hashing the repository state (see the sketch after this list), and bumping the version if:

  • we fetched something; and
  • the leader's repository state hash differs from our repository state hash after the fetch.
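As a rough sketch of that check, where the "repository state hash" is simply a digest of the local ref list, and fetch_from_remote(), leader_state_hash(), and bump_version() are hypothetical placeholders rather than existing Phabricator APIs:

import hashlib
import subprocess

def repository_state_hash(working_copy):
    # Hash the full ref list; two devices with identical refs get the same hash.
    refs = subprocess.check_output(
        ["git", "for-each-ref", "--format=%(objectname) %(refname)"],
        cwd=working_copy, text=True)
    return hashlib.sha1(refs.encode("utf-8")).hexdigest()

def fetch_and_maybe_bump(working_copy, remote_uri):
    fetched_something = fetch_from_remote(working_copy, remote_uri)   # hypothetical
    if not fetched_something:
        return
    # Only bump if our post-fetch state actually differs from the leader's,
    # so two devices fetching the same commit don't churn the version.
    if repository_state_hash(working_copy) != leader_state_hash():    # hypothetical
        bump_version(working_copy)                                    # hypothetical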

I'm going to put D17497 + D17498 into this release even though they don't directly tackle the issue here and I think they're slightly risky changes (mostly because git ls-remote may have odd behaviors in some cases, and we don't currently use it in other workflows). But they may help with T12296 and general cluster load issues, and the followup changes will generally be more complicated to reason about (more locking/concurrency stuff), so I think getting these in earlier spreads risk out somewhat even though they're something to watch out for in this release.