
Periodically run `git prune` on Git working copies
Open, Normal, Public

Description

See PHI497. An instance ran into an issue where the remote was complaining about too many unreachable objects:

remote: warning: There are too many unreachable loose objects; run 'git prune' to remove them.

We don't currently run git prune or git gc automatically, since skipping them has never caused issues before, but we should probably start running git gc on working copies every so often.

One possible issue with this is that git gc can take a very long time to complete on large working copies. Previously, in PHI386, an unusually large repository for the same instance took about 5 hours to git gc. However, it took less than a minute to git prune, so git prune may be safer to run regularly.

A possible workaround is to run these operations as part of bin/remote optimize, or some similar workflow, in the cluster only.

Event Timeline

epriestley triaged this task as Normal priority. Mar 22 2018, 4:56 PM
epriestley created this task.

Previously, in PHI386, an unusually large repository for the same instance took about 5 hours to git gc.

Is that possibly just because git gc had not been run recently, and regular executions wouldn't accumulate so much badness?
If git prune was run after git gc, I'd suspect that the prior git gc was the reason the prune was so fast.

That's a reasonable point. The details of PHI386 weren't very indicative one way or another since I didn't end up GC'ing the repository multiple times. I suppose I can go GC it again during the deployment window this week and see how long it takes.

alexmv added a subscriber: alexmv. Apr 4 2018, 6:17 PM
aubort added a subscriber: aubort. Aug 9 2018, 8:29 AM
epriestley moved this task from Backlog to Do Eventually on the Phacility board. Aug 10 2018, 6:06 PM

PHI860 is a close variant of this and discusses periodically running git repack. Most concerns around git repack are likely similar to concerns around gc and prune.

When a clustered repository node is repacking, collecting, or pruning, it may make sense for it to mark itself as "lower priority" for reads and writes. See also T10884. This would let it stay in the cluster, but shed most traffic until the operation completes.
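As a rough illustration of the "lower priority" idea, here is a minimal sketch of routing that keeps a busy node in the cluster but sends traffic to it last. The function name, node names, and the notion of a "maintenance" set are all hypothetical; this is not an existing Phabricator API.

```python
def order_nodes_for_routing(nodes, in_maintenance):
    """Return nodes in routing preference order.

    Hypothetical sketch: nodes that are repacking, collecting, or pruning
    stay in the list (so they can still serve traffic if nothing else is
    available) but are tried only after every idle node.
    """
    idle = [n for n in nodes if n not in in_maintenance]
    busy = [n for n in nodes if n in in_maintenance]
    return idle + busy


# Example with made-up node names: repo002 is mid-repack, so it sorts last.
preferred = order_nodes_for_routing(
    ["repo001", "repo002", "repo003"], in_maintenance={"repo002"}
)
```

If every node is busy, the order is unchanged, so the cluster keeps serving requests rather than refusing them.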

See PHI1367, which featured a specific case where git repack unambiguously did something good instead of being magic fairy dust that we sprinkle around to ward off demons.

In this case, a repository with several 400MB objects required approximately 10 minutes to git clone, consuming large amounts of CPU and memory. In ps auxww the culprit process was git pack-objects (pretty sure -- it has passed beyond my scroll buffer by now). A cp -R of the same working copy with the same source and destination disks took ~10 seconds, implying that this was not I/O limited. Interactively, the client stalled in remote: Compressing objects:.

This leads to a whole tangled jungle of repack + gc + core.bigFileThreshold and a number of other tuning options, and the rough conclusion that Git was very inefficiently delta-compressing gigantic files for a long time for no real benefit.

Practically, I ran git repack -a -d and the clone cost dropped to ~20s.

Although I'd like to be on far firmer ground here before we aggressively start repacking everything, a narrower intervention might be to select the repository backups which take the longest to run (or the most time per byte, even), repack a few, and see what falls out.
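Picking candidates by "time per byte" could look something like the sketch below. The input format (name, backup duration in seconds, size in bytes) and the repository names are assumptions for illustration; no real backup data is shown here.

```python
def rank_repack_candidates(backups, limit=3):
    """Rank repositories by backup seconds per byte, slowest first.

    `backups` is a list of (name, seconds, size_bytes) tuples. This shape
    is a hypothetical stand-in for whatever the backup logs record.
    """
    scored = [
        (seconds / size_bytes, name)
        for name, seconds, size_bytes in backups
        if size_bytes > 0  # skip empty repositories to avoid dividing by zero
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]


# Made-up numbers: "assets" is small but slow, so it ranks first.
candidates = rank_repack_candidates(
    [
        ("core", 120.0, 4_000_000_000),
        ("assets", 600.0, 2_000_000_000),
        ("docs", 5.0, 50_000_000),
        ("empty", 1.0, 0),
    ],
    limit=2,
)
```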

See PHI1613, where an install hit this warning (and resolved it by running git prune):

remote: warning: The last gc run reported the following. Please correct the root cause
remote: and remove gc.log.
remote: Automatic cleanup will not be performed until the file is removed.
remote:
remote: warning: There are too many unreachable loose objects; run 'git prune' to remove them.
remote:

The logic here appears to be that gc.auto is set to some value (by default: 6,700). If the number of loose objects exceeds this threshold (technically, if the number of loose objects in objects/17/ is more than 1/256th of this value), it triggers a repack (described in a source comment as git repack -d -l).
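A minimal Python sketch of this heuristic, as I read it, follows. The rounding (ceiling division), the strict "greater than" comparison, and the 38-hex-character filename filter (which assumes SHA-1 object names) are my assumptions about the details, not something confirmed in this thread.

```python
import os

HEX_CHARS = set("0123456789abcdef")


def auto_gc_bucket_threshold(gc_auto=6700):
    """1/256th of gc.auto, rounded up (assumed rounding)."""
    return (gc_auto + 255) // 256


def count_loose_in_bucket(git_dir, bucket="17"):
    """Count files in objects/<bucket>/ that look like loose objects."""
    bucket_dir = os.path.join(git_dir, "objects", bucket)
    try:
        names = os.listdir(bucket_dir)
    except FileNotFoundError:
        return 0
    # A SHA-1 loose object is stored as 2 hex chars (directory) + 38 (file).
    return sum(1 for n in names if len(n) == 38 and set(n) <= HEX_CHARS)


def too_many_loose_objects(git_dir, gc_auto=6700):
    """Estimate whether the sampled bucket exceeds the auto GC threshold."""
    if gc_auto <= 0:  # gc.auto = 0 disables automatic GC
        return False
    return count_loose_in_bucket(git_dir) > auto_gc_bucket_threshold(gc_auto)
```

With the default gc.auto of 6,700 the per-bucket threshold works out to 27, so under this reading only a few dozen loose objects in objects/17/ are enough to trip the check.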

I think that git repack will not repack unreachable objects. The manpage says:

This command is used to combine all objects that do not currently reside in a "pack", into a pack.

...which suggests loose objects ("all objects that do not currently reside in a pack") are packed, but the description of the -A option clarifies:

Unreachable objects are never intentionally added to a pack, even when repacking.

So git does a repack, trying to put the 6,700 loose objects into a (new/existing?) pack.

After this repack, it tests whether the number of loose objects still exceeds the gc.auto threshold. If it does, it raises this warning. The idea is that there are so many loose objects that the auto GC threshold is permanently exceeded, so the auto GC gets disabled.

This sort of makes sense, but isn't great. If my read is correct, it means that pushing enough objects to a repository between auto GC cycles (and then deleting the refs which point to them) can always permanently (?) wedge the auto GC. Naively, this is only 6,700 objects. If you're adversarial, you can push only objects with hashes beginning 17... and push just 27 objects!

It's not clear to me how prune interacts with the automatic GC, and it's possible that prune will unwedge the GC after two weeks (or whatever the prune threshold is). However, you could possibly push 27 objects post-dated into 2030 to wedge the GC for a decade -- I'm not sure exactly what date prune uses, but I assume it must be the commit date, since AFAIK objects have no "arrival in this repository" timestamp. I guess it could use actual filesystem timestamps?

It is filesystem timestamps:

reachable.c
	if (stat(path, &st) < 0) {

...

	add_recent_object(oid, st.st_mtime, data);

Okay, so maybe the GC unwedges after two weeks.
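To make the "filesystem timestamps" point concrete, here is a small sketch that lists loose objects whose mtime is older than a cutoff. It deliberately ignores reachability, which real pruning also considers, and the "at or before the cutoff" comparison is my assumption about the boundary case.

```python
import os
import time


def loose_objects_older_than(git_dir, max_age_seconds, now=None):
    """Return names of loose objects whose file mtime is at least
    `max_age_seconds` old.

    Sketch only: this looks at timestamps and nothing else. It does not
    check whether an object is reachable from any ref.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    objects_dir = os.path.join(git_dir, "objects")
    old = []
    for bucket in sorted(os.listdir(objects_dir)):
        bucket_dir = os.path.join(objects_dir, bucket)
        # Loose objects live in two-character directories; skip pack/, info/.
        if len(bucket) != 2 or not os.path.isdir(bucket_dir):
            continue
        for name in sorted(os.listdir(bucket_dir)):
            path = os.path.join(bucket_dir, name)
            if os.stat(path).st_mtime <= cutoff:
                old.append(bucket + name)
    return old
```

Because the check uses the file's mtime rather than any date stored in the object, a post-dated commit date would not by itself keep an object from aging out.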

This possibly suggests these pruning steps:

  • Disable gc.auto so adversarial pushes cannot wedge the GC.
  • Run git prune --expire <some explicit expiry policy>.
  • Run git repack -a -d [-l] [--window=X] [--depth=Y], possibly using the secret --unpack-unreachable=<explicit expiration policy> flag.
  • Possibly adjust pack.packSizeLimit or pass --max-pack-size. It's not clear how these options interact with -a offhand. The default is unlimited, but 50GB packfiles might be a bad idea.
  • Possibly adjust core.bigFileThreshold to limit attempts to delta-compress large binary files. The default is fairly large (512MB).

The --window=X and --depth=Y flags are normally adjusted by git gc --aggressive, and look like they correspond to spending more or less time/CPU trying to pack better.
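The steps listed above could be assembled into command lines roughly as follows. This sketch only builds argument lists and runs nothing; the function name, parameter names, and the default expiry of "2.weeks.ago" are assumptions for illustration, and whether these are the right flags to combine is exactly the open question here.

```python
def build_maintenance_commands(
    expire="2.weeks.ago",
    window=None,
    depth=None,
    max_pack_size=None,
    big_file_threshold=None,
    local=True,
):
    """Build argv lists for: disable auto GC, prune, then repack.

    Hypothetical helper. Each optional tuning value is passed through only
    if given, so Git's own defaults apply otherwise.
    """
    commands = [
        ["git", "config", "gc.auto", "0"],
        ["git", "prune", "--expire", expire],
    ]

    repack = ["git"]
    if big_file_threshold is not None:
        # One-off config override for this invocation only.
        repack += ["-c", "core.bigFileThreshold=%s" % big_file_threshold]
    repack += ["repack", "-a", "-d"]
    if local:
        repack.append("-l")
    if window is not None:
        repack.append("--window=%s" % window)
    if depth is not None:
        repack.append("--depth=%s" % depth)
    if max_pack_size is not None:
        repack.append("--max-pack-size=%s" % max_pack_size)
    commands.append(repack)
    return commands


commands = build_maintenance_commands(
    window=128, depth=32, max_pack_size="64M", big_file_threshold="32M"
)
```

Each list could then be handed to something like subprocess.run(cmd, cwd=repository_path, check=True), one repository at a time, ideally on a node that is currently shedding traffic.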

PHI1655 identifies a specific case where enormous packfiles may create problems:

  • The repository is unusually large (many GB) and contains large blobs (various application assets).
  • Downloading it over an imperfect network link through a VPN occasionally fails.
  • Git appears to delete the in-progress packfile if the link is disrupted.

Open questions:

  • Does using --max-pack-size to reduce the maximum packfile size really let Git "checkpoint" after each packfile, so the process is effectively resumable?
  • If you specify --max-pack-size 4MB and have a 5MB object in the repository (that is, an object exists which is too large to fit into a pack under the provided --max-pack-size rule), what happens? (If the answer is "you get one 5MB packfile", that's fine; if it's "the repack fails" that's less fine.)

Does using --max-pack-size to reduce the maximum packfile size really let Git "checkpoint" after each packfile, so the process is effectively resumable?

I suspect the answer to this is "no", and that network traffic in modern versions of the Git protocol is agnostic to packfile format on disk. Casual inspection of upload-pack.c provides some evidence that on-disk format and on-wire format are unrelated.

PHI1655 constructs a "resumable" clone out of bare commands (by running git fetch-pack in a loop) but I think it's implausible for non-experts to succeed with this workflow. A hypothetical arc clone could use this workflow with some server-side support, but this is a bit yikes.

Since many of these options probably don't have "right answers", I'm trying this reasonable-seeming variation on some repositories which seem like they'll benefit from a repack:

$ git -c core.bigFileThreshold=32M repack -a -d -l --max-pack-size 64M --window 128 --depth 32