
Periodically run `git prune` on Git working copies
Open, Normal, Public

Description

See PHI497. An instance ran into an issue where the remote was complaining about too many unreachable objects:

remote: warning: There are too many unreachable loose objects; run 'git prune' to remove them.

We don't currently run git prune or git gc automatically, since skipping them has never caused issues before, but we should probably start running git gc on working copies every so often.

One possible issue with this is that git gc can take a very long time to complete on large working copies. Previously, in PHI386, an unusually large repository for the same instance took about 5 hours to git gc. However, it took less than a minute to git prune, so git prune may be safer to run regularly.
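
For reference, a rough sketch of the commands under discussion; the expiry value below is illustrative rather than a recommendation:

```
# Fast path: delete only unreachable loose objects older than the expiry.
git prune --expire=2.weeks.ago

# Heavyweight path: full repack, reflog expiry, and pruning; this is the
# operation that took ~5 hours on the PHI386 repository.
git gc

# Middle ground: only does work when Git's own heuristics say it is needed.
git gc --auto
```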

A possible workaround is to run these operations only in the cluster, as part of bin/remote optimize or some similar workflow.
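
As a very rough sketch of what a cluster-only pass could look like (the path layout and expiry value are assumptions, and this is not the actual bin/remote optimize implementation):

```
#!/bin/sh
# Hypothetical maintenance pass over every working copy on a repository host.
# /var/repo/* is an assumed layout; adjust to wherever working copies live.
for repo in /var/repo/*; do
  [ -d "$repo" ] || continue
  git -C "$repo" prune --expire=2.weeks.ago
done
```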

Event Timeline

epriestley triaged this task as Normal priority. Mar 22 2018, 4:56 PM
epriestley created this task.

> Previously, in PHI386, an unusually large repository for the same instance took about 5 hours to git gc.

Is that possibly just because git gc had not been run recently beforehand, so regular executions wouldn't accumulate so much badness?
If git prune was run afterward, I'd be wary that the prior git gc was the reason for its speed.

That's a reasonable point. The details of PHI386 weren't very indicative one way or another since I didn't end up GC'ing the repository multiple times. I suppose I can go GC it again during the deployment window this week and see how long it takes.
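
One way to make that comparison concrete (a suggestion, not an established workflow here) is to snapshot git count-objects -v before and after, since it reports loose-object and garbage counts directly:

```
# Snapshot object counts before the gc; "count" and "size" describe loose
# objects, "garbage" covers files in the object database Git can't identify.
git count-objects -v

# Time the collection itself.
time git gc

# Compare afterwards; a large drop in "count" suggests the earlier slow gc
# was mostly paying down accumulated loose objects.
git count-objects -v
```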

alexmv added a subscriber: alexmv. Apr 4 2018, 6:17 PM
aubort added a subscriber: aubort. Aug 9 2018, 8:29 AM
epriestley moved this task from Backlog to Do Eventually on the Phacility board. Aug 10 2018, 6:06 PM

PHI860 is a close variant of this and discusses periodically running git repack. Most concerns around git repack are likely similar to concerns around gc and prune.

When a clustered repository node is repacking, collecting, or pruning, it may make sense for it to mark itself as "lower priority" for reads and writes. See also T10884. This would let it stay in the cluster, but shed most traffic until the operation completes.

See PHI1367, which featured a specific case where git repack unambiguously did something good instead of being magic fairy dust that we sprinkle around to ward off demons.

In this case, a repository with several 400MB objects required approximately 10 minutes to git clone, consuming large amounts of CPU and memory. In ps auxww the culprit process was git pack-objects (pretty sure -- it has passed beyond my scroll buffer by now). A cp -R of the same working copy with the same source and destination disks took ~10 seconds, implying that this was not I/O limited. Interactively, the client stalled in remote: Compressing objects:.
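
For anyone reproducing this kind of diagnosis, the measurements above roughly correspond to the following (the clone URI and paths are placeholders):

```
# Wall-clock the clone; in the slow case this took ~10 minutes.
time git clone ssh://host/diffusion/X/repo.git /tmp/clone-test

# On the serving node, confirm where the time is going.
ps auxww | grep 'git pack-objects'

# Compare against a raw filesystem copy to rule out I/O as the bottleneck.
time cp -R /var/repo/X /tmp/copy-test
```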

This leads to a whole tangled jungle of repack + gc + core.bigFileThreshold and a number of other tuning options, and the rough conclusion that Git was very inefficiently delta-compressing gigantic files for a long time for no real benefit.
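
As a concrete (but unverified-for-this-repository) example of the kind of tuning involved, core.bigFileThreshold tells Git to stop delta-compressing objects above a size threshold:

```
# Objects larger than this threshold are stored whole instead of being
# delta-compressed. The 100m value is illustrative; Git's default is 512m.
git config core.bigFileThreshold 100m

# Existing packs only pick the new threshold up on the next repack.
git repack -a -d
```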

Practically, I ran git repack -a -d and the clone cost dropped to ~20s.

Although I'd like to be on far firmer ground here before we aggressively start repacking everything, a narrower intervention might be to select the repository backups which take the longest to run (or the most time per byte, even), repack a few of them, and see what falls out.
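
A hedged sketch of that narrower intervention (the path, the count of five, and the use of on-disk size as a proxy for backup time are all assumptions):

```
#!/bin/bash
# Hypothetical: pick the largest working copies by on-disk size, repack each,
# and keep the timings so we can see what falls out.
du -sk /var/repo/* | sort -rn | head -5 | while read size repo; do
  echo "Repacking $repo (${size} KiB)"
  ( cd "$repo" && time git repack -a -d )
done
```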