Provide a write-free, non-locking maintenance window for repositories
Open, Normal, Public

Assigned To

Authored By

	epriestley
	Feb 19 2021, 4:27 PM

Description

See PHI1996. See T13111.

Large Git repositories often benefit from regularly (say, daily or weekly) running some maintenance commands, from the general family of git prune, git gc, git repack, git reflog expire, or similar. The particular problems which occur (and the best commands to run to remedy them) can vary from repository to repository.

PHI1996 reports an issue where writing to a repository while it is running one of these three commands:

$ git reflog expire --expire-unreachable=now --all
$ git gc --prune=now
$ git prune

...caused a ref to go missing. I'm currently unsure about the exact mechanism here, but Phabricator should support maintenance windows which guarantee:

the node will process no writes during the maintenance window; and
the node is not the only cluster leader, unless it is also the only cluster node; and
ideally, reads are routed to nodes under maintenance at reduced precedence. It's still better to serve a read from a node under maintenance than to fail to serve it. (If problems arise with reads during maintenance commands, these reads could block once read routing is precedence-aware.)
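
The routing guarantees above can be sketched as a pair of small selection functions. This is only an illustration under stated assumptions: the node arrays, the 'maintenance' flag, and both function names are hypothetical and are not existing Phabricator APIs.

```php
<?php

/**
 * Order candidate nodes for a read. Nodes that are not under maintenance come
 * first; nodes under maintenance are kept as a fallback rather than dropped,
 * since serving a read from them is better than failing the read.
 *
 * Each node is a hypothetical array: ['name' => string, 'maintenance' => bool].
 */
function order_nodes_for_read(array $nodes) {
  $preferred = array();
  $fallback = array();
  foreach ($nodes as $node) {
    if ($node['maintenance']) {
      $fallback[] = $node['name'];
    } else {
      $preferred[] = $node['name'];
    }
  }
  return array_merge($preferred, $fallback);
}

/**
 * Select nodes that may accept a write. A node under maintenance is never
 * eligible, so it processes no writes during the window.
 */
function select_nodes_for_write(array $nodes) {
  $names = array();
  foreach ($nodes as $node) {
    if (!$node['maintenance']) {
      $names[] = $node['name'];
    }
  }
  return $names;
}
```

With one node under maintenance and one healthy node, reads try the healthy node first and writes only ever see the healthy node.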

Note that repositories already have a bin/repository maintenance mode, but this is aimed at Phacility SAAS migrations, is repository-level rather than node-level, and just stops new writes without guaranteeing that in-flight writes have aborted. So this mechanism isn't really appropriate here, and its existence is mostly a reason to call the new mechanism something other than "maintenance" mode, to limit overloading of the term.

See PHI2004. When a repository node is writing backups, we don't need a lock, but it would be nice to be able to provide a hint to Phabricator that the node is temporarily less-desirable for routing purposes.

Revisions and Commits

rP Phabricator
	D21671	rP51cb7a3db9e8 Provide an ad-hoc maintenance lock for clustered repositories
	D21669	rPbdda7eed0734 Improve display behavior for write locks held by omnipotent users
	D21670	rP12a5eb406233 Allow maintenance scripts to write synthetic events to the push log that act as…

Related Objects

Mentioned In: 2021 Week 23 (Early June)
Mentioned Here: T13111: Periodically run `git prune` on Git working copies

Event Timeline

epriestley triaged this task as Normal priority. Feb 19 2021, 4:27 PM

epriestley created this task.

epriestley updated the task description. Feb 24 2021, 10:10 PM

A useful maintenance operation for staging area repositories is to remove out-of-date staging refs: old diffs which have already landed. This is of some particular importance for large installs, since Git has a significant per-ref overhead for many operations until protocol v2: by the time a repository has ~50K refs, interacting with it in basically any way has become slow and cumbersome.

This class of operation might be a useful maintenance operation in general: to prune old release branches, temporary branches, etc.
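
For the staging-area case, choosing which refs to remove might look like the sketch below. It assumes staging refs are named "refs/tags/phabricator/diff/<id>" and "refs/tags/phabricator/base/<id>", and that a list of landed diff IDs is already available from elsewhere; the function name and both inputs are hypothetical, so adjust the pattern to match the refs actually present in the repository.

```php
<?php

/**
 * Pick staging refs whose diff ID appears in the set of landed diffs.
 *
 * Assumption: staging refs look like "refs/tags/phabricator/diff/<id>" or
 * "refs/tags/phabricator/base/<id>". Anything else is left alone.
 */
function select_stale_staging_refs(array $refs, array $landed_diff_ids) {
  // Index landed IDs by key so each lookup is constant-time.
  $landed = array_fill_keys($landed_diff_ids, true);

  $pattern = '#^refs/tags/phabricator/(?:diff|base)/(\d+)$#';

  $stale = array();
  foreach ($refs as $ref) {
    $matches = null;
    if (!preg_match($pattern, $ref, $matches)) {
      // Not a staging ref: never schedule it for deletion.
      continue;
    }
    if (isset($landed[(int)$matches[1]])) {
      $stale[] = $ref;
    }
  }

  return $stale;
}
```

The returned list is the kind of input a bulk ref deletion would consume; branches and any staging refs for diffs that have not landed are skipped.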

The problem with this operation is that it's a "real" write, so it needs to acquire and hold the write lock, but it isn't a push. The desired underlying mechanism is git update-ref:

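// Build the stdin payload for "git update-ref --stdin -z". Each command is
// "delete <ref>" followed by NUL, an empty old value, and another NUL.
$schedule = array();
foreach ($refs as $ref) {
  $schedule[] = sprintf("delete %s\0\0", $ref);
}
$schedule = implode('', $schedule);

...

  // Feed every deletion to a single update-ref process on stdin.
  $repository->getLocalCommandFuture('update-ref --stdin -z')
    ->write($schedule)
    ->resolvex();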

This deletes refs very quickly (a few seconds even for tens of thousands of refs) and doesn't require a second working copy.

It's possible to git push file:///path/to/current/repository :ref/to/delete/1 ... (that is, delete refs by pushing from a working copy to the same working copy), but Phabricator won't accept non-SSH pushes and this can only delete as many refs at once as fit on a single command line. This is also fairly unintuitive, and I believe it is dramatically slower than update-ref (although this is from memory and I didn't re-measure it just now).

Beyond avoiding a push from a repository to itself, it would be nice to provide a way to do an ad-hoc maintenance write in general: for example, to fetch changes from a remote rather than push them. There are a handful of use cases, such as merging one repository into another or synchronizing from a large upstream, where ad-hoc server-side writes are desirable.

In any case, any server-side write that isn't a push doesn't write to the push log, so it doesn't bump the repository version, which is always just the MAX(id) in the push log table. Even if we're careful to acquire and hold the write lock, the repository version won't bump when we release the lock, so other nodes may accept a write and become leaders before whatever changes we made propagate. That later write will then overwrite any maintenance writes we performed.

To fix this, we can insert a synthetic "push" into the push log, reflecting that maintenance occurred. This will cause a version bump so writes will propagate, so the lock can look like:

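// Acquire the write lock before touching the working copy.
$cluster_engine->synchronizeWorkingCopyBeforeWrite();

// Placeholder: perform the maintenance writes (for example, ref deletions).
do_special_writes();

// Placeholder: record a synthetic push log event so the version increases.
artificially_bump_repository_version(...);

// Release the write lock now that the version reflects the maintenance write.
$cluster_engine->synchronizeWorkingCopyAfterWrite();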
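
To make the versioning argument concrete, here is a toy model of the push log. It is a sketch only: ToyPushLog and its methods are hypothetical stand-ins rather than Phabricator classes, and the only behavior taken from the discussion above is that the repository version is MAX(id) over push log rows.

```php
<?php

/**
 * Toy model of a push log. The repository version is MAX(id) over its rows,
 * so only inserting a row can increase the version.
 */
final class ToyPushLog {
  private $rows = array();
  private $nextID = 1;

  public function insertEvent($type) {
    $id = $this->nextID++;
    $this->rows[$id] = $type;
    return $id;
  }

  public function getVersion() {
    return $this->rows ? max(array_keys($this->rows)) : 0;
  }
}

$log = new ToyPushLog();
$log->insertEvent('push');
$before = $log->getVersion();

// A maintenance write that skips the push log leaves the version unchanged,
// so other nodes have no signal that they are out of date.
$after_silent_write = $log->getVersion();

// Recording a synthetic event bumps the version, as a real push would.
$log->insertEvent('maintenance');
$after_synthetic_event = $log->getVersion();
```

The version only moves once the synthetic event is recorded, which is the signal other nodes need before they treat their own copy as stale.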

A minor issue on the way to this is that calling synchronizeWorkingCopyBeforeWrite() with an omnipotent viewer will write to the WorkingCopyVersion table with a null userPHID, which shows as "Unknown Object" in the UI.

This operation should likely attribute itself to the "Diffusion" application, but the UI should probably also be less clumsy about missing user details.

epriestley added a revision: D21669: Improve display behavior for write locks held by omnipotent users. Jun 1 2021, 1:13 PM

epriestley added a revision: D21670: Allow maintenance scripts to write synthetic events to the push log that act as repository updates. Jun 1 2021, 1:50 PM

Since observed repositories are versioned differently today, this strategy won't work for them -- but I can't come up with any valid reason to ever put an observed repository into a "write maintenance" mode anyway. I do imagine making observed repositories "replay" fetches into the push log (as though they were pushes) in the future, but that still won't make "write maintenance" on an observed repository meaningful, so it seems fine to just prevent putting non-hosted repositories into this mode.

In an extreme case, like "the observed remote is down and you want to push updates", you could just turn the repository into a hosted repository until the remote comes back up.

epriestley added a revision: D21671: Provide an ad-hoc maintenance lock for clustered repositories. Jun 1 2021, 3:17 PM

epriestley added a commit: rP12a5eb406233: Allow maintenance scripts to write synthetic events to the push log that act as…. Jun 1 2021, 3:29 PM

epriestley added a commit: rPbdda7eed0734: Improve display behavior for write locks held by omnipotent users.

epriestley added a commit: rP51cb7a3db9e8: Provide an ad-hoc maintenance lock for clustered repositories.

epriestley mentioned this in 2021 Week 23 (Early June). Jun 5 2021, 6:14 PM
