
Implement repository replication
Closed, Resolved (Public)

Description

@zeeg asked about this on IRC:

epriestley how are you planning to deal w/ mirroring commits
commit hooks and round robin? central master that pushes to all mirrors?

I don't have a highly specific concrete plan yet, but here are some general ideas.

  • I want masters/slaves to be transparent to users. They should always push/pull from the same URL and get the same results.
  • I want pulls to always reflect all changes pushed at the point when the pull started. That is, if you run git push and it exits, and then you run git pull, you should always get the changes you pushed. Similarly, git pull + git pull should never connect you to a repository which is behind the second time.
    • In cases where installs can accept mirror latency, they should just use mirroring, which already exists. We could bring this more onboard if there's need for it (i.e., easy in-Phabricator mirrors) but I think the technical implementation is already complete and correct. Mirrors just get pushed to, don't support write operations from normal clients, and may be behind. We'll just ignore these for the purposes of replication.
  • I want things to be self-healing without additional pushes, which generally means commit hooks cannot be the only replication trigger, as we don't have a way to fire them again.
  • Replication should have the smallest impact on the runtime of git push that it can. That is, we don't want git push to cost O(N) in the number of replicas.

Here is how other systems work:

Gitolite

Gitolite supports master/slave setups:

It offers some level of transparency by proxying SSH requests, although I'm not sure if it supports putting multiple hosts behind a load-balanced domain name.

Gitolite does not offer a consistency guarantee:

From v3.5.3 on, gitolite uses an asynchronous push to the slaves, so that the main push returns immediately, without waiting for the slave pushes to complete. Keep this in mind if you're writing scripts that do a push, and then read one of the slaves immediately -- you will need to add a few seconds of sleep in your script.

From the documentation, I'm not sure if replication is self-healing, but it's fairly moot without a consistency guarantee.

It looks like it moved from O(N) to O(1) costs in v3.

Gerrit

Gerrit appears to support only what we call mirroring, not real replication. There is no consistency guarantee, and replicas aren't writable and don't appear to be.

WanDisco Git Multisite (??)

http://www.wandisco.com/git/multisite

I've never heard of this and have no clue how it works. It claims to offer all the properties one would expect, but is a super enterprisey mess and I don't know what it actually does under the hood.

GitHub Enterprise

No real support, I think?


Here are some components we can build to achieve replication. There are a few different approaches we can take, but they'd be based on these fundamentals:

  1. Logical clocks for repositories. Basically, every repository has a version which starts at 0 and increments when it gets pushed.
  2. Blocking pulls. When you pull from a host, it checks the master/largest logical clock for the repository and does a pull if it's behind. Then it processes your request.
  3. SSH forwarding. When you pull or push from a host, it checks the master/largest logical clock for the repository and forwards you to a host which is up to date.
  4. Global locks. When you try to push to a host, we acquire a global lock on the repository, do a blocking pull if necessary, and then process your request.
  5. Passive replication. Fully backgrounded replication which pulls copies with lagging logical clocks. This amortizes the replication cost toward 0 in most cases.
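
As a rough illustration of components 1 and 2 above, here is a minimal in-memory sketch of a logical clock plus a blocking pull. The `Replica` class and the `newest` and `blocking_pull` helpers are hypothetical names invented for this example, not Phabricator code, and copying a Python list stands in for an actual fetch between hosts.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One host's copy of a repository (hypothetical in-memory model)."""
    name: str
    version: int = 0  # logical clock: starts at 0, increments on each push
    commits: list = field(default_factory=list)


def newest(replicas):
    """Return the replica holding the largest logical clock."""
    return max(replicas, key=lambda r: r.version)


def blocking_pull(replica, replicas):
    """Bring `replica` up to date, then serve the read from it."""
    leader = newest(replicas)
    if replica.version < leader.version:
        # Behind: copy state from the most up-to-date replica first.
        replica.commits = list(leader.commits)
        replica.version = leader.version
    return list(replica.commits)


a = Replica("a")
b = Replica("b")
cluster = [a, b]

# Simulate a push landing on `a`: the clock increments by one.
a.commits.append("c1")
a.version += 1

# A pull served by `b` syncs from `a` before answering.
served = blocking_pull(b, cluster)
```

In this model a pull only pays the synchronization cost when the serving host is actually behind, which is the property the blocking-pull idea relies on.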

My initial thinking is to do this:

  • We build logical clocks.
  • For pulls, we do blocking pulls.
  • For pushes, I'm not sure if locks or forwarding are better. They seem about equal, with a mixture of advantages and disadvantages. I'm leaning toward locks, since every node can be writable.
  • We do passive replication.
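
The plan above can be sketched end to end under some loud assumptions: a single `threading.Lock` stands in for a cluster-wide lock on one repository, list copies stand in for fetches, and the `Cluster` and `Replica` classes are hypothetical names made up for this example. The sketch takes the lock only around pushes; real coordination for pulls and background replication is not modeled.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    version: int = 0  # logical clock for this copy
    commits: list = field(default_factory=list)


class Cluster:
    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        # Assumption: stands in for a global lock on one repository.
        self.lock = threading.Lock()

    def _sync(self, replica):
        """Catch `replica` up to the largest clock; report whether it was behind."""
        leader = max(self.replicas.values(), key=lambda r: r.version)
        if replica.version < leader.version:
            replica.commits = list(leader.commits)
            replica.version = leader.version
            return True
        return False

    def push(self, name, commit):
        """Lock, do a blocking pull if necessary, then accept the write."""
        replica = self.replicas[name]
        with self.lock:
            pulled = self._sync(replica)
            replica.commits.append(commit)
            replica.version += 1
        return pulled

    def pull(self, name):
        """Blocking pull: sync first, then serve the read."""
        replica = self.replicas[name]
        self._sync(replica)
        return list(replica.commits)

    def replicate_passively(self):
        """One background pass that catches up every lagging replica."""
        for replica in self.replicas.values():
            self._sync(replica)


cluster = Cluster(["a", "b"])
first = cluster.push("a", "c1")   # nothing to catch up on
second = cluster.push("b", "c2")  # "b" is behind, so this push also pays for a pull
cluster.replicate_passively()     # background pass catches "a" up
third = cluster.push("a", "c3")   # already current, so no pull is needed
```

Every node accepts writes here, and a push only pays for an extra pull when it reaches a lagging host before the background pass does.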

This gives us all the desired properties, fairly easy administration, and no really bad thundering herd cases. We can also build almost all of this stuff very gradually, and run at least some of it meaningfully even on non-replicated repositories.

I think the worst case is that pushes may cost a pull plus a push if you beat passive replication and happen to hit a different master. This doesn't seem like a big deal. If we bump into issues, we can do SSH forwarding to masters instead. I think either approach could easily be faster on average, though, depending on where things are geographically and the size and frequency of pushes.

We could also look at SSH forwarding for pulls, but that can create a thundering herd immediately after a push of a large commit. If someone pushes 1GB of dumb changes, I'd much rather make everyone wait than kill the master (doubly so if we can print "waiting for alincoln's dumb huge change to replicate").

Generally, this is a much easier problem than, say, database replication. It's completely fine to have average lock overhead of like 50ms and almost arbitrarily long worst cases (the worst case is where we wait for a huge push to replicate), and we have a very small number of mutable objects which we can think of as append-only. None of that would be OK with a database.

Revisions and Commits

rP Phabricator
D15986
D15903
D15798
D15795
D15786
D15783
D15772
D15761
D15759
D15758
D15757
D15755
D15754
D15752
D15748
D15747
D15688
D15685
D15683

Event Timeline

Is there any way to replicate refs such as refs/changes/ from Gerrit to Phabricator?

That's outside the scope of this task and an implementation detail for us at WMF, to be honest.

Per IRC, for posterity: it's a combination of implementation detail for our fetches at WMF, as well as seeing this done: T6878: Tagged commits which are not ancestors of any branch head don't get imported

eadler added a project: Restricted Project. (Jan 8 2016, 11:09 PM)
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

I'm guessing this is still years (or decades) away on the roadmap, but we want to be able to run build agents in the us-west-2 region and our Phabricator cluster currently resides in ap-southeast-2. Having a way of replicating repositories across AWS regions (by setting up Phabricator cluster instances in us-west-2 and having the repositories replicate over to them) would be very useful in terms of reducing our git clone times.

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. (Apr 7 2016, 6:35 PM)

Some additional cases with this:

  • When choosing a device to proxy to while serving Diffusion HTTP requests, we should try to proxy to (or even require?) an up-to-date device (some thundering herd risk? But these requests are usually small/infrequent/easy to serve).
  • The PullLocal daemon needs to start treating version clocks as being similar to the NEEDS_UPDATE flag.
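
The first case above, choosing an up-to-date device to proxy to, could look roughly like the following. This is only a sketch: the `up_to_date_devices` and `choose_proxy_target` functions and the device names are hypothetical, and the version map is assumed to come from wherever the cluster records each device's repository version.

```python
import random


def up_to_date_devices(versions):
    """Return the names of devices whose version equals the largest one.

    `versions` maps a device name to the repository version on that device.
    """
    if not versions:
        return []
    newest = max(versions.values())
    return sorted(name for name, v in versions.items() if v == newest)


def choose_proxy_target(versions, rng=random):
    """Pick one up-to-date device at random to spread read load."""
    candidates = up_to_date_devices(versions)
    if not candidates:
        raise LookupError("no devices available")
    return rng.choice(candidates)


# Hypothetical device names and versions.
versions = {"repo001": 7, "repo002": 7, "repo003": 5}
candidates = up_to_date_devices(versions)
target = choose_proxy_target(versions)
```

Picking at random among the current devices is one simple way to limit the thundering herd risk mentioned above; requiring an up-to-date device, as opposed to merely preferring one, is what this sketch does.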

The diffusion.querycommits method needs to sync-before-read (at least, if bypassCache is provided?) but currently does not. This can lead to tasks failing on the daemon on an un-synchronized node. Things self-heal, but it would be nice to prevent this.

T10748 is moving into production, which is the last major new piece here. Remaining cleanup work I plan to do in this phase:

T10751 has additional discussion about followups, and will eventually spawn tasks covering future work.

T10940 should be resolved now, and D15903 should resolve lock granularity.

I'm going to chew on observed repository versioning. I don't currently have a simple, elegant plan for it, but imagine one may come to me in a dream. If nothing does, I have some reasonable but inelegant approaches we can pursue.

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. (May 13 2016, 9:38 PM)

Strategy I'm pursuing (for Git) is:

  • The "version" of an observed repository is the largest internal commit ID of any of the active refs (branch heads and tags) in the repository.
  • Commit discovery is topological, so in normal cases this is always a reasonable logical clock.
  • This clock may regress if you publish a branch, then delete it.
    • We won't actually wind the clock backward, just keep it at the high water mark.
    • This probably doesn't cause any real problems.

The deletion case means that branch deletion will not actively propagate in the cluster until the next push. It can still propagate passively. It's generally fine for a commit we don't expect to exist to actually exist: this is normal because git doesn't GC commits for a while anyway.
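
The high-water-mark behavior can be sketched as follows. The `observed_version` function and the commit IDs are hypothetical and only illustrate the rule described above: take the largest internal commit ID among active refs, but never move the stored version backward.

```python
def observed_version(stored_version, ref_commit_ids):
    """Compute the version of an observed repository.

    `ref_commit_ids` maps each active ref (branch head or tag) to the
    internal commit ID of the commit it points at.
    """
    current = max(ref_commit_ids.values(), default=0)
    # Never wind the clock backward: keep the high water mark.
    return max(stored_version, current)


version = 0
version = observed_version(version, {"refs/heads/master": 10})
# Publishing a new branch with a newer commit advances the clock to 14.
version = observed_version(
    version, {"refs/heads/master": 10, "refs/heads/topic": 14}
)
# Deleting that branch leaves the clock at 14 instead of regressing to 10.
after_delete = observed_version(version, {"refs/heads/master": 10})
```

Because the version does not change on deletion, nothing in this model signals other nodes that a branch went away, which matches the note that deletions only propagate passively or on the next push.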

There may still be some potential situations where branches appear and disappear in the UI if you load Diffusion multiple times. I expect these will be so rare and unconcerning that no one will ever notice.

We could eventually move to putting a logical clock on ref changes, which is more like pretending each fetch from the remote is a push to us (we could even write synthetic push logs). This would allow us to increment the version on branch deletion, but is a larger and more complicated change which is more difficult to implement, understand, and administer, and currently crosses process and lock boundaries.

That last part, which I just landed, hasn't been vetted in production for very long yet, but I think this all works now.

I think the only major known limitation is that there's no Mercurial support. This is likely easy to provide later, but we don't have any installs that are interested yet.

From here, there are many improvements we could make (like T10883), and I'm sure some bugs and such will turn up. See T10751 and followups for discussion.