@zeeg asked about this on IRC:
epriestley how are you planning to deal w/ mirroring commits
commit hooks and round robin? central master that pushes to all mirrors?
I don't have a highly specific concrete plan yet, but here are some general ideas.
- I want masters/slaves to be transparent to users. They should always push/pull from the same URL and get the same results.
- I want pulls to always reflect all changes pushed at the point when the pull started. That is, if you run git push and it exits, and then you run git pull, you should always get the changes you pushed. Similarly, git pull + git pull should never connect you to a repository which is behind the second time.
- In cases where installs can accept mirror latency, they should just use mirroring, which already exists. We could integrate this more tightly if there's a need for it (e.g., easy in-Phabricator mirror management), but I think the technical implementation is already complete and correct. Mirrors just get pushed to, don't support write operations from normal clients, and may be behind. We'll just ignore these for the purposes of replication.
- I want things to be self-healing without additional pushes, which generally means commit hooks cannot be the only replication trigger, as we don't have a way to fire them again.
- Replication should have the smallest impact on the runtime of git push that it can. That is, we don't want git push to cost O(N) in the number of replicas.
Here is how other systems work:
Gitolite
Gitolite supports master/slave setups:
It offers some level of transparency by proxying SSH requests, although I'm not sure if it supports putting multiple hosts behind a load-balanced domain name.
Gitolite does not offer a consistency guarantee:
> From v3.5.3 on, gitolite uses an asynchronous push to the slaves, so that the main push returns immediately, without waiting for the slave pushes to complete. Keep this in mind if you're writing scripts that do a push, and then read one of the slaves immediately -- you will need to add a few seconds of sleep in your script.
From the documentation, I'm not sure if replication is self-healing, but it's fairly moot without a consistency guarantee.
It looks like replication moved from O(N) to O(1) push cost in v3.
Gerrit
Gerrit only appears to support what we call mirroring, not real replication. There's no consistency guarantee, and replicas aren't writable (and don't look like they're intended to be).
WanDisco Git Multisite (??)
http://www.wandisco.com/git/multisite
I've never heard of this and have no clue how it works. It claims to offer all the properties one would expect, but is a super enterprisey mess and I don't know what it actually does under the hood.
GitHub Enterprise
No real support, I think?
Here are some components we can build to achieve replication. There are a few different approaches we can take, but they'd be based on these fundamentals:
- Logical clocks for repositories. Basically, every repository has a version which starts at 0 and increments when it gets pushed.
- Blocking pulls. When you pull from a host, it checks the master/largest logical clock for the repository and first catches up (pulls from an up-to-date node) if the local copy is behind. Then it processes your request.
- SSH forwarding. When you pull or push from a host, it checks the master/largest logical clock for the repository and forwards you to a host which is up to date.
- Global locks. When you try to push to a host, we acquire a global lock on the repository, do a blocking pull if necessary, and then process your request.
- Passive replication. Fully backgrounded replication which pulls copies with lagging logical clocks. This amortizes the replication cost toward 0 in most cases.
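As an illustration, here's a minimal sketch of how logical clocks and blocking pulls might fit together on the read path. Everything here is hypothetical: the names, the in-process `CLOCKS` store (a real install would keep versions in the shared database so every node reads the same values), and the `fetch` callback, which stands in for running `git fetch` against an up-to-date node.

```python
# Hypothetical in-process clock store. A real install would keep these
# versions in the shared database so every node reads the same values.
CLOCKS = {}  # repository name -> latest pushed version


def get_master_clock(repo):
    """Read the largest logical clock recorded for this repository."""
    return CLOCKS.get(repo, 0)


def bump_master_clock(repo):
    """Increment the repository's clock after a successful push."""
    CLOCKS[repo] = CLOCKS.get(repo, 0) + 1
    return CLOCKS[repo]


def blocking_pull(repo, local_clock, fetch):
    """Before serving a read, catch up if this replica is behind.

    `fetch` stands in for fetching from a node whose copy matches the
    master clock; after it completes, this replica is up to date.
    """
    master_clock = get_master_clock(repo)
    if local_clock < master_clock:
        fetch()
        local_clock = master_clock
    return local_clock
```

The point of the clock is that "up to date" becomes a cheap integer comparison, so replicas that are already current serve reads without doing any git work at all.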
My initial thinking is to do this:
- We build logical clocks.
- For pulls, we do blocking pulls.
- For pushes, I'm not sure if locks or forwarding are better. They seem about equal, with a mixture of advantages and disadvantages. I'm leaning toward locks, since every node can be writable.
- We do passive replication.
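The write path in this plan might look roughly like the sketch below. This is only an illustration under assumed names: `handle_push`, the `fetch`/`apply_push` callbacks, and the in-process lock and queue are all hypothetical, and in a real deployment the lock and clock would live in the shared database rather than in process memory.

```python
import threading

# Hypothetical shared state; a real deployment would keep the clock and
# the lock in the central database so all nodes agree on them.
CLOCKS = {}                 # repository name -> latest pushed version
LOCKS = {}                  # repository name -> global write lock
LOCKS_GUARD = threading.Lock()
REPLICATION_QUEUE = []      # (repo, version) work items for daemons


def repo_lock(repo):
    """Look up (or create) the global write lock for a repository."""
    with LOCKS_GUARD:
        return LOCKS.setdefault(repo, threading.Lock())


def handle_push(repo, local_clock, fetch, apply_push):
    """Accept a push on any node: lock, catch up, apply, bump the clock.

    `fetch` and `apply_push` are placeholders for fetching from an
    up-to-date node and receiving the client's pack data.
    """
    with repo_lock(repo):
        master_clock = CLOCKS.get(repo, 0)
        if local_clock < master_clock:
            # Blocking pull: this node was behind, so catch up before
            # accepting the write.
            fetch()
        apply_push()
        CLOCKS[repo] = master_clock + 1
    # Passive replication: lagging replicas catch up in the background,
    # so the push itself never pays an O(N) fan-out cost.
    REPLICATION_QUEUE.append((repo, CLOCKS[repo]))
    return CLOCKS[repo]
```

Because the clock only advances under the lock, a push that lands on a stale node pays at most one catch-up fetch, and the fan-out to other replicas happens entirely in the background.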
This gives us all the desired properties, fairly easy administration, and no real bad thundering herd cases. We can also build almost all of this stuff very gradually, and run at least some of it meaningfully even on non-replicated repositories.
I think the worst case is that pushes may cost a pull plus a push if you beat passive replication and happen to hit a different master. This doesn't seem like a big deal. If we bump into issues, we can do SSH forwarding to masters instead. I think either approach could easily be faster on average, though, depending on where things are geographically and the size and frequency of pushes.
We could also look at SSH forwarding for pulls, but that can create a thundering herd immediately after a push of a large commit. If someone pushes 1GB of dumb changes, I'd much rather make everyone wait than kill the master (doubly so if we can print "waiting for alincoln's dumb huge change to replicate").
Generally, this is a much easier problem than, say, database replication: it's completely fine to have an average lock overhead of something like 50ms and almost arbitrarily long worst cases (the worst case is waiting for a huge push to replicate), and we have a very small number of mutable objects which we can think of as append-only. None of those relaxations would be acceptable for a database.