
Move toward multi-master replicated repositories
Closed, Public

Authored by epriestley on Apr 12 2016, 12:22 PM.

Details

Summary

Ref T4292. This mostly implements the locking/versioning logic for multi-master repositories. It is only active on Git SSH pathways, and doesn't actually do anything useful yet: it just does bookkeeping so far.

When we read (e.g., git fetch) the logic goes like this:

  • Get the read lock (unique to device + repository).
    • Read all the versions of the repository on every other device.
    • If any node has a newer version:
      • Fetch the newer version.
      • Increment our version to be the same as the version we fetched.
  • Release the read lock.
  • Actually do the fetch.

This makes sure that any time you do a read, you always read the most recently acknowledged write. You may have to wait for an internal fetch to happen (this isn't actually implemented yet) but the operation will always work like you expect it to.
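
The read steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from this revision: `VersionTable`, `synchronized_read`, `fetch_from`, and `serve_fetch` are hypothetical names, and a plain dictionary stands in for the version records kept in the database.

```python
import threading
from collections import defaultdict


class VersionTable:
    """Hypothetical in-memory stand-in for one repository's per-device versions."""

    def __init__(self, versions):
        self.versions = dict(versions)  # device name -> version number
        # One lock per device; the table models a single repository, so each
        # lock is effectively unique to device + repository.
        self._read_locks = defaultdict(threading.Lock)

    def read_lock(self, device):
        return self._read_locks[device]


def synchronized_read(table, device, fetch_from, serve_fetch):
    """Bring `device` up to date under its read lock, then serve the fetch."""
    with table.read_lock(device):
        ours = table.versions.get(device, 0)
        # Read the versions recorded for every other device.
        others = {d: v for d, v in table.versions.items() if d != device}
        if others:
            newest_device = max(others, key=others.get)
            if others[newest_device] > ours:
                fetch_from(newest_device)  # internal fetch from the newer node
                table.versions[device] = others[newest_device]
    # The read lock is released before the client's fetch is served.
    return serve_fetch()
```

For example, with `VersionTable({"repo001": 3, "repo002": 5})`, a read on `repo001` would first pull from `repo002`, record version 5 for `repo001`, and only then serve the client.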

When we write (e.g., git push) the logic goes like this:

  • Get the write lock (unique to the repository).
    • Do all the read steps so we're up to date.
    • Mark a write pending.
      • Do the actual write.
    • Bump our version and mark our write finished.
  • Release the write lock.

This allows you to write to any replica. Again, you might have to wait for a fetch first, but everything will work like you expect.
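
As a rough sketch of the write steps above, under the same caveat that this is an in-memory illustration and not the code from this revision: `RepositoryCluster`, `synchronized_write`, `sync_device`, and `do_write` are hypothetical names, and a single `threading.Lock` stands in for the repository-wide write lock.

```python
import threading


class RepositoryCluster:
    """Hypothetical in-memory model of one repository replicated across devices."""

    def __init__(self, devices):
        self.write_lock = threading.Lock()  # unique to the repository
        self.versions = {device: 0 for device in devices}
        self.is_writing = False


def synchronized_write(cluster, device, sync_device, do_write):
    """Run one write on `device` while holding the repository write lock."""
    with cluster.write_lock:
        # Read steps: catch up to the newest version any device has recorded.
        newest = max(cluster.versions.values())
        if newest > cluster.versions[device]:
            sync_device(device)
            cluster.versions[device] = newest
        cluster.is_writing = True  # mark a write pending
        do_write()  # the actual write (e.g., receiving the push)
        # Bump our version and mark the write finished.
        cluster.versions[device] += 1
        cluster.is_writing = False
        return cluster.versions[device]
```

In this sketch, if `do_write` raises, the pending flag is deliberately left set, so the interrupted write stays visible instead of being silently cleared.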

There's one notable failure mode here: if the network connection between the repository node and the database fails during the write, the write lock might be released even though a write is ongoing.

The "isWriting" column protects against that by staying locked if we lose our connection to the database. This will currently "freeze" the repository (prevent any new writes) until an administrator can sort things out, since it'd be dangerous to continue doing writes (we may lose data).

(Since we won't actually acknowledge the write, I think, we could probably smooth this out a bit and make it self-healing most of the time: basically, have the broken node rewind itself by updating from another good node. But that's a little more complex.)
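
To illustrate the freeze behavior, here is a minimal sketch of a write flag that outlives the lock. It assumes the flag lives in a durable record that is only cleared when a write finishes cleanly; `VersionRecord`, `begin_write`, `finish_write`, and `RepositoryFrozenError` are hypothetical names, not identifiers from this revision.

```python
class RepositoryFrozenError(Exception):
    """Raised when an earlier write was marked pending but never finished."""


class VersionRecord:
    """Hypothetical stand-in for the stored version row of one repository."""

    def __init__(self):
        self.version = 0
        self.is_writing = False


def begin_write(record):
    # A flag left set means a previous write may have been interrupted,
    # so refuse new writes instead of risking data loss.
    if record.is_writing:
        raise RepositoryFrozenError("an earlier write never finished")
    record.is_writing = True


def finish_write(record):
    # Only a cleanly finished write bumps the version and clears the flag.
    record.version += 1
    record.is_writing = False
```

If a writer calls `begin_write` and then loses its database connection, `finish_write` never runs. The lock may be gone, but the next `begin_write` raises `RepositoryFrozenError` until someone clears the flag by hand.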

Test Plan
  • Pushed changes to a cluster-mode repository.
  • Viewed web interface, saw "writing" flag and version changes.
  • Pulled changes.
  • Faked various failures, got sensible states.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to Move toward multi-master replicated repositories.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.

Here is a thrilling photo showing a number and also a little icon:

Screen Shot 2016-04-12 at 5.31.36 AM.png (667×1 px, 113 KB)

chad edited edge metadata.
This revision is now accepted and ready to land. Apr 12 2016, 3:07 PM
This revision was automatically updated to reflect the committed changes.