Page MenuHomePhabricator

Make "bin/repository thaw" workflow more clear when devices are disabled

Authored by epriestley on Nov 7 2018, 6:26 PM.
Referenced Files
F13242409: D19793.diff
Thu, May 23, 2:44 AM
Tue, May 21, 8:15 PM
F13238581: D19793.id47261.diff
Tue, May 21, 8:09 PM
F13238405: D19793.id47278.diff
Tue, May 21, 7:12 PM
F13222672: D19793.diff
Sun, May 19, 3:51 AM
F13178997: D19793.diff
Wed, May 8, 8:51 PM
Unknown Object (File)
Sun, May 5, 9:05 PM
Unknown Object (File)
Sat, May 4, 6:16 PM



Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to:

  • demote all the devices;
  • disable the bindings;
  • bind the new servers;
  • put whatever working copies you can scrape up back on disk;
  • promote one of the new servers.

However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead:

  • Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them.
  • Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order.
  • Make the "cluster already has leaders" message more clear.
  • Make the documentation more clear.
Test Plan
  • Bound a repository to two devices.
  • Wrote to A to make it a leader, then disabled it (simulating a lightning strike).
  • Tried to promote B. Got a new, useful error ("demote A first").
  • Demoted A (before: error about demoting inactive devices; now: works fine).
  • Promoted B. This worked.

Diff Detail

rP Phabricator
Lint Passed
Tests Passed
Build Status
Buildable 21109
Build 28696: Run Core Tests
Build 28695: arc lint + arc unit

Event Timeline

amckinley added inline comments.

Maybe "you will lose unreplicated data"? Only a Sith deals in absolutes.


I'm not sure I understand the use case for this. Isn't the sequence ("demote", "disable") strictly worse than ("disable", "demote")?

This revision is now accepted and ready to land.Nov 9 2018, 6:36 PM
epriestley added inline comments.

In this case, you don't disable at all: "we know what's wrong and the host is going to be back online in a few minutes, so I don't want to bother disabling and re-enabling later". This is pretty marginal and I tried to hedge it with the line below ("...safer to disable the device..."). I should perhaps remove this block completely (and partly kept it just because it was sort of already there), but there's also maybe some value if you accidentally demoted without disabling and are freaking out now? I'll give it a little more thought.

Maybe also like "someone pushed a 200GB commit and it's having trouble replicating". Also pretty marginal.


Yeah I'd probably just remove this block entirely, but I also don't think it's really hurting anyone.

  • Just remove the marginal "demote without disable" workflow from this case.
This revision was automatically updated to reflect the committed changes.