Make "bin/repository thaw" workflow more clear when devices are disabled


Make "bin/repository thaw" workflow more clear when devices are disabled

Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to:

  • demote all the devices;
  • disable the bindings;
  • bind the new servers;
  • put whatever working copies you can scrape up back on disk;
  • promote one of the new servers.

However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead:

  • Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them.
  • Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order.
  • Make the "cluster already has leaders" message more clear.
  • Make the documentation more clear.

Test Plan:

  • Bound a repository to two devices.
  • Wrote to A to make it a leader, then disabled it (simulating a lightning strike).
  • Tried to promote B. Got a new, useful error ("demote A first").
  • Demoted A (before: error about demoting inactive devices; now: works fine).
  • Promoted B. This worked.

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T13216

Differential Revision: https://secure.phabricator.com/D19793