Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to:
- demote all the devices;
- disable the bindings;
- bind the new servers;
- put whatever working copies you can scrape up back on disk;
- promote one of the new servers.
However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead:
- Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them.
- Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order.
- Make the "cluster already has leaders" message more clear.
- Make the documentation more clear.