Page MenuHomePhabricator

Make "bin/repository thaw" workflow more clear when devices are disabled
ClosedPublic

Authored by epriestley on Nov 7 2018, 6:26 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Dec 20, 10:33 PM
Unknown Object (File)
Mon, Dec 16, 4:24 PM
Unknown Object (File)
Oct 22 2024, 5:50 AM
Unknown Object (File)
Oct 18 2024, 6:51 AM
Unknown Object (File)
Oct 9 2024, 10:33 AM
Unknown Object (File)
Sep 15 2024, 3:05 AM
Unknown Object (File)
Sep 3 2024, 12:03 AM
Unknown Object (File)
Sep 3 2024, 12:03 AM
Subscribers
None

Details

Summary

Ref T13216. See PHI943. If autoscale lightning strikes all your servers at once and destroys them, the path to recovery can be unclear. You're "supposed" to:

  • demote all the devices;
  • disable the bindings;
  • bind the new servers;
  • put whatever working copies you can scrape up back on disk;
  • promote one of the new servers.

However, the documentation is a bit misleading (it was sort of written with "you lost one or two devices" in mind, not "you lost every device") and demote-before-disable is unnecessary and slightly risky if servers come back online. There's also a missing guardrail before the promote step which lets you accidentally skip the demotion step and end up in a confusing state. Instead:

  • Add a guard rail: when you try to promote a new server, warn if inactive devices still have versions and tell the user to demote them.
  • Allow demotion of inactive devices: the order "disable, demote" is safer and more intuitive than "demote, disable" and there's no reason to require the unintuitive order.
  • Make the "cluster already has leaders" message more clear.
  • Make the documentation more clear.
Test Plan
  • Bound a repository to two devices.
  • Wrote to A to make it a leader, then disabled it (simulating a lightning strike).
  • Tried to promote B. Got a new, useful error ("demote A first").
  • Demoted A (before: error about demoting inactive devices; now: works fine).
  • Promoted B. This worked.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

amckinley added inline comments.
src/docs/user/cluster/cluster_repositories.diviner
432

Maybe "you will lose unreplicated data"? Only a Sith deals in absolutes.

435–438

I'm not sure I understand the use case for this. Isn't the sequence ("demote", "disable") strictly worse than ("disable", "demote")?

This revision is now accepted and ready to land.Nov 9 2018, 6:36 PM
epriestley added inline comments.
src/docs/user/cluster/cluster_repositories.diviner
435–438

In this case, you don't disable at all: "we know what's wrong and the host is going to be back online in a few minutes, so I don't want to bother disabling and re-enabling later". This is pretty marginal and I tried to hedge it with the line below ("...safer to disable the device..."). I should perhaps remove this block completely (and partly kept it just because it was sort of already there), but there's also maybe some value if you accidentally demoted without disabling and are freaking out now? I'll give it a little more thought.

Maybe also like "someone pushed a 200GB commit and it's having trouble replicating". Also pretty marginal.

src/docs/user/cluster/cluster_repositories.diviner
435–438

Yeah I'd probably just remove this block entirely, but I also don't think it's really hurting anyone.

  • Just remove the marginal "demote without disable" workflow from this case.
This revision was automatically updated to reflect the committed changes.