
When a Phacility "rbak" device does not exist, backups can run twice and converge to a "successful" but inconsistent state
Closed, Wontfix (Public)

Description

See PHI2004. This is mostly a clerical issue:

  • If an rbak device does not exist, backups on the host will fail when they try to record that backups were created or pruned.
  • Backups will then retry and succeed, since they see the backup exists on disk (so they don't try to create it again) or no longer exists (so they don't try to prune it again). Since the disk state is already correct, they conclude no bookkeeping is required and exit successfully.

The net effect is that we get an out-of-date view in the web record of backups, but everything ultimately works fine.
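
The sequence above can be reproduced with a small simulation. This is only a sketch of the behavior described in this task, not the actual backup code: `run_backup`, `run_with_retry`, `AlmanacError`, and the `disk` / `records` sets are all hypothetical names standing in for the on-disk backups and the web record.

```python
class AlmanacError(Exception):
    """Hypothetical error raised when the bookkeeping call fails."""


def run_backup(name, disk, records, device_exists):
    """One backup attempt. Returns True on success, raises on bookkeeping failure."""
    if name in disk:
        # The backup is already on disk, so the attempt concludes that
        # nothing needs to be created and no bookkeeping is required.
        return True
    disk.add(name)  # The backup itself is written first.
    if not device_exists:
        # Recording the backup fails because the rbak device is missing.
        raise AlmanacError("rbak device does not exist")
    records.add(name)
    return True


def run_with_retry(name, disk, records, device_exists, attempts=2):
    """Retry the backup a fixed number of times (an assumed retry policy)."""
    for _ in range(attempts):
        try:
            return run_backup(name, disk, records, device_exists)
        except AlmanacError:
            continue
    return False


disk, records = set(), set()
ok = run_with_retry("repo-backup", disk, records, device_exists=False)
print(ok, "repo-backup" in disk, "repo-backup" in records)  # True True False
```

The first attempt writes the backup and then fails on bookkeeping. The second attempt sees the backup on disk and exits successfully, so the run reports success while the record is never written.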

This system could be made "more atomic", although the outcome here is approximately the least bad failure. It's good that Almanac failures don't prevent backups from being written, and don't send the system into a death spiral where it repeatedly writes more and more copies of the same backups forever, etc.
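
One possible reading of "more atomic" is to decide on bookkeeping by comparing the record against the disk, instead of inferring it from the disk state alone. The sketch below is an assumption about how that could look, not a proposed patch; `run_backup_reconciling` and the `disk` / `records` sets are hypothetical.

```python
def run_backup_reconciling(name, disk, records, device_exists):
    """Write the backup if missing, then record it if the record is missing."""
    if name not in disk:
        disk.add(name)  # Still written before any bookkeeping is attempted.
    if name not in records:
        # Bookkeeping is retried on every run until it succeeds.
        if not device_exists:
            raise RuntimeError("rbak device does not exist")
        records.add(name)
    return True


disk, records = set(), set()
failed = False
try:
    run_backup_reconciling("repo-backup", disk, records, device_exists=False)
except RuntimeError:
    failed = True  # The run fails visibly, but the backup is on disk once.

# After the device definition is fixed, the next run only catches up the record.
run_backup_reconciling("repo-backup", disk, records, device_exists=True)
```

Under this sketch a missing device still does not block the backup from being written or cause duplicate copies, but the run keeps failing until the record can be written, so the inconsistency is not hidden behind a successful exit.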

For now, I expect to simply fix the affected Almanac definitions.

Event Timeline

epriestley triaged this task as Wishlist priority. Feb 24 2021, 10:03 PM
epriestley created this task.
epriestley claimed this task.
  • Hosts in the repo class are now built by Piledriver (see T13630), which automatically creates the rbak device entries, so this error isn't likely to occur again.
  • I also don't expect to launch any more hosts.

So I don't expect to fix this.