
Provide a formal "destroyed" status for Phacility instances
Open, Wishlist, Public

Description

Currently, Phacility instances enter the "suspended" state and then stay there forever. Their data may be destroyed after 90 days, but that is a node-level operation, and the instances remain bound indefinitely to the shards they used as services.

This is largely low-impact, but it has some silly effects: for example, deployment scripts load thousands of instances and then iterate through them, skipping nearly all of them. It also makes cluster compaction/rebalancing decisions more complicated, because the "Suspended for less than 90 days, might be trivially reactivated" instances and the "Suspended for 90+ days, nuked, totally dead without heroic intervention" instances aren't separated in the staff console.

Instances should have a formal "destroyed" state. The process to enter this state should be:

  • confirm the instance is eligible for destruction;
  • run node-level destruction on all service nodes;
  • de-link or disable all of the instance's service bindings;
  • put the instance in the "destroyed" state.

(Maybe a "destroying" state is also useful so cases which begin but do not complete this process are obvious.)

Once this is available, all eligible instances (out of service for 90+ days) should be destroyed.
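
Continuing the sketch above, and assuming an instance records when it went out of service (the `suspended_at` field here is a hypothetical stand-in for that record), the sweep over eligible instances could look like:

```
from datetime import datetime, timedelta, timezone

DESTRUCTION_GRACE = timedelta(days=90)

def is_eligible_for_destruction(instance, now=None):
    # "Out of service for 90+ days", using a hypothetical `suspended_at` timestamp.
    now = now or datetime.now(timezone.utc)
    return (
        instance.suspended_at is not None
        and now - instance.suspended_at >= DESTRUCTION_GRACE
    )

def sweep_suspended_instances(instances):
    # Destroy everything that has aged out of the reactivation window.
    for instance in instances:
        if is_eligible_for_destruction(instance):
            destroy_instance(instance)
```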

Event Timeline

epriestley triaged this task as Normal priority. Jun 1 2021, 3:37 PM
epriestley created this task.

A related issue is that I think nothing currently destroys S3 data. For most instances this isn't significant, but it isn't helping anything. This should likely be part of the database destruction step, although it can probably interact with the S3 bucket directly.
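
As a hedged illustration, assuming each instance's file data lives under a per-instance key prefix in a single bucket (the prefix layout is an assumption, not the real Phacility storage scheme), the S3 cleanup step could look roughly like:

```
import boto3

def destroy_instance_s3_data(bucket: str, instance_name: str) -> None:
    # Assumed layout: all of an instance's file data sits under one prefix.
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{instance_name}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each list page holds at most 1,000 keys, which matches the
            # delete_objects() limit, so one delete call per page is safe.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```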

Instances technically have a formal "Deleted" status -- but it isn't really used by anything, nothing ever puts them into that status, and there are no instances in that status. For consistency with existing CLI workflows, I'm going to rename this to "Destroyed".

epriestley lowered the priority of this task from Normal to Wishlist. Dec 16 2021, 2:58 PM

It would still be nice to have this from a completeness/correctness perspective, but other changes have made it less valuable:

  • Migrating everything from m3 hosts to m4 hosts (see T13630) just left all the dead instances behind, effectively sweeping them under a rug.
  • Some service binding, deployment, and status tooling had to get smarter to do migrations, and this further reduced the costs of having dead-instance cruft.