AWS instance termination may fail/hang indefinitely
Open, Low, Public

Description

See PHI1566. Just noting this for future tooling considerations: AWS may take several hours to terminate an instance.

In PHI1566, repo023 is inconsistent/degraded (AWS claims it's okay and it's doing some stuff, but not enough stuff to be a functional repository shard). I issued a "Reboot" which hung for several minutes (which I've seen before), then issued a "Terminate" which has hung for about 8 minutes so far (which I haven't seen before).

This documentation suggests that this is routine enough to document, at least:

If your instance remains in the shutting-down state for several hours, Amazon EC2 treats it as a stuck instance and forcibly terminates it.
If it appears that your instance is stuck terminating and it has been longer than several hours, post a request for help to the Amazon EC2 forum. To help expedite a resolution, include the instance ID and describe the steps that you've already taken.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesShuttingDown.html

Some notes:

  • The host uptime was 983 days. Possibly, we should just assume AWS hosts decay over time and need to be cycled every X days (maybe once a month). We've been seeing more issues in this general vein with high-uptime hosts recently. Since most of our hosts have high uptime, this might just be AWS getting worse over time, but high-uptime hosts are probably unusual across AWS as a whole, so decay-with-uptime is plausible.
  • For automation, a strategy of "launch a new instance, swap over, terminate the old instance" is probably desirable anyway, but particularly so given that termination can take multiple hours (sketched below).
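
As a rough sketch of how automation could drive that flow (not how termination was issued here), the AWS CLI sequence looks something like this; the instance ID is a placeholder and the polling interval is arbitrary:

OLD_INSTANCE="i-0123456789abcdef0"

# Termination is asynchronous; issue it and move on.
aws ec2 terminate-instances --instance-ids "$OLD_INSTANCE"

# Poll rather than block: the instance may sit in "shutting-down" for hours
# before EC2 forcibly terminates it.
while true; do
  state=$(aws ec2 describe-instances \
    --instance-ids "$OLD_INSTANCE" \
    --query 'Reservations[0].Instances[0].State.Name' \
    --output text)
  if [ "$state" = "terminated" ]; then
    break
  fi
  sleep 60
done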

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision

Event Timeline

epriestley created this task.

To deal with this narrowly, I'm going to:

  • launch a new replacement repo023 instance;
  • move the volumes over;
  • update Almanac and DNS; and
  • deploy it.

Hopefully volume detachment works even though instance termination does not. If volume detachment fails, I'll try to snapshot the volumes and replace them as well.
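
If it comes to that, the snapshot route with the AWS CLI is roughly the following; all IDs, the zone, and the description are placeholders, and the snapshot ID comes from the output of the first command:

# Snapshot the stuck volume, then materialize a new volume from the snapshot.
$ aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "repo023 recovery"
$ aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-west-1a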

Normal volume detachment is just spinning, which isn't exactly surprising. I'm going to give it a few minutes and then force detachment.
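
For reference, a forced detach is a one-liner in the AWS CLI (the volume ID is a placeholder); --force can corrupt a mounted filesystem, which is an acceptable risk here since the instance is already degraded:

$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force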

Instance termination completed after about 20 minutes and all the volumes detached. Since a terminated instance can't be recycled, I'm going to continue with the replacement host and attach the volumes to it.
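
Attaching each volume to the replacement is along these lines (the IDs and device name are placeholders):

$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0fedcba9876543210 --device /dev/sdf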

I'm deploying the new host now. We just crossed a release so I'm going to manually restore it to 72f82abe07 once it comes up (see also T13359). Then, I'll resynchronize instance services for active instances.

A minor issue (arising from improved validation elsewhere) that should be looked at during service sync:

$ PHABRICATOR_INSTANCE=... /core/lib/services/bin/services sync
Synchronizing namespace "phacility.net".
Synchronizing namespace "phacility.com".
Synchronizing network "phacility.net".
Synchronizing device "daemon.phacility.net".
Synchronizing device "db016.phacility.net".
Synchronizing device "repo023.phacility.net".
[2019-11-25 16:05:21] EXCEPTION: (PhabricatorApplicationTransactionValidationException) Validation errors:
  - You can not delete this interface because it is currently in use. One or more services are bound to it. at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1066]
arcanist(head=stable, ref.master=cc850163f30c, ref.stable=bac2028421a4), libcore(), phabricator(head=stable, ref.master=eb6df7a2091a, ref.stable=72f82abe0723), phutil(head=stable, ref.master=39ed96cd818a, ref.stable=1750586fdc50), services(head=stable, ref.master=c4bd119b358e, ref.stable=2d7586076ae4)
  #0 PhabricatorApplicationTransactionEditor::applyTransactions(AlmanacInterface, array) called at [<services>/management/ServicesSyncWorkflow.php:483]
  #1 ServicesSyncWorkflow::syncDevice(array, AlmanacDevice, array, array) called at [<services>/management/ServicesSyncWorkflow.php:175]
  #2 ServicesSyncWorkflow::syncServices(array) called at [<services>/management/ServicesSyncWorkflow.php:42]
  #3 ServicesSyncWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:457]
  #4 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:349]
  #5 PhutilArgumentParser::parseWorkflows(array) called at [/core/lib/services/scripts/services.php:20]

The effective sync requirement is just an address update, so I'll do that surgically in cases where the sync flow hits this error.

I'll flesh this out more later, but the move away from db123 = repo123 shard pairing, plus bin/host query using mysql, makes it difficult to directly query instances using a particular repository service.

(Updating addresses with bin/host query leaves the service address cache dirty (the "mutable structure cache" via PhabricatorRepository->getAlmanacServiceRefs()) so it should be followed with bin/cache purge --caches general.)
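
Concretely, that per-instance purge is just the following (the instance name is a placeholder; the script path is the one used later in this task):

$ PHABRICATOR_INSTANCE=<instance> /core/lib/phabricator/bin/cache purge --caches general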

PHI1566 is resolved narrowly. These cleanup steps still need to happen.

Immediately

  • Update Almanac definitions for all instances not on the paired db023 shard.
  • Purge the Almanac cache for all instances on the service.

Next

  • Because bin/host query uses mysql -h localhost ... and repo hosts do not install mysql-client and do not run mysql, there is no way to run bin/host query against the set of instances using a particular repository shard service. Having a --raw mode for bin/host query is probably still valuable, but query-by-service would be materially useful in this case.
  • bin/services sync should be updated to handle this case correctly.
  • Ideally, a "sync all instances using service X" flow should exist as a staff/operations tool.

Update Almanac definitions for all instances not on the paired db023 shard.

The cleanest fix here is just to bulk-update everything since the tools can query every instance efficiently.

A sub-note here: bin/host query exits with an error if the last instance it queried failed (usually because the table does not exist). This isn't consistent with how other commands handle errors, and it propagates up to phage in a confusing way. The error is clearly communicated in the output already; bin/host query should just exit 0 if it runs to completion.

I identified affected Almanac interfaces like this:

$ phage remote query --pools db -- --query 'SELECT * FROM <INSTANCE>_almanac.almanac_interface WHERE address = "172.30.0.89";' | tee interfaces.log

Hypothetically, instances could have their own Almanac devices with interfaces on this address which would be unsafe to update. In practice, the sync flow retains PHIDs and every interface is bound to the network PHID-ANET-22nzhtcmulibfkwmrsja, which is the central "phacility.net" network, so these interfaces are all safe to update.
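
As a belt-and-suspenders check before the UPDATE, it's possible to confirm that no matching interface is bound to any other network. This sketch assumes the almanac_interface table stores the binding in a networkPHID column; the query should return no rows:

$ phage remote query --pools db -- --query 'SELECT * FROM <INSTANCE>_almanac.almanac_interface WHERE address = "172.30.0.89" AND networkPHID != "PHID-ANET-22nzhtcmulibfkwmrsja";'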

I updated these interfaces with:

$ phage remote query --pools db -- --query 'UPDATE <INSTANCE>_almanac.almanac_interface SET address = "172.30.0.92" WHERE address = "172.30.0.89";'

Then I verified I'd hit everything by querying again. The new query output showed that no interfaces bound to the old address remained on any instance.

Purge the Almanac cache for all instances on the service.

This is supported in a service-oriented way:

repo023 $ for i in `/core/bin/host instances --instance-statuses up --list`; do PHABRICATOR_INSTANCE=$i /core/lib/phabricator/bin/cache purge --all; done

That went through cleanly, so I believe the shard is now fully repaired.

epriestley added a revision: Restricted Differential Revision. Nov 25 2019, 10:46 PM
epriestley added a commit: Restricted Diffusion Commit. Nov 25 2019, 10:51 PM
epriestley added a revision: Restricted Differential Revision. Nov 25 2019, 11:57 PM
epriestley added a commit: Restricted Diffusion Commit. Nov 25 2019, 11:58 PM
epriestley added a revision: Restricted Differential Revision. Nov 26 2019, 12:09 AM
epriestley added a commit: Restricted Diffusion Commit. Nov 26 2019, 12:11 AM

> there is no way to bin/host query against the set of instances using a particular repository shard service

D20926 + D20928 + D20929 fix this. By default, bin/host query now uses InstanceRef to find a "real" connection to instances. The old mode remains available with --raw-connection.

If something similar occurs again, bin/host query ... now operates in normal bulk mode from hosts with a repo role.

(This may have broken some implicit assumptions in bin/host release, but there's no pressure to fix those; I'll clean them up during the next release cut.)