
Cycle More AWS Hosts (October 2018)
Closed, Resolved, Public

Description

I'm going to spend like 10 minutes trying to call DescribeInstanceStatus instead of copy/pasting these since the web UI for getting the actual list of instances requires about 9 clicks per instance.

2018-10-09 11:00:00 AM i-a2ae04fd          db009.phacility.net      system-reboot scheduled reboot 
2018-10-09 1:00:00 PM  i-0ef9bb23c49a101ca db021.phacility.net      system-reboot scheduled reboot 
2018-10-09 3:00:00 PM  i-0bee1f2525d6c6f73 web006.phacility.net     system-reboot scheduled reboot 
2018-10-09 5:00:00 PM  i-a6ae04f9          db010.phacility.net      system-reboot scheduled reboot 
2018-10-09 7:00:00 PM  i-d83fe14f          repo006.phacility.net    system-reboot scheduled reboot 
2018-10-09 9:00:00 PM  i-087ba057          repo007.phacility.net    system-reboot scheduled reboot 
2018-10-09 11:00:00 PM i-0caf0553          repo011.phacility.net    system-reboot scheduled reboot 
2018-10-10 1:00:00 AM  i-2694b1e6          saux001.phacility.net    system-reboot scheduled reboot 
2018-10-10 3:00:00 AM  i-01583e28f8c2b6973 db023.phacility.net      system-reboot scheduled reboot 
2018-10-10 5:00:00 AM  i-f172d33a          bastion005.phacility.net system-reboot scheduled reboot 
2018-10-10 3:00:00 PM  i-62af053d          db012.phacility.net      system-reboot scheduled reboot
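
For reference, the "call DescribeInstanceStatus instead of copy/pasting" approach could look roughly like the sketch below. This assumes Python with boto3, which is not necessarily what the actual tooling uses. `describe_instance_status`, `IncludeAllInstances`, and the `InstanceStatuses` / `Events` / `NotBefore` fields come from the EC2 API; the function names and the `hostnames` mapping (instance ID to hostname) are hypothetical, since this task doesn't say how instance IDs are mapped to hosts.

```python
def fetch_instance_statuses(region_name):
    """Call DescribeInstanceStatus and collect every InstanceStatuses entry."""
    import boto3  # imported lazily so the helpers below work without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_instance_status")
    statuses = []
    # IncludeAllInstances=True also reports instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    return statuses


def scheduled_event_rows(statuses, hostnames):
    """Flatten statuses into (not_before, instance_id, host, code, description).

    `hostnames` is a hypothetical dict of instance ID -> hostname; unknown
    instances are shown as "?".
    """
    rows = []
    for status in statuses:
        instance_id = status["InstanceId"]
        # Instances without scheduled events have no (or an empty) Events list.
        for event in status.get("Events", []):
            rows.append((
                event["NotBefore"],
                instance_id,
                hostnames.get(instance_id, "?"),
                event["Code"],
                event["Description"],
            ))
    # Earliest scheduled event first.
    rows.sort(key=lambda row: row[0])
    return rows


def format_rows(rows):
    """Render rows as aligned text lines, one per scheduled event."""
    lines = []
    for not_before, instance_id, host, code, description in rows:
        when = not_before.strftime("%Y-%m-%d %H:%M:%S")
        lines.append(
            f"{when}  {instance_id:<19}  {host:<24}  {code}  {description}"
        )
    return lines
```

Something like `format_rows(scheduled_event_rows(fetch_instance_statuses(region), hostnames))` would then produce one line per pending event, with the region and hostname mapping supplied by the caller.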

Revisions and Commits

Restricted Differential Revision

Event Timeline

epriestley triaged this task as Normal priority. Oct 1 2018, 3:59 PM
epriestley created this task.

"Use the API" seemed to work OK. Of those instances, only bastion005 is at all unusual.

epriestley added a revision: Restricted Differential Revision. Oct 1 2018, 4:37 PM
epriestley added a commit: Restricted Diffusion Commit. Oct 1 2018, 8:16 PM

I'm going to get these underway once the deploy finishes.

"only bastion005 is at all unusual."

For bastion, I'm going to put a bastion006 in service beside bastion005, swap to it, then tear down bastion005. The bastion hosts have no state, but if we enter a state where we have no bastions up we can't get to anything else in the cluster. Bastions can deploy without other bastions existing (and we've done so four times, RIP bastions 1-4) but it's been a while since they rotated.

epriestley added a commit: Restricted Diffusion Commit. Oct 6 2018, 3:36 PM

I cycled all the hosts except bastion. saux001 needs to be vetted a bit (it handles "Land Revision" from the web UI) but it isn't critical if it needs a bit more work.

There's a minor deadlock on bastion deployment with the current scripts: during deploy, we run deploy-key to copy the deploy key from the bastion to the target host, so that we don't need to put the entire keystore on normal cluster nodes, and so that we don't need to have the keystore on the control host (staff laptop) outside the cluster.

However, this operation doesn't make sense if we're actually deploying a bastion, since other bastions can't connect to it normally and there may be no other bastions (if we're deploying the first bastion after some kind of failure). For now, I'm just working around this by:

  • skipping deploy-key;
  • manually putting the deploy key in place in ~/.ssh/id_rsa.

This should be more clever but it's "good" that it can't easily be automated, in the sense that staff laptops should not have the deploy key present at deploy time in any other cases. This might make sense as a separate --deploy-key ... flag which is unique to bin/remote deploy bastion-class operations. I suspect T13076 will lead toward something more structured here.
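
As a rough illustration of the rule described above, here is a minimal Python sketch of where the deploy key would come from for each host class. It is not the real bin/remote logic: the function name, the "bastion" class string, and the return values are all hypothetical, and `deploy_key_path` stands in for whatever a bastion-only `--deploy-key` flag might carry.

```python
def deploy_key_source(host_class, deploy_key_path=None):
    """Decide how the deploy key reaches a host being deployed.

    Hypothetical sketch: returns ("bastion", None) when the key is copied
    from an existing bastion (the normal deploy-key step), or
    ("operator", path) when the operator supplies it explicitly.
    """
    is_bastion = host_class == "bastion"

    if is_bastion:
        # Other bastions can't connect to a new bastion normally, and there
        # may be none at all, so the key must be provided by hand.
        if deploy_key_path is None:
            raise ValueError("bastion deploys need an explicit deploy key")
        return ("operator", deploy_key_path)

    # For every other host class, the control host should never have the
    # deploy key present, so an explicit key is rejected.
    if deploy_key_path is not None:
        raise ValueError("only bastion deploys may supply a deploy key")
    return ("bastion", None)
```

Rejecting an explicit key for non-bastion hosts keeps the property noted above: staff laptops hold the deploy key only while a bastion is being deployed.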

I turned bastion.phacility.net and bastion-external.phacility.net into CNAME records and pointed them at the new bastions.

(Technically, bastion007, since bastion006 was already deployed during cluster research.)

I also needed to copy the old master.key from bastion005 to bastion007 in /core/lib/keystore/.

I'm pretty sure we don't actually depend on this key after T12608 and removing it appears to work, but I'm hesitant to break too much stuff all at once.

epriestley added a revision: Restricted Differential Revision. Oct 8 2018, 3:41 PM

I think this is all done but want to let things run against bastion007 for a bit before I tear down bastion005.

One more of these just came in for repo003.

Plus: db018.phacility.net, repo001.phacility.net, db024.phacility.net.

Taking care of these now. I expect everything to be pretty routine.

We haven't seen any bastion issues so I stopped bastion005.

I think everything here is now fully cycled, synchronized, and cleaned up.

$ ./bin/provision events
Querying events...
Date                   ID                  Host                  Code          Description
2018-11-02 11:00:00 AM i-50ad3c90          repo003.phacility.net system-reboot [Completed] scheduled reboot 
2018-11-04 10:00:00 PM i-0c6de4f4702c443af db018.phacility.net   system-reboot [Completed] scheduled reboot 
2018-11-05 8:00:00 AM  i-a9161561          repo001.phacility.net system-reboot [Completed] scheduled reboot 
2018-11-05 10:00:00 AM i-0b43057505ff504f1 db024.phacility.net   system-reboot [Completed] scheduled reboot