
Phacility Cluster Maintenance: 2017 Week 37
Closed, Resolved · Public

Description

Planned operations maintenance this week:

  1. Replace notify001, which is scheduled for AWS downtime on September 20th.
  2. T12932-related volume expansions:
    A. Upgrade dbak001 to 128GB (from 64GB).
    B. Upgrade dbak002 to 128GB (from 64GB).
    C. Upgrade ddata001 to 256GB (from 64GB). This will involve downtime for this shard.
    D. Upgrade ddata002 to 256GB (from 64GB). This will involve downtime for this shard.
  3. (T12819) Rebuild all indexes on all active instances to populate Ferret engine data.

Steps (1), (2A), and (2B) are not especially disruptive and can happen any time.

Steps (2C) and (2D) are disruptive and I expect to complete them off-peak alongside deployment.

Step (3) must happen after deployment.
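
Step (3) presumably runs Phabricator's standard reindex workflow on each instance; a minimal sketch, assuming it's invoked from each instance's phabricator/ directory:

  # Rebuild every document's search index, including already-indexed ones.
  ./bin/search index --all --force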

It would also be desirable to improve maintenance notifications so shards have a less disruptive experience ("Scheduled maintenance" instead of "oops, it's totally broken"). This isn't strictly necessary, but it will make later work on shard compaction smoother.


Also:

  • T12983, decommission vault001.
  • T12608, cycle the master.key.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision

Event Timeline

Replace notify001, which is scheduled for AWS downtime on September 20th.

This is probably a good opportunity to terminate SSL at an ALB or ELB in front of the host, see also T12917.

In T12847#229548, @joshuaspence suggested that this works with an ELB. I think I got things wrong in that thread; the actual capability matrix is:

  Type         Listen on 22?  Terminate SSL?  Websockets?  TCP?  Price
  ELB-Classic  No             Yes             Yes          Yes   Medium
  ELB-VPC      Yes            Yes             Yes          Yes   Medium
  ALB          No             Yes             Yes          No    Lower
  Haproxy      Yes            Yes             Yes          Yes   Higher

I'm inclined to try to use an ELB here since ALBs are more limited -- I'd prefer to have one type of LB in the absence of a good reason not to, and we know ALBs can't do everything we need LBs to do (raw TCP for SSH).

However, I'm not totally sure what an "ELB-VPC" LB is. The documentation here says they can do 22:

http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-listener-config.html

But the UI has no option to launch them? Maybe this is automatic magic?

[Screenshot of the ELB creation UI: Screen Shot 2017-09-12 at 10.05.25 AM.png, 208 KB]

Actually it looks like the next screen lets you pick a VPC to launch inside.

But nlb001 is actually already an ELB-VPC and we don't need port 22 on these hosts anyway.

So I'm going to do this:

  • Locally reconfigure notify001 to run without SSL, but don't restart it yet.
  • Reconfigure nlb001 to terminate SSL, then publish that change.
  • Restart notify001.
  • Test that moving SSL termination to nlb001 works.
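
A minimal sketch of that reconfiguration, assuming Aphlict's JSON config format and a hypothetical config path:

  # Write an Aphlict config with no SSL material, so the daemon serves
  # plaintext websockets and nlb001 can terminate SSL in front of it.
  # Don't restart yet. (The /core/conf path is an assumption.)
  cat > /core/conf/aphlict/aphlict.json <<'EOF'
  {
    "servers": [
      {"type": "client", "port": 22280, "listen": "0.0.0.0",
       "ssl.key": null, "ssl.cert": null},
      {"type": "admin", "port": 22281, "listen": "127.0.0.1"}
    ]
  }
  EOF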

If we're good:

  • Commit the notify001 config changes (no SSL config, pretty much) to the upstream.
  • Bring up notify002, start services.
  • Swap notify001 to notify002 in the LB.
  • Decommission notify001.

These hosts are stateless so there's no real risk here except downtime, which will only impact real-time notifications.
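
For reference, a sketch of the listener change and host swap with the classic-ELB CLI; the port, certificate ARN, and instance IDs here are all assumptions:

  # Terminate SSL at nlb001: SSL listener in front, plaintext TCP to Aphlict.
  aws elb create-load-balancer-listeners \
      --load-balancer-name nlb001 \
      --listeners "Protocol=SSL,LoadBalancerPort=22280,InstanceProtocol=TCP,InstancePort=22280,SSLCertificateId=arn:aws:iam::ACCOUNT-ID:server-certificate/phacility"

  # Once notify002 is up and healthy, swap the hosts behind the LB.
  aws elb register-instances-with-load-balancer \
      --load-balancer-name nlb001 --instances i-NOTIFY002
  aws elb deregister-instances-from-load-balancer \
      --load-balancer-name nlb001 --instances i-NOTIFY001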

I'm a little confused about why I wasn't able to do SSH traffic over an ELB in the past, since all our ELBs are VPC ELBs which should support 22 -- and I was able to add a bogus 22 listener on nlb001 without any issues. I distinctly remember being unable to do this when I brought the cluster up initially, but maybe I was testing in the old (non-VPC) secure cluster, or maybe this capability became available after I tested. Either way, we can probably replace vault001 with a vlb ELB now that this works, since that host is just haproxy serving as a TCP load balancer.
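
The "bogus 22 listener" test was along these lines (a hypothetical reconstruction with the classic-ELB CLI):

  # Add a TCP:22 listener to confirm a VPC ELB accepts it, then remove it.
  aws elb create-load-balancer-listeners \
      --load-balancer-name nlb001 \
      --listeners "Protocol=TCP,LoadBalancerPort=22,InstanceProtocol=TCP,InstancePort=22"
  aws elb delete-load-balancer-listeners \
      --load-balancer-name nlb001 --load-balancer-ports 22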

epriestley added a revision: Restricted Differential Revision. Sep 12 2017, 5:17 PM

Test that moving SSL termination to nlb001 works.

This works properly, so I'm going to continue and swap the host.

epriestley added a commit: Restricted Diffusion Commit. Sep 12 2017, 5:32 PM

I'm bumping into T12171 when bringing up the new host. I'm going to take another stab at figuring out what's going on there because the workaround I found in that task is ridiculous.

I think this is basically "node is bananas", and our AMI is Ubuntu 14, which ships with "Node for DOS".

The recommended PPA setup is "curl a script and pipe it to sudo", which I don't really want to put into anything we run automatically because we're Serious Software Developers here.

I'm just going to keep using the "n" garbage for now -- which I'm sure is also a wrapper that just curls things and pipes them to sudo sh under the hood -- and hope this resolves itself as we move to Ubuntu 16. If it does, we can come back and bring this host up to Ubuntu 16.
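
For context, the two options look roughly like this; the NodeSource script version is an assumption:

  # The recommended PPA setup: curl a script and pipe it to sudo.
  curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
  sudo apt-get install -y nodejs

  # The "n" route used here instead; it also fetches binaries at runtime.
  sudo npm install -g n
  sudo n stable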

Swap notify001 to notify002 in the LB.

This is ready to go, but (ha ha ha) we hard-code the internal endpoint:

https://secure.phabricator.com/source/services/browse/master/src/config/PhacilitySiteSource.php;08219d678cee08bc3d7899b6df753ade177ef7a0$421

I'm just going to wait for the deploy to swap the hardware. Doing this "right" would involve setting up an internal LB, but since it's one host running nonessential services this doesn't seem like the highest priority operational issue we have. I'll stage the swap and then move on for now.

epriestley added a revision: Restricted Differential Revision. Sep 12 2017, 6:03 PM

Upgrade dbak001 to 128GB (from 64GB).
Upgrade dbak002 to 128GB (from 64GB).

I've created and mounted new bak volumes, and am now copying and swapping them.
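
The copy-and-swap for each volume is roughly the following; the volume IDs, instance IDs, availability zone, device names, and mount points are all assumptions:

  # Create and attach the new 128GB volume.
  aws ec2 create-volume --size 128 --volume-type gp2 --availability-zone us-west-1a
  aws ec2 attach-volume --volume-id vol-NEW --instance-id i-DBAK001 --device /dev/xvdg

  # On the host: format the new volume, copy the data, swap the mounts.
  sudo mkfs.ext4 /dev/xvdg
  sudo mount /dev/xvdg /mnt/new-bak
  sudo rsync -a /bak/ /mnt/new-bak/
  sudo umount /mnt/new-bak /bak
  sudo mount /dev/xvdg /bak

  # Detach the old volume; it sticks around as "...-old" until Saturday.
  aws ec2 detach-volume --volume-id vol-OLD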

Both bak volumes are now swapped. The old volumes are detached as dbak001.phacility.net-old and dbak002.phacility.net-old. I'll delete them after the deployment on Saturday if no issues arise before then.

I'll continue the other steps alongside the deployment.

epriestley added a commit: Restricted Diffusion Commit. Sep 21 2017, 11:02 AM
epriestley added a commit: Restricted Diffusion Commit. Sep 21 2017, 11:12 AM