
Rebalance Phacility instances into a private subnet
Closed, Resolved · Public

Description

Previously, see T13076. Several Phacility issues are ripe for resolution via a mass migration:

  • Balancing and compaction issues.
    • Some shards are heavily loaded.
    • Free instances can be put in shard jail.
  • Instance types: everything is "m3.large" and a bunch of hosts have 1K+ days of uptime. The newer "m4.large" chassis is more or less a strict upgrade.
  • System version: current systems are on Ubuntu 14 + PHP 5.x. We can upgrade to Ubuntu 20 + PHP 7.4.
  • Subnet/NAT issues in T12816.
  • Old volume types (T12999) and sizes.

Event Timeline

epriestley created this task.

> Subnet/NAT issues in T12816.

I've launched some test hardware into a fully-private subnet with a NAT gateway following the pattern in T12816, and it appears to be working well with no public IP address. I haven't put any instances on it yet and I'm not entirely sure that all the sub-services are in good shape, but most of the issues were previously cleared (notably vault / ALB issues).
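
As a sanity check, something like this boto3 sketch can verify the two properties that matter here (the subnet ID is a hypothetical placeholder, and default AWS credentials are assumed): no instance in the subnet has a public IP, and the subnet's default route goes through a NAT gateway.

```python
import boto3

ec2 = boto3.client("ec2")
subnet_id = "subnet-XXXX"  # Hypothetical placeholder for the private subnet.

# Confirm nothing in the subnet has been assigned a public IP.
reservations = ec2.describe_instances(
    Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
)["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        assert "PublicIpAddress" not in instance, instance["InstanceId"]

# Confirm the subnet's default route points at a NAT gateway.
route_tables = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
)["RouteTables"]
for table in route_tables:
    for route in table["Routes"]:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            assert "NatGatewayId" in route
```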

> System version: current systems are on Ubuntu 14 + PHP 5.x. We can upgrade to Ubuntu 20 + PHP 7.4.

I have deployment working for Ubuntu 20. The official image I'm using comes up with a lot of weird AWS/EC2 magic services which I'm a little skeptical about. The only significant issue I ran into was an upstart -> systemd switch.
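
For example, the upstart job that launched the daemons translates into a systemd unit along these lines (a sketch only: the paths, user, and unit name here are illustrative, not the actual deployment configuration):

```
[Unit]
Description=Phabricator daemons (phd)
After=network.target

[Service]
Type=forking
ExecStart=/core/lib/phabricator/bin/phd start
ExecStop=/core/lib/phabricator/bin/phd stop
User=phd

[Install]
WantedBy=multi-user.target
```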

> Instance types: everything is "m3.large" and a bunch of hosts have 1K+ days of uptime. The newer "m4.large" chassis is more or less a strict upgrade.

As launched with the stock AMI, the m4.large instances have less space on their EBS root volume, which leaves less space available for swap. This otherwise seems straightforward. Some pathway exists to create a similar AMI with more space, but it wasn't obvious to me how to do this.
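
One plausible way around this (an assumption on my part, not something I've verified): the root volume size can be overridden at launch time with a block device mapping, without building a custom AMI. A boto3 sketch, with the AMI ID and size as hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-XXXX",       # Hypothetical: the stock Ubuntu 20 AMI.
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {
            # Must match the AMI's root device name
            # (usually /dev/sda1 for Ubuntu AMIs).
            "DeviceName": "/dev/sda1",
            "Ebs": {
                "VolumeSize": 64,  # GiB; larger than the AMI default.
                "VolumeType": "gp2",
                "DeleteOnTermination": True,
            },
        }
    ],
)
```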

> Old volume types (T12999) and sizes.

Will be mooted by reallocating everything.

I've written some Terraform-class tooling which can likely automate all the actual hardware allocations. This needs some more work, but I believe the tricky stuff (mostly: representing resources and allowing templating to reference resources which haven't been built yet) is at least working.
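
The pattern for the tricky part looks roughly like this (illustrative names only; this is a sketch of the idea, not the actual tooling): a template may hold lazy references which only resolve once the referenced resource has actually been allocated.

```python
class Ref:
    """Lazy reference to an attribute of a resource that may not exist yet."""
    def __init__(self, resource, attribute):
        self.resource = resource
        self.attribute = attribute

    def resolve(self):
        if self.resource.allocated is None:
            raise Exception("Referenced resource has not been built yet.")
        return self.resource.allocated[self.attribute]


class Resource:
    def __init__(self, name, template):
        self.name = name
        self.template = template  # May contain Ref objects.
        self.allocated = None     # Filled in once the resource is built.

    def ref(self, attribute):
        return Ref(self, attribute)

    def build(self, allocator):
        # Resolve any Refs in the template, then allocate for real.
        spec = {
            key: (value.resolve() if isinstance(value, Ref) else value)
            for key, value in self.template.items()
        }
        self.allocated = allocator(spec)


# The volume template can reference the instance before either exists,
# as long as the instance is built first. Build out of order and the
# Ref raises instead of silently allocating against nothing.
instance = Resource("repo025", {"type": "m4.large"})
volume = Resource("repo025-data", {"attach_to": instance.ref("instance_id")})

instance.build(lambda spec: {"instance_id": "i-XXXX", **spec})
volume.build(lambda spec: {"volume_id": "vol-XXXX", **spec})
```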

Rebalancing repository shards requires automating the process described in "Migrating Repository Shards", but the hard stuff already works; it just needs orchestration glue.

There's less existing support for moving databases, but the process is also simpler if it's done as a plain dump + load. Doing it as dump + load + replicate + wait + swap is not as simple, but maybe not that bad. Database shards are generally less of a problem than repository shards.
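
The simple flavor is essentially a one-pipe operation, as in this Python sketch (hostnames and the database name are hypothetical placeholders, and a real migration would also need to hold writes during the copy):

```python
import subprocess

# Dump from the source shard and stream directly into the destination.
dump = subprocess.Popen(
    ["mysqldump", "-h", "db001.phacility.net", "--single-transaction",
     "instance_database"],
    stdout=subprocess.PIPE,
)
load = subprocess.run(
    ["mysql", "-h", "db002.phacility.net", "instance_database"],
    stdin=dump.stdout,
)
dump.stdout.close()

if dump.wait() != 0 or load.returncode != 0:
    raise Exception("Dump + load failed; source shard remains authoritative.")
```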

  • When Piledriver destroys a resource pile, it's helpful if it can read the entire authoritative state from sources by using only a pile ID.
    • EC2 can do this with "DescribeTags" (see the sketch after this list).
    • Almanac currently cannot. Almanac types should support searching by property value.
      • This could be directly on almanac.*.search.
      • Or this could be generic, via T12799.
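
On the EC2 side, reading a pile back from a single pile ID is one call, as in this boto3 sketch (the tag key is an assumption about how pile resources would be labeled at allocation time):

```python
import boto3

ec2 = boto3.client("ec2")

pile_phid = "PHID-PILE-XXXX"  # Hypothetical pile ID.
tags = ec2.describe_tags(
    Filters=[
        {"Name": "key", "Values": ["piledriver.pilePHID"]},
        {"Name": "value", "Values": [pile_phid]},
    ]
)["Tags"]

# Each tag record identifies one resource belonging to the pile.
for tag in tags:
    print(tag["ResourceType"], tag["ResourceId"])
```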

In either case, *.search methods currently take key-value constraints, but property search requires a value like "piledriver.pilePHID is X". This can be accomplished with tokenizer functions, but support for multi-value functions (value-is(tag, value)) is limited. Better support here is desirable, to enable value-contains(property, substring), value-exists(tag), and so on.
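
To illustrate the gap, the desired call shape would be something like this (hypothetical throughout: neither the "properties" constraint nor the value-is(...) function exists today):

```python
import requests

# Hypothetical API shape: a "properties" constraint taking a
# multi-value tokenizer function. Almanac does not support this today.
response = requests.post(
    "https://admin.phacility.com/api/almanac.device.search",
    data={
        "api.token": "api-XXXX",
        "constraints[properties][0]":
            'value-is("piledriver.pilePHID", "PHID-PILE-XXXX")',
    },
)
print(response.json())
```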

None of this is strictly necessary, and can be approximated at little cost by discovering refs to destroy from other refs, then discovering any device named "X.phacility.net" when we destroy an EC2 resource named "X".

Piledriver would also benefit from having some functional equivalent of destroying an Almanac resource. This can be implemented as a piledriver.destroyed property, but a formal disabled state would be cleaner. PHI1331 is vaguely related.

Closing this in favor of T13630, which covers the same ground.