
Rebalance Phacility instances into a private subnet
Closed, Resolved · Public

Description

Previously, see T13076. Several Phacility issues are ripe for resolution via a mass migration:

  • Balancing and compaction issues.
    • Some shards are heavily loaded.
    • Free instances can be put in shard jail.
  • Instance types: everything is "m3.large" and a bunch of hosts have 1K+ days of uptime. The newer "m4.large" chassis is more or less a strict upgrade.
  • System version: current systems are on Ubuntu 14 + PHP 5.x. We can upgrade to Ubuntu 20 + PHP 7.4.
  • Subnet/NAT issues in T12816.
  • Old volume types (T12999) and sizes.

Event Timeline

epriestley created this task.

> Subnet/NAT issues in T12816.

I've launched some test hardware into a fully-private subnet with a NAT gateway following the pattern in T12816, and it appears to be working well with no public IP address. I haven't put any instances on it yet and I'm not entirely sure that all the sub-services are in good shape, but most of the issues were previously cleared (notably vault / ALB issues).
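
As a sanity check, something like this boto3 sketch can verify the two properties that matter here (the subnet ID is a hypothetical placeholder, and default AWS credentials are assumed): no instance in the subnet has a public IP, and the subnet's default route goes through a NAT gateway.

```python
import boto3

ec2 = boto3.client("ec2")
subnet_id = "subnet-XXXX"  # Hypothetical placeholder for the private subnet.

# Confirm nothing in the subnet has been assigned a public IP.
reservations = ec2.describe_instances(
    Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
)["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        assert "PublicIpAddress" not in instance, instance["InstanceId"]

# Confirm the subnet's default route points at a NAT gateway.
route_tables = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
)["RouteTables"]
for table in route_tables:
    for route in table["Routes"]:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            assert "NatGatewayId" in route
```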

> System version: current systems are on Ubuntu 14 + PHP 5.x. We can upgrade to Ubuntu 20 + PHP 7.4.

I have deployment working for Ubuntu 20. The official image I'm using comes up with a lot of weird AWS/EC2 magic services which I'm a little skeptical about. The only significant issue I ran into was an upstart -> systemd switch.
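
For example, the upstart job that launched the daemons translates into a systemd unit along these lines (a sketch only: the paths, user, and unit name here are illustrative, not the actual deployment configuration):

```
[Unit]
Description=Phabricator daemons (phd)
After=network.target

[Service]
Type=forking
ExecStart=/core/lib/phabricator/bin/phd start
ExecStop=/core/lib/phabricator/bin/phd stop
User=phd

[Install]
WantedBy=multi-user.target
```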

> Instance types: everything is "m3.large" and a bunch of hosts have 1K+ days of uptime. The newer "m4.large" chassis is more or less a strict upgrade.

As launched with the stock AMI, the m4.large instances have less space on their EBS root volume, which leaves less space available for swap. This otherwise seems straightforward. Some pathway exists to create a similar AMI with more space, but it wasn't obvious to me how to do this.
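
One plausible way around this (an assumption on my part, not something I've verified): the root volume size can be overridden at launch time with a block device mapping, without building a custom AMI. A boto3 sketch, with the AMI ID and size as hypothetical placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-XXXX",       # Hypothetical: the stock Ubuntu 20 AMI.
    InstanceType="m4.large",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {
            # Must match the AMI's root device name
            # (usually /dev/sda1 for Ubuntu AMIs).
            "DeviceName": "/dev/sda1",
            "Ebs": {
                "VolumeSize": 64,  # GiB; larger than the AMI default.
                "VolumeType": "gp2",
                "DeleteOnTermination": True,
            },
        }
    ],
)
```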

> Old volume types (T12999) and sizes.

Will be mooted by reallocating everything.

I've written some Terraform-class tooling which can likely automate all the actual hardware allocations. This needs some more work, but I believe the tricky stuff (mostly: representing resources and allowing templating to reference resources which haven't been built yet) is at least working.
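
The pattern for the tricky part looks roughly like this (illustrative names only; this is a sketch of the idea, not the actual tooling): a template may hold lazy references which only resolve once the referenced resource has actually been allocated.

```python
class Ref:
    """Lazy reference to an attribute of a resource that may not exist yet."""
    def __init__(self, resource, attribute):
        self.resource = resource
        self.attribute = attribute

    def resolve(self):
        if self.resource.allocated is None:
            raise Exception("Referenced resource has not been built yet.")
        return self.resource.allocated[self.attribute]


class Resource:
    def __init__(self, name, template):
        self.name = name
        self.template = template  # May contain Ref objects.
        self.allocated = None     # Filled in once the resource is built.

    def ref(self, attribute):
        return Ref(self, attribute)

    def build(self, allocator):
        # Resolve any Refs in the template, then allocate for real.
        spec = {
            key: (value.resolve() if isinstance(value, Ref) else value)
            for key, value in self.template.items()
        }
        self.allocated = allocator(spec)


# The volume template can reference the instance before either exists,
# as long as the instance is built first. Build out of order and the
# Ref raises instead of silently allocating against nothing.
instance = Resource("repo025", {"type": "m4.large"})
volume = Resource("repo025-data", {"attach_to": instance.ref("instance_id")})

instance.build(lambda spec: {"instance_id": "i-XXXX", **spec})
volume.build(lambda spec: {"volume_id": "vol-XXXX", **spec})
```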

Rebalancing repository shards requires automating the process described in "Migrating Repository Shards", but the hard stuff already works; it just needs orchestration glue.

There's less existing support for moving databases, but the process is also simpler if it's done as a plain dump + load. Doing it as dump + load + replicate + wait + swap is not as simple, but maybe not that bad. Database shards are generally less of a problem than repository shards.
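
The simple flavor is essentially a one-pipe operation, as in this Python sketch (hostnames and the database name are hypothetical placeholders, and a real migration would also need to hold writes during the copy):

```python
import subprocess

# Dump from the source shard and stream directly into the destination.
dump = subprocess.Popen(
    ["mysqldump", "-h", "db001.phacility.net", "--single-transaction",
     "instance_database"],
    stdout=subprocess.PIPE,
)
load = subprocess.run(
    ["mysql", "-h", "db002.phacility.net", "instance_database"],
    stdin=dump.stdout,
)
dump.stdout.close()

if dump.wait() != 0 or load.returncode != 0:
    raise Exception("Dump + load failed; source shard remains authoritative.")
```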

  • When Piledriver destroys a resource pile, it's helpful if it can read the entire authoritative state from sources by using only a pile ID.
    • EC2 can do this with "DescribeTags" (see the sketch after this list).
    • Almanac currently cannot. Almanac types should support searching by property value.
      • This could be directly on almanac.*.search.
      • Or this could be generic, via T12799.
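
On the EC2 side, reading a pile back from a single pile ID is one call, as in this boto3 sketch (the tag key is an assumption about how pile resources would be labeled at allocation time):

```python
import boto3

ec2 = boto3.client("ec2")

pile_phid = "PHID-PILE-XXXX"  # Hypothetical pile ID.
tags = ec2.describe_tags(
    Filters=[
        {"Name": "key", "Values": ["piledriver.pilePHID"]},
        {"Name": "value", "Values": [pile_phid]},
    ]
)["Tags"]

# Each tag record identifies one resource belonging to the pile.
for tag in tags:
    print(tag["ResourceType"], tag["ResourceId"])
```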

In either case, *.search methods currently take key-value constraints, but property search requires a value like "piledriver.pilePHID is X". This can be accomplished with tokenizer functions, but support for multi-value functions (value-is(tag, value)) is limited. Better support here is desirable, to enable value-contains(property, substring), value-exists(tag), and so on.
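
To illustrate the gap, the desired call shape would be something like this (hypothetical throughout: neither the "properties" constraint nor the value-is(...) function exists today):

```python
import requests

# Hypothetical API shape: a "properties" constraint taking a
# multi-value tokenizer function. Almanac does not support this today.
response = requests.post(
    "https://admin.phacility.com/api/almanac.device.search",
    data={
        "api.token": "api-XXXX",
        "constraints[properties][0]":
            'value-is("piledriver.pilePHID", "PHID-PILE-XXXX")',
    },
)
print(response.json())
```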

None of this is strictly necessary, and can be approximated at little cost by discovering refs to destroy from other refs, then discovering any device named "X.phacility.net" when we destroy an EC2 resource named "X".

Piledriver would also benefit from having some functional equivalent of destroying an Almanac resource. This can be implemented as a piledriver.destroyed property, but a formal disabled state would be cleaner. PHI1331 is vaguely related.

Closing this in favor of T13630, which covers the same ground.