
Move Phacility provisioning to Piledriver
Closed, Resolved · Public

Description

See T13542.


Stuff Piledriver could use:

  • Almanac search by property and/or property value, i.e. "find any device with property X" or "find any device with property X set to value Y". This doesn't have to be high-performance or well-indexed.
  • See PHI1331. It would be nice to have better support for a "destroyed" lifecycle stage in Almanac.

Vaguely nice to have:

  • See related T13220. API-level support for a "viewer()" policy. The default policy for Almanac devices is "Administrators", and a bot user may not satisfy this. Templates could configure an explicit policy, but "viewer()" would be a reasonable default. Piledriver can figure out the acting user PHID with user.whoami and effect this policy, but this could be cleaner with "viewer()".
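
For reference, here's a minimal sketch of the user.whoami workaround, assuming Arcanist's ConduitClient is available and that almanac.device.edit exposes the standard "view" policy transaction; the URI, token, and device name are placeholders:

<?php

// Sketch only: resolve the acting user's PHID with "user.whoami", then pass
// it as an explicit view policy when creating a device. The URI, token, and
// device name are placeholders, and the "view" transaction type is assumed
// to be exposed by "almanac.device.edit".
$client = new ConduitClient('https://admin.example.com/');
$client->setConduitToken('api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx');

$whoami = $client->callMethodSynchronous('user.whoami', array());
$actor_phid = $whoami['phid'];

$client->callMethodSynchronous(
  'almanac.device.edit',
  array(
    'transactions' => array(
      array('type' => 'name', 'value' => 'example-device-001'),
      array('type' => 'view', 'value' => $actor_phid),
    ),
  ));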

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
rARC Arcanist
D21732

Event Timeline

epriestley triaged this task as Normal priority. Mar 6 2021, 8:03 PM
epriestley created this task.

Can Piledriver be implemented as an Arcanist toolset?

Probably. As with Phage, there's a fairly bright line between the Phacility-specific parts and the generic parts. Separating it probably doesn't make sense right away since the two parts are mutating in tandem, but I think it can move to Arcanist once it works.

One potential issue here is that Piledriver may benefit from access to Phabricator components, like the Query class tree. This might motivate moving this tree to Arcanist.

Where is pile metadata stored?

Currently, Piledriver bootstraps resource discovery without any storage. On the one hand this is nice to have, since it means there can never be a DRY issue where the metadata says resources are in one state and the resource API says they're in a different state.

However, it limits our ability to do things like "track which template was used to build a pile" or "log changes to a pile", which are almost certainly necessary in the long run. This can probably just go into Almanac? It doesn't feel like a separate application.

There's a bootstrapping issue here, where Piledriver needs to create admin entries in Almanac but also, conceivably, needs to create admin itself in the first place. The admin records could exist on secure, or the CLI could maintain a soft dependency on Almanac and bootstrap itself in multiple stages. This is probably not much of an issue in practice, since (at least today) you can manually drive a pile using the AWS console without significant difficulty until Piledriver can finish driving it.

epriestley added a revision: Restricted Differential Revision. Mar 29 2021, 4:44 PM
epriestley added a commit: Restricted Diffusion Commit. Mar 29 2021, 4:45 PM
epriestley added a revision: Restricted Differential Revision. Nov 18 2021, 5:15 PM
epriestley added a revision: Restricted Differential Revision. Nov 18 2021, 6:21 PM
epriestley added a revision: Restricted Differential Revision. Nov 18 2021, 6:26 PM
epriestley added a commit: Restricted Diffusion Commit. Nov 18 2021, 6:27 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.

Here's the last known state of the world from T12816:

I think the next steps are:

  • Convert notify to an ALB so we don't have to move it (and can get SSL keys off the hosts in the tier).

snlb001 was converted a long time ago (or created as an ALB) so this should, in theory, be straightforward. This doesn't need to be step 1, just to happen before the old subnet switches over to NAT (if it does).

  • Launch a public subnet with an IGW rule.

Still makes sense to me. Not going to put subnets into Piledriver at this point since I have no current plans to ever make more than a handful.

  • Move vault and bastion to the public subnet. Both are stateless and can be re-launched and then shut down once we confirm the new ones work. We should probably swap both to EIPs during the move.

This should still be fine.

  • Launch a NAT gateway into the public subnet.
  • Cross fingers.
  • Update the main subnet to use the NAT instead of the IGW.
  • (If possible, strip IPv4 public IPs from existing hosts? Pretty sure there is no way to do this.)

I think this plan of action should mostly be independent of other work, and the only immediate benefit is the ability to provide a stable address range (T11336). I suspect it makes sense to pursue before we do any actual work on hardware for private clusters, but doesn't otherwise need to happen particularly soon.


Since my main goal is now just to compact the infrastructure and bleed off extra hardware, with no real intent to do private clusters, I think this modified line of attack likely makes more sense:

  • (Manual) Launch a public subnet with an IGW rule.
  • (Manual) Launch a NAT gateway into the public subnet.
  • (Manual) Launch a private subnet.
  • (Piledriver) Provision a next-generation repo host into the private subnet. Migrate meta onto it. Migrate ~100 free instances onto it. Repeat until all free instances are on private subnet hosts. Balance paid instances onto other hosts sensibly. Decommission the entire repo tier.
  • (Piledriver) Provision a next-generation db host into the private subnet, etc., etc.

We have a lot of leftover VPC cruft that I'm going to nuke, notably meta and admin VPCs that (as far as I can tell) have nothing in them, and then a bunch of subnets (meta.private-a, meta.private-b, block-public-222, admin.public-a, admin.public-b, meta.public-a, meta.public-b, block-private-3) and some NGWs etc. I'm like 99% sure this stuff is all leftover from testing years ago and nothing depends on it, but I guess we'll see what happens when I delete all of it.

I got rid of everything I could, and nothing appears to be affected.

There's one VPC (meta) left in N. California that I can't figure out how to delete, because its subnets are still bound to auto-allocated IPv4 addresses that I "don't have permission" to delete, even as a root account. All the information I can find about this on the internet is the obvious stuff (e.g., the user actually doesn't have permission), so who knows. They aren't really hurting anything, so I'm going to leave them alone for now.

I can't figure out how to delete...

There were still some LBs kicking around attached to the subnets of this VPC, and I was able to get rid of the VPC after nuking the LBs.

Piledriver was built before the FutureGraph stuff settled in T11968; it runs into the same general set of sequencing problems and yield would likely be a good approach.

For example, destroying an EBS volume may require first detaching it, waiting for the detachment to complete, then destroying it. Without yield, this gets implemented as a switch ($this->currentDestructionPhase) { ... } sort of "functor"-flavored mess.

Only EBS volumes are currently affected and the impact is small (exactly one small functor-flavored mess) so I don't expect to do anything about this, but if Piledriver ever becomes an Arcanist Toolset it would also benefit from consolidation into HardpointEngine or some generalized yield-based graph executor.
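
As a rough sketch (hypothetical helper names, not actual Piledriver or AWS SDK calls), the yield-flavored version of the EBS case above could look something like this:

<?php

// Hypothetical sketch: a generator-shaped destruction routine for an EBS
// volume, instead of tracking a "currentDestructionPhase" field and
// switching on it. The helpers are placeholders for real EC2 calls.
function destroy_volume_steps($volume_id) {
  // Ask EC2 to detach the volume, then yield until the detachment finishes,
  // letting the executor run other work in the meantime.
  detach_volume($volume_id);
  while (!volume_is_detached($volume_id)) {
    yield;
  }

  // Only a detached volume can actually be destroyed.
  destroy_volume($volume_id);
}

A HardpointEngine-style executor could then interleave many of these generators instead of each resource tracking its own phase field.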


Record updates in Route53 seem to be fairly complicated -- the API is ChangeResourceRecordSets and takes a giant blob of XML, unlike most other APIs -- and using DNS at all for host resolution is kind of a hack that I'd like (in a perfect world) to move away from anyway. I'm just planning to do this bit manually for now.


The whole VPC/subnet dance is somewhat pointless without intent to pursue private clusters -- no meaningful network isolation is achieved -- but it does technically work now, and I suppose private clusters are actually sort of trivial?

We do get no more public addresses for servers and a consistent EIP for outbound traffic origination, so that's not literally nothing.


Piledriver can now do EC2 instances, EBS volumes, volume attachment, Almanac devices, and Almanac interfaces. This is nearly all of the busywork, and I think that's more or less good enough and I can start launching machines and solving the other 90% of issues that arise.

epriestley added a revision: Restricted Differential Revision. Nov 19 2021, 10:24 PM
epriestley added a commit: Restricted Diffusion Commit. Nov 19 2021, 10:25 PM

The new provisioning process for repository shards is:

  • (Piledriver) Create a repository shard.
  • Set up the Route53 record.
  • bin/remote deploy + shutdown -r + bin/remote deploy it. It would be sort of nice if deploy could do this automatically, but that seems kind of tough since there is currently no Phage side-channel and I don't see an obvious/trivial way to build one.
  • Create an appropriate Almanac service, or add it to an existing service.

This document is up to date:

https://secure.phabricator.com/w/phacility_cluster/migrating-repositories/

...but bulk migrating is pretty painful.


I updated bin/instances move to mostly go through the manual steps automatically, and brought almost all of the repositories on meta over. It currently fails when a repository has no working copy on disk. This is easy to fix, but requires a deploy to source shards. For the moment, I just made it continue on failure and exit with an error at the end.


The web views of repositories work properly, but SSH access does not. Ubuntu20 is using a patched sshd (for ForceUser) based on OpenSSH 7.9p1. This particular sshd binary drops connections unhelpfully even when the connection process and configuration are significantly simplified and everything is run under sshd -ddd, ssh -vvv, etc.

OpenSSH 8.8p1 works mostly fine, but requires these lines in the configuration file to accept connections from current hosts:

HostKeyAlgorithms +ssh-rsa
PubkeyAcceptedAlgorithms +ssh-rsa

These interfaces are internal-only so it's not really important if we're using less-modern ciphers. The general story here seems to be:

  • SHA-1 is broken-in-theory, although at the level of "a collision has been disclosed", not like "you can run this script to generate a chosen-prefix collision for an arbitrary SHA-1 hash". See also T12509.
  • The ssh-rsa algorithm uses SHA-1, so recent versions of OpenSSH are deprecating it.

So this would be nice to update at some point (i.e., get all internal traffic off ssh-rsa) but isn't a gaping hole today.

Beyond that, the ForceUser patch seems to apply cleanly to OpenSSH 8.8p1 and run properly.

iiam

...so I'm just going to use 8.8p1 as the basis for the Ubuntu20 cluster sshd binary, and give Ubuntu20 a separate SSHD config (these options are not backward-compatible).

Move vault and...

Just for completeness, vault used to be an HAProxy host serving as an SSH load balancer, but this responsibility moved to lb001 once ELBs became able to listen on inbound port 22 and forward TCP, so there is no longer a vault class of machines.

I completed all the repository migrations over the weekend and seemingly haven't run into any issues.

Database migrations are next. These have less existing infrastructure, although the overall pattern is the same. There's some setup required:

  • PhacilityDatabaseRef currently has only a host property.
    • Internal population inside InstancesStateQuery uses a string, not a dictionary. This must become a dictionary. Wire format is already fine (a dictionary).
    • It needs a readonly property.
  • SiteSource chooses the first database host when multiple are present. The first host is arbitrary (likely the newest, in practice).
    • PhacilityDatabaseRef should probably get a disabled property. SiteSource can filter disabled hosts and pick an arbitrary non-disabled host, but should probably just fatal when there is not exactly one acceptable host.
    • Updating services is atomic (or "atomic enough", at least): property changes and service connections are both edges on Instance.

So this looks like:

  • Make InstancesStateQuery use a dictionary when building the database ref information internally.
  • Add a readonly property and a disabled property in the Instances code.
  • Add support for readonly and disabled in DatabaseRef.
  • Add support for readonly and disabled in SiteSource.

As long as none of these changes go to instance hosts before they go to admin, there's no sequencing problem here.
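
A minimal sketch of that SiteSource selection rule (accessor names are illustrative, not the literal API):

<?php

// Illustrative only: skip disabled database refs, then refuse to guess if
// anything other than exactly one usable host remains.
function select_database_ref(array $database_refs) {
  $active = array();
  foreach ($database_refs as $ref) {
    if ($ref->getIsDisabled()) {
      continue;
    }
    $active[] = $ref;
  }

  if (count($active) !== 1) {
    throw new Exception(
      pht(
        'Expected exactly one enabled database service, found %d.',
        count($active)));
  }

  return head($active);
}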

Then the move script can do databases by:

  • putting the old service in readonly mode;
  • dumping a backup on the old host;
  • loading it on the new host;
  • connecting the new service and disabling the old service.
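
A sketch of the middle two steps, assuming the Arcanist execx() helper; the hosts, database name, and dump path are placeholders, and the readonly/service steps go through the Instances tooling rather than anything shown here:

<?php

// Sketch only: dump the instance's databases from the old (readonly) host
// into a local file, then load that file on the new host.
$old_host = 'old-db-host.example.com';
$new_host = 'new-db-host.example.com';
$database = 'example_instance_db';
$dump_file = '/tmp/example_instance.sql';

// Dump on the old host, capturing the output locally...
execx(
  'ssh %s -- mysqldump --single-transaction --databases %s > %s',
  $old_host,
  $database,
  $dump_file);

// ...then feed it to mysql on the new host.
execx(
  'ssh %s -- mysql < %s',
  $new_host,
  $dump_file);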

The "most correct" way to do this in theory, since shared instance database hosts are not replicated, is something like this:

  • configure a binlog for only the migrating databases;
  • dump the databases;
  • load them on the new host;
  • configure replication;
  • wait for catchup;
  • drop to readonly mode; confirm catchup; swap masters; stop replication; exit readonly mode;
  • deconfigure binlogs.

I think this requires service restarts, has complexity related to mixed MyISAM / InnoDB tables, and is generally a big complicated mess to avoid a relatively small amount of read-only quasi-downtime.

A half-effective approach is to do the simple readonly move, but don't dump indexes. Then, reindex after loading the data. This reduces the size of the dump and the cost of the data load, in exchange for breaking search for a while.

Since >95% of instances are free instances and none of them are supposed to have a significant amount of data, I expect to do ~95% of this with the simple move and then look at the rest of it. In cases where a source host has only one paid instance after free instances are migrated, it may make sense to be more selective.

Piledriver also needs to be able to provision database hosts, but these are more-or-less a trivial subset of repository hosts.

epriestley added a revision: Restricted Differential Revision. Dec 1 2021, 9:25 PM
epriestley added a revision: Restricted Differential Revision. Dec 1 2021, 9:34 PM
epriestley added a revision: Restricted Differential Revision. Dec 1 2021, 10:44 PM
epriestley added a revision: Restricted Differential Revision. Dec 1 2021, 11:03 PM
  • Make InstancesStateQuery use a dictionary when building the database ref information internally.

instances.state currently emits a databases list (just addresses) and a services list (relatively complete service information). Rather than make databases much heavier, I'm just going to get rid of it and make services a little heavier -- this ends up in a simpler state. This is in D21735.

  • Add a readonly property and a disabled property in the Instances code.

Done, in D21734.

  • Add support for readonly and disabled in SiteSource.

Done, in D21736.

  • Add support for readonly and disabled in DatabaseRef.

Done, in D21737. This is a slightly more complicated change than planned: instead, it virtualizes the DatabaseRef list as a view of the ServiceRef list that includes only database services. Net effect is the same API and a simpler and more consistent wire format.
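
The shape of that change is roughly this (names are illustrative, not the literal API):

<?php

// Illustrative sketch: database refs are just the service refs whose service
// is a database service, rather than a separately stored and separately
// transmitted list.
function get_database_refs(array $service_refs) {
  $refs = array();
  foreach ($service_refs as $ref) {
    if ($ref->isDatabaseService()) {
      $refs[] = $ref;
    }
  }
  return $refs;
}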

This change series is still non-breaking as long as it goes to admin first, but it touches all of core/, services/, and instances/ and feels risky-ish. Since my availability is a wildcard these days, I'm going to hold it until the weekend and try deploying then if things look calm on my end.

Piledriver also needs to be able to provision database hosts, but these are more-or-less a trivial subset of repository hosts.

This seems mostly simple, except that Ubuntu20 has the new "MySQL comes up with no way to log in" version of MySQL, which you're supposed to activate with mysql_secure_installation.

Some clever users have found ways to automate it, like this!

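# Pulls the auto-generated root password out of the MySQL log: take the last
# "A temporary password" line and print everything after the ": " separator.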
root_temp_pass=$(sudo grep 'A temporary password' /var/log/mysqld.log |tail -1 |awk '{split($0,a,": "); print a[2]}')
https://stackoverflow.com/questions/24270733/automate-mysql-secure-installation-with-echo-command-via-a-shell-script/35004940

On closer examination, it isn't actually this particularly silly flavor of nonsense -- it just comes up with auth_socket, which requires you to be logged in as root when connecting as root. That's tentatively okay, and the relevant parts of the workflow can necessarily sudo (to, e.g., restart the service), so I sprinkled some sudo into the MySQL deployment process.

That seems to have worked, so Piledriver can now provision db or repo hosts.

While waiting to deploy db stuff, I was planning to look at pruning dead data out of S3 -- but, on closer examination, the total S3 bill is something like $1/day, so no priority on that whatsoever.

epriestley added a commit: Restricted Diffusion Commit. Dec 4 2021, 9:05 PM
epriestley added a commit: Restricted Diffusion Commit.

I'm going to hold it until the weekend and try deploying then if things look calm on my end.

I'm rolling this out now. If everything goes well, there will be no behavioral changes.

If this goes fairly cleanly, I'm also planning to try to roll out selective outbound mail routing through Postmark (see T13669) and migrating the stateless web tier to m4.large hosts, but we'll see how far I get.

epriestley added a commit: Restricted Diffusion Commit. Dec 4 2021, 10:29 PM
epriestley added a commit: Restricted Diffusion Commit. Dec 4 2021, 10:41 PM
epriestley added a commit: Restricted Diffusion Commit.

The latest version of Phabricator itself is everywhere.

The new instances.state API is on admin.

The new core/ support for the API is partially deployed; the new services/ support isn't anywhere yet. These are likely rolling out next but I'm getting a bit tight on time at this point so I may not risk digging a hole that I don't have time to dig out of.

I swapped the whole web tier to m4.large (and resized it a bit).

Postmark API keys are now available, but not yet in use anywhere.

epriestley added a commit: Restricted Diffusion Commit. Dec 5 2021, 10:30 PM
epriestley added a commit: Restricted Diffusion Commit.

The new core/ support for the API is partially deployed; the new services/ support isn't anywhere yet.

These are now everywhere.

That's tentatively okay and the relevant parts of the workflow can necessarily sudo (to, e.g., restart the service) so I sprinkled some sudo into the MySQL deployment process.

This isn't actually okay because some restart pathways need to build queries, and the qsprintf*() family needs a $conn to do that. We can do this to fix it:

mysql> ALTER USER root@localhost IDENTIFIED WITH caching_sha2_password BY "";
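
(For context, the query helpers all take the connection as their first argument, so nothing that builds queries can run until we can actually connect; the specific query below is just an illustration.)

<?php

// Illustrative only: qsprintf()/queryfx() both need a live $conn, which is
// what the auth_socket-only configuration blocked for the restart pathways.
queryfx($conn, 'SET GLOBAL read_only = %d', 1);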

Next problem: the new mysqld loads these config directories:

!includedir /etc/mysql/conf.d/
!includedir /etc/mysql/mysql.conf.d/

We dump config files into conf.d/, but that is now superseded by the new mysql.conf.d/.

By default, mysql.conf.d/ has a config file with bind_address = 127.0.0.1, which prevents non-local connections. We already have bind_address = 0.0.0.0 in the Phacility config, but it is now overridden.

I'm sure there's a fantastic explanation for this somewhere.

I'm just going to make the install script destroy mysql.conf.d/mysqld.cnf if it exists, since this generally converges us to the simplest configuration.

Finally, there are some other MySQL version issues which can be avoided with:

character_set_server=utf8
default_authentication_plugin=mysql_native_password

Once all hosts are on Ubuntu20/MySQL 8, these can be removed.

I put all the database migration stuff everywhere and it appears stable. I'm hooking up Postmark as an outbound pathway now. If I get that working, I'll let it sit for a while and start migrating databases.

Currently, Phacility uses a "From" address of noreply@<instance>.phacility.com as a default sender. This doesn't work by default with Postmark (which requires each domain be configured individually). It could be made to work by adding DNS entries for every domain (yikes), but I think a much simpler fix is to always send from noreply@noreply.phacility.com and just reserve that instance name. No one is supposed to reply to this address anyway, obviously.

Almost every host currently in production was provisioned with Piledriver and things have been stable for quite a while, so I'm calling this resolved. See elsewhere for issues with Ubuntu20, mail, etc.