Page MenuHomePhabricator

Deploy Almanac v2 Changes to the Cluster
Closed, ResolvedPublic

Description

Log for cluster deployment of Almanac v2 API changes to the Phacility cluster.

I'm deploying some API changes to the cluster this morning. I anticipate this may involve about an hour of downtime. T10246 has more context and motivation.

These changes substantially impact how the cluster operates and are difficult to test exhaustively in a development environment, so I anticipate a possibility of encountering unforeseen issues that need to be corrected once they start reaching production services. This may result in more downtime than deploys usually require.

If everything just works right away this will be an almost completely normal deployment (there are some minor migrations with configuration that need to be performed), but there's enough uncertainty in the production behavior of these changes that I expect to hit some minor issues and need to work through them before normal operation resumes.

Event Timeline

I'm getting this underway now. I'm going to preemptively stop all daemons, which will pause repository imports, outgoing mail, etc. They'll be restarted at the end of the process.

I've stopped the daemons. I'm going to stop the web tier, deploy admin, fix instance configuration, then deploy the web tier. I expect this to be the riskiest part of the upgrade.

  • I stopped web.
  • I deployed admin.
  • I manually removed now-invalid configuration from instances and from Alamanac properties.
  • I spot-checked instance/Almanac service relationships.
  • I spot-checked Almanac service properties.
  • I manually invoked almanac.service.search to look for issues, but everything looks good.

I'm going to deploy web now, then see if it comes back up without any help.

web is sort of back up, but needs the new schema to actually make any progress, so this isn't conclusive. I'm deploying the db tier now to get the new schema.

I've upgraded db001 and sorted out host instances and service sync flows, which seem OK now. I'm going to upgrade the rest of the db tier.

db tier is deployed, now deploying repo001.

repo001 seems to have gone smoothly and daemons/repositories are now restored, I'm deploying the remainder of the tier.

I've deployed the remainder of the repo tier, and the auxiliary service tiers. I believe deployment is essentially complete.

I'm going to run through some additional verification steps now (admin daemon queue, launching a new instance, etc).

Daemons look good -- we accumulated a bit of a queue of backup tasks while the tier was out of sync, but these have now almost completely flushed on their own.

We have one failing MetaMTA task because someone entered a bad email address, but that's unrelated to the deployment.

Also, the backup changes from last week (rCOREbf0c93a46f8b) seem to have worked correctly. The instances that were hitting problems and delays have written backups exactly on schedule every day since then.

epriestley claimed this task.

Test instance workflow looks good, and general monitoring also looks clean. I'll keep an eye on things but I think we're in the clear.