
Expand Phacility cluster repo and db cluster tiers
Closed, Resolved · Public

Description

Based on unscientific gut feelings, I think we have enough instances now to reasonably expand the repo/db tiers. I'm planning to:

  • Put repo002 into service.
  • Put db002 into service.
  • Change configuration so new instances allocate on those hosts.

Revisions and Commits

rP Phabricator
D13460

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added projects: Phacility, Ops.
epriestley added subscribers: epriestley, btrahan.

I'm starting this now. This should ideally be routine, but I expect there will be a few hiccups.

  • I launched repo002 and db002 into the cluster.
  • I created .phacility.net DNS entries for the hosts.
  • I created ddata002, dbak002, rdata002, and rbak002 volumes and attached them according to Adding Cluster Hardware (the volume and DNS steps are sketched in CLI form below).
  • I used bin/remote deploy <host> to deploy each host. Both hosts deployed mostly cleanly.
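
For reference, the manual volume and DNS steps roughly correspond to the following AWS CLI calls. All identifiers, sizes, and zones here are placeholders, and today we do these through the console rather than any of our tooling:

  # Create and attach a data volume (volume ID, instance ID, size, and zone are placeholders).
  aws ec2 create-volume --availability-zone us-west-1a --size 128 --volume-type gp2
  aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf

  # Upsert the .phacility.net DNS record from a change-batch file.
  aws route53 change-resource-record-sets --hosted-zone-id Z0000000000000 --change-batch file://repo002-dns.json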

I ran into these issues:

  • DNS and volume steps could be better automated, as they currently involve a lot of manual UI stuff.
    • I'm not planning to expand the tooling into AWS for a while.
  • git clone hung when deploying db002.
    • Failed to reproduce when deploy was restarted. I'm just going to ignore this unless I see it again.
  • We don't actually install MySQL onto hosts in the db role.
    • I'll fix this shortly.
  • When a new host comes up with no instances, we don't handle it cleanly at the tail end of deployment.
    • I'll fix this shortly.

We don't actually install MySQL onto hosts in the db role.

I added mysql-server to the things we install for hosts with run.mysql as an attribute, in rCOREcad6a811.
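
The package install itself is just apt. A minimal sketch of the idea (the run.mysql attribute plumbing lives in our deploy scripts; using the noninteractive frontend to suppress the package's password prompt is an assumption, not necessarily what the scripts do):

  # Install mysql-server without the interactive root-password prompt.
  DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server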

This got us further, but we had a non-state-based check for initializing /core/data/mysql, so the deployment was not fully resumable after a failure between binding the volume and initializing the MySQL data directory. I converted this to a state-based check in rCOREbbe431ca.
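
A minimal sketch of what "state-based" means here, assuming the stock mysql_install_db tool (the real check lives in our deploy scripts): decide whether to initialize by looking at the disk, not by remembering whether an earlier step completed, so a failed deploy can simply be rerun.

  # Initialize the MySQL data directory only if it is not already initialized.
  if [ ! -d /core/data/mysql/mysql ]; then
    mysql_install_db --user=mysql --datadir=/core/data/mysql
  fi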

This cleared MySQL deployment and just left us with the "no instances" issue on both hosts.

When a new host comes up with no instances, we don't handle it cleanly at the tail end of deployment.
This cleared MySQL deployment and just left us with the "no instances" issue on both hosts.

Here's the trace for this:

[2015-06-26 18:38:17] EXCEPTION: (Exception) Expected exactly one result, got none. at [<libcore>/workflow/host/CoreHostWorkflow.php:544]
libcore(), phutil(head=stable, ref.master=992abe4a420c, ref.stable=7a8f9e361585)
  #0 CoreHostWorkflow::readOneConduitResult(array) called at [<libcore>/workflow/host/CoreHostWorkflow.php:385]
  #1 CoreHostWorkflow::getHostInstances(NULL, boolean) called at [<libcore>/workflow/host/CoreHostWorkflow.php:330]
  #2 CoreHostWorkflow::selectInstances(PhutilArgumentParser) called at [<libcore>/workflow/host/CoreHostRestartWorkflow.php:19]
  #3 CoreHostRestartWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:406]
  #4 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:301]
  #5 PhutilArgumentParser::parseWorkflows(array) called at [/core/scripts/host.php:19]

This isn't really a bug -- we aren't looking for instances as I suspected earlier, we're looking for the device entry. It seems reasonable and correct that the device entry should be created before performing a deployment, so I made the loadOne...() semantics more similar to executeOne() semantics and specialized the error to be clearer, in rCORE87b7b11.

This gave both hosts clean deploys up to a meaningful error about the missing Almanac device entries, which seems like reasonable and likely desirable behavior.

When creating device entries, I discovered AlmanacServiceQuery has a bug where name is not returned in the paging map, so the cursor for the next page of results can't be built. This prevents paging when an install has more than 100 service entries.

Paging is not critical to creating these services, so this doesn't block anything, although it's a little inconvenient. D13460 fixes this issue.

  • I created Almanac device entries for these hosts, following db001 and repo001. Both deploys went cleanly once the device entries were created.
  • I used remote ssh <host> to connect to the devices and verified connectivity, volume attachment, deployed libraries and branches, iptables rules, that mysql works on db002, and that nothing else appeared funky or broken. I didn't catch anything suspicious, except that remote deploy doesn't initialize or mount the backup volume: we do that when we actually perform backups. It probably should, just to catch misconfiguration and errors earlier. For now, I'll do this manually later on (roughly the steps sketched below). I filed T8688 to follow up.
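
For the manual backup-volume step, the plan is roughly the following. The device name and mount point are placeholders, not necessarily what remote deploy will eventually use:

  # Format and mount the backup volume by hand (device and mount point are placeholders).
  sudo mkfs.ext4 /dev/xvdg
  sudo mkdir -p /core/bak
  sudo mount /dev/xvdg /core/bak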

Next, I'll put these hosts into service.

  • I created Almanac service entries for repox002 and dbx002, following the 001 templates. Notably, repo bindings need a protocol specified. I didn't specify "instances.open" on these yet.
  • I created a new instance and verified it came up on repox001 and dbx001 still, since the new services aren't open.
  • I manually edited the instance to move it to the 002 services, then tried to upgrade the database. This failed.

I think this failed because MySQL is silently ignoring our configuration, and thus not actually listening on 3306. I'm suspicious this may be an AppArmor thing...

Yes, AppArmor.

Jun 26 18:38:15 db002 kernel: [ 1844.819416] type=1400 audit(1435343895.006:33): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/core/conf/mysql/my.cnf" pid=5526 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=1000
Jun 26 18:38:15 db002 kernel: [ 1844.825878] type=1400 audit(1435343895.014:34): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/core/conf/mysql/my.cnf" pid=5536 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=1000
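
For future reference, a few quick checks that surface this class of failure (standard Ubuntu tooling, not part of our deploy scripts):

  sudo aa-status | grep mysqld                  # is an AppArmor profile loaded for mysqld?
  sudo ss -lnt | grep 3306                      # is mysqld actually listening on 3306?
  grep DENIED /var/log/kern.log | grep mysqld   # any denials against the config file?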

This took a huge amount of manual fiddling last time, so it's not terribly surprising to me that I failed to capture some aspect of the required state.

epriestley added a commit: Restricted Diffusion Commit. Jun 26 2015, 7:45 PM

AppArmor uses file modification time heuristics to decide whether it can read a cache or not. This heuristic guesses the wrong result in the deployment case. I added -T to stop it from using the cache in rCORE8371c05b. Instance upgrade/sync operations now work correctly.
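
I believe this amounts to reloading the profile with apparmor_parser's skip-read-cache flag; a minimal sketch, assuming the stock Ubuntu profile path (the actual invocation is in rCORE8371c05b):

  # -r replaces the loaded profile; -T skips the (possibly stale) binary cache.
  sudo apparmor_parser -r -T /etc/apparmor.d/usr.sbin.mysqld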

  • I connected to the test instance, enabled DarkConsole, and verified it was hitting the right DB, etc.
  • I created, cloned, and pushed a repository (see the round trip sketched after this list).
  • I ran host backup explicitly on both machines. This wanted device entries for the volumes for logging. I created them; this completed smoothly afterward.
  • I verified that the backups appeared in the web UI log, that the repository was on the correct host, etc.
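
The repository check was a simple round trip, roughly as follows; the clone URI is a placeholder standing in for whatever the test instance reports:

  # Clone, push a change, then clone a fresh copy and confirm the change is there.
  git clone ssh://git@test-instance.example.com/diffusion/TEST/test.git
  cd test
  echo "smoke test" > smoke.txt
  git add smoke.txt && git commit -m "Smoke test" && git push
  cd ..
  git clone ssh://git@test-instance.example.com/diffusion/TEST/test.git verify
  test -f verify/smoke.txt && echo "round trip OK"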

I'm now going to close the 001 hosts for new allocations, so new instances will go on these new machines.

  • I set instances.open to 0 on the 001 services and to 1 on the 002 services.
  • I created a new instance and verified it came up on the 002 services.
  • I verified that the test instance on the 002 services still worked, and that a known 001 instance (meta) still worked.

So I think this ultimately went cleanly.

epriestley closed this task as Resolved. Jun 26 2015, 8:04 PM
epriestley claimed this task.

Stuff that went poorly:

  • AppArmor. I think this will be stable now, but if it continues causing issues I'm just going to turn it off. The tiny hypothetical value is in no way worth the number of hours I've sunk into configuring it, and configuration problems are difficult to detect/diagnose/understand because they manifest as nearly-silent failures.
  • Some minor issues where things weren't quite properly state-based, which caused deployment to not be very repeatable this time. I believe they're all fixed and the next tier expansion will be smooth. This was about what I expected from bringing up the 002 hosts, but it would be concerning if this hasn't dropped to ~nothing by the 003/004 hosts.
  • Lots of opportunity to make this faster by automating the manual steps in configuring EBS, Route 53, and Almanac. These were all pretty straightforward and nothing felt error-prone, but eventually this could all be automated away.

Stuff that went well:

  • Actual core cluster scalability, which seems to have just worked out of the box exactly as designed.