Based on unscientific gut feelings, I think we have enough instances now to reasonably expand the repo/db tiers. I'm planning to:
- Put repo002 into service.
- Put db002 into service.
- Change configuration so new instances allocate on those hosts.
I'm starting this now. This should ideally be routine, but I expect there will be a few hiccups.
I ran into these issues:
We don't actually install MySQL onto hosts in the db role.
I added mysql-server to the package set we install on hosts with the run.mysql attribute, in rCOREcad6a811.
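For illustration, a minimal sketch of the idea -- the real change is in rCOREcad6a811, and RUN_MYSQL here is just an assumed stand-in for however the attribute reaches the provisioning script:

```sh
# Hypothetical sketch: gate the MySQL server install on the run.mysql
# attribute. RUN_MYSQL is an assumed variable, not the actual mechanism.
if [ "${RUN_MYSQL:-0}" = "1" ]; then
  DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server
fi
```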
This got us further, but we had a non-state-based check for initializing /core/data/mysql, so the deployment was not fully resumable after a failure between binding the volume and initializing the MySQL data directory. I converted this to a state-based check in rCOREbbe431ca.
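Roughly, the difference is this (a sketch, not the actual rCORE code): instead of recording whether the initialization step has run, inspect the data directory itself, so a partially failed deploy can simply be re-run:

```sh
# State-based check (sketch): only initialize the datadir if the "mysql"
# system database is not already present, so this step is safe to repeat.
if [ ! -d /core/data/mysql/mysql ]; then
  mysql_install_db --user=mysql --datadir=/core/data/mysql
fi
```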
This cleared MySQL deployment and just left us with the "no instances" issue on both hosts.
When a new host comes up with no instances, we don't handle it cleanly at the tail end of deployment.
Here's the trace for this:
```
[2015-06-26 18:38:17] EXCEPTION: (Exception) Expected exactly one result, got none. at [<libcore>/workflow/host/CoreHostWorkflow.php:544]
libcore(), phutil(head=stable, ref.master=992abe4a420c, ref.stable=7a8f9e361585)
  #0 CoreHostWorkflow::readOneConduitResult(array) called at [<libcore>/workflow/host/CoreHostWorkflow.php:385]
  #1 CoreHostWorkflow::getHostInstances(NULL, boolean) called at [<libcore>/workflow/host/CoreHostWorkflow.php:330]
  #2 CoreHostWorkflow::selectInstances(PhutilArgumentParser) called at [<libcore>/workflow/host/CoreHostRestartWorkflow.php:19]
  #3 CoreHostRestartWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:406]
  #4 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:301]
  #5 PhutilArgumentParser::parseWorkflows(array) called at [/core/scripts/host.php:19]
```
This isn't really a bug -- we aren't looking for instances as I suspected earlier, we're looking for the device entry. It seems reasonable and correct that the device entry should be created before performing a deployment, so I made the loadOne...() semantics more similar to executeOne() semantics and specialized the error to be clearer, in rCORE87b7b11.
This gave both hosts clean deploys, terminating with a meaningful error about missing Almanac devices, which seems reasonable and likely desirable.
When creating device entries, I discovered AlmanacServiceQuery has a bug where name is not returned in the paging map. This prevents paging when an install has more than 100 service entries.
Paging is not critical to creating these services, so this doesn't block anything, although it's a little inconvenient. D13460 fixes this issue.
Next, I'll put these hosts into service.
I think this failed because MySQL is silently ignoring our configuration, and thus not actually listening on 3306. I'm suspicious this may be an AppArmor thing...
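A quick way to check both halves of that suspicion (generic commands; any equivalent works):

```sh
# Is mysqld actually listening on 3306?
netstat -lnt | grep ':3306'

# Is AppArmor blocking mysqld from reading the relocated config?
dmesg | grep -i apparmor | grep -i denied
```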
Yes, AppArmor.
```
Jun 26 18:38:15 db002 kernel: [ 1844.819416] type=1400 audit(1435343895.006:33): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/core/conf/mysql/my.cnf" pid=5526 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=1000
Jun 26 18:38:15 db002 kernel: [ 1844.825878] type=1400 audit(1435343895.014:34): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/core/conf/mysql/my.cnf" pid=5536 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=1000
```
This took a huge amount of manual fiddling last time, so it's not terribly surprising that I failed to capture some aspect of the required state.
AppArmor's profile loader (apparmor_parser) uses file modification time heuristics to decide whether it can reuse a cached compiled profile. That heuristic guesses the wrong result in the deployment case, so I added -T to stop it from reading the cache, in rCORE8371c05b. Instance upgrade/sync operations now work correctly.
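For reference, a sketch of what the cache-skipping reload looks like, assuming the stock Ubuntu profile path for mysqld:

```sh
# -r replaces the already-loaded profile; -T (--skip-read-cache) forces a
# fresh compile instead of trusting the mtime-based cache check.
apparmor_parser -r -T /etc/apparmor.d/usr.sbin.mysqld
```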
I'm now going to close the 001 hosts for new allocations, so new instances will go on these new machines.
So I think this ultimately went cleanly.
Stuff that went poorly:
Stuff that went well: