
Reduce the operational cost of a larger Phacility cluster
Closed, Resolved · Public


The Phacility cluster has grown since launch, and hardware growth (partly driven by free instances, discussed in T12217) has accelerated recently. Currently, the production cluster is ~50 total devices. T12217 discusses some ways in which we can slow hardware growth, but even if those are fully implemented, the cluster isn't getting any smaller.

T7346 is an old task (from around launch) discussing general scalability concerns. These are currently the three most urgent concerns:

Operational Tools: We don't have a hypershell, and all cluster interaction is 1:1. Deploying the cluster involves opening up a whole pile of console windows and manually entering 100+ commands to deploy and verify everything. This is error-prone and will only become more so as the cluster continues to grow.

Beyond all the typing being error-prone, there's no centralized deployment logging or general encapsulation of an operational session: there's no way to go back and review how a deployment went, which commands were run, which errors occurred, and so on.

The ideal fix is Phage ("Hypershell 2"), which we know to be a flexible tool at large scale, but it isn't cheap to build. It would give us hypershell capabilities as part of the core. Phage/Hypershell are also very cool.

A less-ideal fix is a bin/remote deploy-everything command which at least fixes all the typing. This is cheaper to build.
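As a sketch of what that cheaper command amounts to, here is a fan-out over SSH in Python (the host names, deploy command, and transport are illustrative assumptions, not the real bin/remote internals):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_runner(host, command):
    """Run `command` on `host` over SSH; return the exit code."""
    result = subprocess.run(["ssh", host, command],
                            capture_output=True, text=True)
    return result.returncode

def run_everywhere(hosts, command, runner=ssh_runner, workers=16):
    """Fan one command out across every host in parallel and collect
    exit codes, so a whole deploy is one invocation instead of 100+."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, codes))
```

The injectable `runner` also makes the fan-out logic testable without touching real hosts.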

Monitoring: Monitoring is currently very limited, and some monitoring signals (notably, free disk space and deployed versions) are on-demand only. There is no screen where you can see, at a glance, which disks across the entire cluster are near capacity, or whether any hosts are running the wrong software versions.

Some other things, like the instance allocator, could take advantage of this if it were centralized. The allocator currently considers shards "full" based only on how many active instances they have. It would be more powerful if it could also consider other signals, like disk fullness.
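To illustrate, a shard-scoring function that weighs several signals instead of a bare instance count might look like this (the field names and the 80% disk threshold are invented for the example):

```python
def shard_is_full(shard):
    """Consider a shard "full" using several signals, not just the
    active instance count. Fields and thresholds are illustrative."""
    if shard["active_instances"] >= shard["max_instances"]:
        return True
    if shard["disk_used_fraction"] >= 0.80:
        return True
    return False

def pick_shard(shards):
    """Allocate onto the open shard with the most free disk."""
    open_shards = [s for s in shards if not shard_is_full(s)]
    if not open_shards:
        return None
    return min(open_shards, key=lambda s: s["disk_used_fraction"])
```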

I'd ideally like to build something like Reticle integrated with Almanac and Facts: allow devices and services to have information pushed or pulled, then chart it and expose it to other consumers. This would give us some monitoring support as part of the core. I'm not sure anyone but us would use it, but integrated monitoring is so operationally useful that I think this should probably be first-party even if it sees no other use.
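A minimal sketch of the "pull device data, then expose it to consumers" idea, assuming a hypothetical df-over-SSH collector and an invented 90% threshold:

```python
import subprocess

def poll_disk_usage(host):
    """Pull the used-disk fraction from one host; the df invocation is
    an illustrative stand-in for a real Almanac/Facts collector."""
    output = subprocess.run(
        ["ssh", host, "df --output=pcent / | tail -1"],
        capture_output=True, text=True).stdout
    return int(output.strip().rstrip("%")) / 100.0

def flag_near_capacity(samples, threshold=0.90):
    """Given {device: used_fraction} samples, return the devices that
    are near capacity -- the "at a glance" view we currently lack."""
    return sorted(device for device, used in samples.items()
                  if used >= threshold)
```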

Disk Operations / Shard Migrations: Increasing the size of disks is manual and requires significant, uncommunicated downtime. This will be more of a problem as we start packing more instances per host.

Moving instances between shards is manual and requires significant, uncommunicated downtime.

At a minimum, these operations should become automated and communicated. Ideally, they should also become faster / less disruptive, although they aren't terribly common and it's not the end of the world if they don't get much better than they currently are. By design, we have no individual datastores which require enormously long amounts of time to copy.

Revisions and Commits

Restricted Differential Revision
rPHU libphutil

Event Timeline

I scratched at the surface of T2794, but @chad also bought me a larger monitor so I can put more SSH terminals on it, which should buy us another few weeks.

epriestley added a revision: Restricted Differential Revision. Feb 17 2017, 11:18 PM
epriestley added a commit: Restricted Diffusion Commit. Feb 18 2017, 12:58 AM

@epriestley: Have you seen clustershell? It's a massively scalable management shell with some pretty interesting features. Wikimedia operations are adopting it in place of a lot of dsh and saltstack stuff. From what I have seen, it is a really solid framework to build upon.

(For anyone following along, some additional discussion of clustershell in T2794.)

I deployed the cluster with Phage yesterday with mostly good results. I hit a handful of minor bugs with Phage (some of which have diffs out now) and a few cases where planned improvements (signal handling, output control / web UI) would have been helpful, none of which was too surprising.

Phage was able to start deployments more quickly than I could manually type server names (shocking!), so we ran into a couple of cases of admin getting slammed by inefficient Conduit calls (T11506 was the last related work on this, but focused on servers making inefficient calls during normal operation). I made a minor improvement which was immediately available (rCORE265aa1c5) but some of what we're doing is still way less efficient than it should be. This is currently the scalability and stability bottleneck for deployments, but should be relatively easy to fix.

I'm also likely to add "limit" (maximum parallelism) and "throttle" (artificial delay between starting commands) support to Phage mostly as generally reasonable features that move toward some other features like better retry support. They aren't the best solution here, but they're probably the simplest one.
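A rough sketch of what "limit" and "throttle" semantics amount to (this is illustrative Python, not Phage's actual implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_limits(tasks, run, limit=4, throttle=0.0):
    """Run `run(task)` for each task with at most `limit` in flight
    ("limit"), sleeping `throttle` seconds between starts ("throttle")."""
    futures = []
    with ThreadPoolExecutor(max_workers=limit) as pool:
        for task in tasks:
            futures.append(pool.submit(run, task))
            if throttle:
                # Artificial delay between starting commands, so we don't
                # slam shared services with a burst of simultaneous calls.
                time.sleep(throttle)
    return [f.result() for f in futures]
```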

epriestley added a commit: Restricted Diffusion Commit. Feb 19 2017, 5:01 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.

This is currently the scalability and stability bottleneck for deployments, but should be relatively easy to fix.

I fixed a big chunk of this (rCORE0ba8e621bf47, rSAAScded669ca4f2) but there's still some sort of CPU bottleneck in instances.queryinstances. Unfortunately, it's also difficult to profile Conduit method calls for a variety of reasons: the XHProf UI is not installed there, you can't use the DarkConsole button on the Conduit console page since it's a POST request, and uploading profiles into another XHProf UI is currently a pain. I'm not sure how far I want to run down that rabbit hole.

I did the last deploy (T12316) with Phage again and things went much more smoothly after the various bugfixes and improvements.

The biggest issue I hit was that commands (like apt-get) will occasionally hang for whatever reason, and there's currently no good way to figure out which nodes need to be retried (the best available approach is to grep the logs). It would be helpful to have:

  • Generic support for command timeouts.
  • Maybe, generic support for "consider this command timed out if it doesn't produce any output for X seconds", although I'm less pumped about that.
  • When we get ^C'd: handle the signal, shut down the agents, and print results (command failures/successes) so the operator can make better decisions about retrying failures.
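A sketch of the first and third points above — a hard per-command timeout plus a success/failure partition the operator can act on after a ^C (the SSH invocation and default timeout are assumptions):

```python
import subprocess

def run_with_timeout(host, command, timeout=300):
    """Run a command on one host with a hard timeout; a hung command
    (like a stuck apt-get) counts as a failure instead of blocking."""
    try:
        result = subprocess.run(["ssh", host, command],
                                capture_output=True, text=True,
                                timeout=timeout)
        return (host, result.returncode == 0)
    except subprocess.TimeoutExpired:
        return (host, False)

def partition_results(results):
    """Split (host, ok) pairs into successes and failures so the
    operator can see exactly which nodes to retry, without grepping."""
    succeeded = [host for host, ok in results if ok]
    failed = [host for host, ok in results if not ok]
    return succeeded, failed
```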

For Wikimedia's "scap" tool, we built a feature called "checks": configurable commands that run on each target machine and confirm whether the desired commands were successful.

Checks are essentially unit tests for the deployment state, and their exit codes can fail the deploy. A failed check prompts the operator to either roll back or ignore the failure.

Obviously, this only works for things which can easily be tested. One straightforward example: provide a URL in the application which outputs the current version of the code. The check can confirm that the current version matches the expected version, validating not only that the code was updated, but also that the service restarted successfully and is able to serve a simple request without fatals.
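A sketch of that example check, with an injectable fetcher so the pass/fail logic is visible (the endpoint URL and JSON shape are hypothetical):

```python
import json
import urllib.request

def fetch_version(host):
    """Hit a (hypothetical) version endpoint on the target host."""
    url = "https://{}/status/version".format(host)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["version"]

def version_check(host, expected, fetch=fetch_version):
    """A "check" in the scap sense: pass only if the service answers
    and the deployed version matches. A fatal, an unreachable service,
    or a version mismatch all fail the check (and can fail the deploy)."""
    try:
        deployed = fetch(host)
    except Exception:
        return False
    return deployed == expected
```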

I know this is nothing revolutionary but it's provided a lot of utility with minimal complexity in our tooling.

epriestley claimed this task.

We no longer offer free instances so tier growth is slower, and I plan to compact the tier in the nearish term.

Phage is generally doing well for direct operational interactions.

Monitoring could be better but things mostly just work, so it's hard to prioritize. (We've added a few CloudWatch alarms but they mostly just raise false positives.)

We have somewhat better shard migration and disk resizing approaches now.

See also T13076 for the next phase of plans here.