Implement Phage (like Hypershell)
Closed, Resolved, Public

Assigned To

Authored By

	epriestley
	Mar 20 2013, 3:06 PM

Description

The simplified Phacility architecture in T2772 suggests that having Hypershell will be useful sooner than it would have been under the Drydock architecture; I think this is arguably even a blocker. Writing push on anything shy of Hypershell ("ssh in a loop lololol") will hit a scalability wall very quickly.

Facebook made various motions around open sourcing Hypershell over the years, but I never saw any real followthrough. I think the web UI got highly specialized around when I left. It is probably far far less work to just rebuild it from scratch than open source it (I think I can write a simplified version in a day or two, and Phabricator has much better architecture now, modern ExecFuture/queryfx, etc), but maybe this is worth at least coordinating on -- @edward, any sense of what the state of the world is?

Hypershell has a few useful pieces that may take a few iterations to replicate, notably around signal handling and the actual ssh command construction. I think I remember most of the stumbling blocks, but we'll see.

Hypershell relies on building itself into a single giant file and copying to agents. I think this feature is really good, but I'd like to come up with a smarter way to do this, although we can technically just do the same thing with libphutil if need be. I have a couple of ideas.

Unless it's completely unreasonable, the CLI component of phage should depend only on libphutil, which may require moving a few more pieces out of Phabricator (like LiskDAO).

Revisions and Commits

rPHU libphutil
	D17463	rPHU13a200ca7621 Support timeouts for Phage commands
	D17396	rPHUaf214c017801 Add limit (maximum simultaneous commands) and throttle (delay between commands)…
	D17387	rPHUef580c9ccc21 In Phage, don't sit in a loop once we've read all messages from an agent
	D17386	rPHU567e7fb929cd Record command exit status on Execute objects in Phage
	D17385	rPHU2e26268fb854 Fix a couple bad fprintf() method calls in Phage
	D17380	rPHUdbb46e76c829 Add PhageActions to libphutil
	D17378	rPHUd09e94e4b94a Make Phage agents stream output continuously
	D17368	rPHU9f66fbd018f1 In PhutilLogFileChannel, don't log empty messages
	D17354	rPHUd37145bd144e Fix an issue where resolveKill() emits a warning if the future was never started
	Restricted Differential Revision	rPHUfd799a5ed9f8 Phage Agent outline
rARC Arcanist
	D17379	rARCdc65bfbe5434 Put a Phage skeleton command on the Arcanist experimental branch

Related Objects

		Status	Assigned	Task
		Resolved	epriestley	T12218 Reduce the operational cost of a larger Phacility cluster
		Resolved	epriestley	T2794 Implement Phage (like Hypershell)

Event Timeline

epriestley claimed this task. Mar 20 2013, 3:06 PM

epriestley triaged this task as Normal priority.

epriestley added a project: Phacility.

epriestley added subscribers: epriestley, btrahan, edward.

epriestley edited this Maniphest Task. Mar 20 2013, 3:06 PM

davidreuss added a subscriber: davidreuss. Mar 20 2013, 8:22 PM

Almost all the diffs that we've seen in hypershell have been on the logic specific to the push. Yuk! I concur that a rebuild from scratch would be a perfectly reasonable way to proceed, especially given that doing so will force a reduction in features.

Apropos of that, I look forward to proposing unwanted bloat for this new tool in upcoming diffs.

Would you consider using a different type of storage (or providing the option for different storage) for the output of Phage-run commands? When an org scales to the point where phage is used in scripts and/or run by lots of engineers many times a day, it starts to take up a lot of query time and storage space. Being able to dump that somewhere other than the Phabricator DB might be useful, though I'm being vague on quantities. When you have many thousands of machines, being able to shard that command-exit-status-and-output data across multiple child DBs (per region) is also a helpful abstraction to have, even if the master job details are still in the main Phabricator DB.

I'm guessing that if we dumped the output in giant gzipped blobs instead of one uncompressed row per machine it would work fine on a single DB. I think the total size of the data isn't that big, we just split it up into a huge number of very small rows. That's fine for running jobs, but the archive format was always an afterthought. IIRC, Facebook also didn't have an automated archive/delete policy (there was hsh cleanup or something, but that never was really run?) but we can do that easily through the GC daemon.
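
A rough sketch of the "one gzipped blob instead of one row per machine" idea, to make the size argument concrete. This is not the Hypershell or Phage storage format; the record layout (`exit`, `output`) and the function names `pack_outputs` and `unpack_outputs` are assumptions made for illustration only.

```python
import gzip
import json


def pack_outputs(outputs):
    """Pack {host: {"exit": int, "output": str}} into one gzipped JSON blob.

    The record layout is hypothetical; the point is that many small,
    repetitive per-host rows become a single compressed archive value.
    """
    raw = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return gzip.compress(raw)


def unpack_outputs(blob):
    """Inverse of pack_outputs(): recover the per-host dictionary."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def demo_sizes(host_count=1000):
    """Return (raw_bytes, gzipped_bytes) for a synthetic archived job."""
    outputs = {
        "host%04d" % i: {"exit": 0, "output": "Restarting HTTP servers...\n"}
        for i in range(host_count)
    }
    raw = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return len(raw), len(pack_outputs(outputs))


if __name__ == "__main__":
    raw_size, packed_size = demo_sizes()
    print("raw=%d bytes, gzipped=%d bytes" % (raw_size, packed_size))
```

Output from a deployment command is usually near-identical across hosts, so it tends to compress well. The trade-off is that reading one host's output from an archived job means decompressing the whole blob, which seems acceptable for archives but not for jobs that are still running.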

I think I have some new ideas which should reduce startup costs and remove the MySQL dependency for agents. Let me see if I can get a proof of concept built...

epriestley edited this Maniphest Task. Apr 4 2013, 12:30 PM

epriestley edited this Maniphest Task. Apr 4 2013, 5:17 PM

I want to build this some day but it isn't going to happen for a while and doesn't block SAAS.

joshuaspence added a subscriber: joshuaspence. May 31 2015, 6:17 AM

joshuaspence awarded a token. May 31 2015, 6:20 AM

eadler added a subscriber: eadler. Apr 10 2016, 6:14 AM

epriestley changed the visibility from "All Users" to "Public (No Login Required)". Feb 6 2017, 8:51 PM

epriestley changed the edit policy from "All Users" to "Community (Project)".

epriestley added a parent task: T12218: Reduce the operational cost of a larger Phacility cluster.

Per T12218, the Phacility cluster is reaching a scale (~50 devices) where we need to improve tooling over "open a bunch of terminal windows". I want to look at implementing Phage for this use case and see how far I can get.

Phage is the successor to Hypershell, which was a deployment/operations tool that I wrote at Facebook. Roughly, the tool just runs SSH commands on a large number of hosts quickly. These parts of Hypershell worked well:

Scale: Hypershell operated successfully on ~200K hosts.
Agentless: Hypershell is agentless, and does not require hosts to have (much) existing software before it can interact with them. It connects to them directly via SSH.
Copy-Then-Execute: Hypershell deploys itself ahead of executing commands, so every command executes on the same version of Hypershell you executed. This makes fixing bugs very fast: you just make a change, then run a command.
Web Interface: Hypershell emitted summary information to the console by default (with some flags to control behavior) and sent full details to a web UI where they could be reviewed. This meant that every command was logged by default and could be shared/examined.

I think these parts can be improved:

File-based bootstrapping: Hypershell copied itself by literally copying files and then running them. Phage copies itself by essentially opening up a REPL on the remote, then piping itself into the REPL, then executing in a run mode. All the Phage code is in-process and it doesn't need to touch the disk. This means faster startup and no weird tempfile stuff. This has a proof-of-concept today.
MySQL as command-and-control: Hypershell agents talked directly to MySQL. I plan for Phage command-and-control to occur only over the Phage connection network, and for only the originating process to interact with MySQL.
PHP required: Hypershell required PHP to be available on any potential agents. Phage will support bootstrapping in non-PHP languages, although we may not develop any first-party agents for some time. Agents will be significantly simpler than Hypershell agents were, so the required code to implement a non-PHP agent should be fairly realistic.
Various data storage decisions: Some of the data storage details in MySQL could be improved.
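
The pipe-based bootstrapping described in the list above can be sketched roughly as follows. This is a toy under stated assumptions, not Phage's implementation: a local Python interpreter stands in for the remote side of an SSH connection, Python stands in for PHP, and `LOADER`, `AGENT_SOURCE`, `bootstrap`, `request` and `run_demo` are all hypothetical names.

```python
import subprocess
import sys

# Tiny loader started on the "remote": read a byte count, read that many
# bytes of source from stdin, and execute it in-process. Nothing is
# written to disk.
LOADER = """
import sys
n = int(sys.stdin.buffer.readline())
src = sys.stdin.buffer.read(n).decode('utf-8')
exec(compile(src, '<agent>', 'exec'))
"""

# Hypothetical agent: answer each request line until stdin is closed.
AGENT_SOURCE = """
import sys
while True:
    line = sys.stdin.buffer.readline()
    if not line:
        break
    text = line.decode('utf-8').strip().upper()
    print('agent: ' + text, flush=True)
"""


def bootstrap(argv):
    """Start the loader process and stream the agent source into it."""
    proc = subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    payload = AGENT_SOURCE.encode("utf-8")
    proc.stdin.write(b"%d\n" % len(payload))
    proc.stdin.write(payload)
    proc.stdin.flush()
    return proc


def request(proc, text):
    """Send one line to the running agent and read one line back."""
    proc.stdin.write(text.encode("utf-8") + b"\n")
    proc.stdin.flush()
    return proc.stdout.readline().decode("utf-8").strip()


def run_demo():
    # A real deployment would start the loader through SSH instead of a
    # local interpreter; that part is not shown here.
    proc = bootstrap([sys.executable, "-c", LOADER])
    replies = [request(proc, "uptime"), request(proc, "whoami")]
    proc.stdin.close()
    proc.wait()
    proc.stdout.close()
    return replies, proc.returncode


if __name__ == "__main__":
    print(run_demo())
```

The same pipe that delivered the agent source then carries requests and replies, so there is no tempfile to create or clean up, and the remote always runs exactly the code the originating process sent.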

Additionally:

D16177 initially explores Phage as a fancier shell replacement. Some operations are potentially more convenient or more performant over the Phage channel than over raw SSH. You basically get a command line which can interact with either side of the pipe, so operations like "copy file X to location Y" don't require a separate scp process, and "copy file X to location Y on hosts A, B, C, ..." can run without extra copies and without needing to worry about cleaning up an intermediate temporary file.

I currently imagine the phage CLI living in arcanist/ and interacting with Phabricator over Conduit. The web UI will live in Phabricator, but can be optional, especially at first. Putting the CLI in arcanist/ will let us leverage extensions, project configuration, and packages (so a project could define a phage deploy action particular to that project, conceivably).

We can solve the Phacility problems with a much less ambitious approach, so if this doesn't feel like it's moving forward quickly enough I may bail and pursue something less fancy until we outscale "SSH in a loop" or "a big .sh file with every command written in order" or whatever.

epriestley mentioned this in T12218: Reduce the operational cost of a larger Phacility cluster. Feb 10 2017, 4:55 PM

epriestley added a revision: D17354: Fix an issue where resolveKill() emits a warning if the future was never started. Feb 14 2017, 3:17 PM

epriestley added a commit: rPHUd37145bd144e: Fix an issue where resolveKill() emits a warning if the future was never started. Feb 14 2017, 3:26 PM

I'm making some amount of progress here, although I've only made it slightly past "run SSH in a loop" so far.

I have reserved the coolest monogram, X, for eventual use in Phage. Only A remains.

epriestley added a revision: D17368: In PhutilLogFileChannel, don't log empty messages. Feb 16 2017, 1:50 PM

This is still a shallow ghost of Hypershell, but I successfully deployed secure with it:

$ ./bin/phage remote --hosts secure001-4 upgrade
[secure002]  REMOVE FIREWALL  Flushing iptables rules for perimeter host...
[secure002]  IP BANS  Dropping traffic from: 167.114.129.164.
[secure002]  UPGRADING LIBRARIES  Upgrading application libraries.
[secure002]  PULL  Upgrading library "libphutil" (master)...
[secure002]  PULL  Upgrading library "keystore" (master)...
[secure002]  PULL  Upgrading library "sshd" (master)...
[secure002]  PULL  Upgrading library "arcanist" (master)...
[secure002]  PULL  Upgrading library "phabricator" (master)...
[secure002]  PULL  Upgrading library "services" (master)...
[secure002]  PULL  Upgrading library "secure" (master)...
[secure004]  REMOVE FIREWALL  Flushing iptables rules for perimeter host...
[secure004]  IP BANS  Dropping traffic from: 167.114.129.164.
[secure004]  UPGRADING LIBRARIES  Upgrading application libraries.
[secure004]  PULL  Upgrading library "libphutil" (master)...
[secure004]  PULL  Upgrading library "keystore" (master)...
[secure004]  PULL  Upgrading library "sshd" (master)...
[secure004]  PULL  Upgrading library "arcanist" (master)...
[secure004]  PULL  Upgrading library "phabricator" (master)...
[secure004]  PULL  Upgrading library "services" (master)...
[secure004]  PULL  Upgrading library "secure" (master)...
[secure003]  REMOVE FIREWALL  Flushing iptables rules for perimeter host...
[secure003]  IP BANS  Dropping traffic from: 167.114.129.164.
[secure003]  UPGRADING LIBRARIES  Upgrading application libraries.
[secure003]  PULL  Upgrading library "libphutil" (master)...
[secure003]  PULL  Upgrading library "keystore" (master)...
[secure003]  PULL  Upgrading library "sshd" (master)...
[secure003]  PULL  Upgrading library "arcanist" (master)...
[secure003]  PULL  Upgrading library "phabricator" (master)...
[secure003]  PULL  Upgrading library "services" (master)...
[secure003]  PULL  Upgrading library "secure" (master)...
[secure001]  REMOVE FIREWALL  Flushing iptables rules for perimeter host...
[secure001]  IP BANS  Dropping traffic from: 167.114.129.164.
[secure001]  UPGRADING LIBRARIES  Upgrading application libraries.
[secure001]  PULL  Upgrading library "libphutil" (master)...
[secure001]  PULL  Upgrading library "keystore" (master)...
[secure001]  PULL  Upgrading library "sshd" (master)...
[secure001]  PULL  Upgrading library "arcanist" (master)...
[secure001]  PULL  Upgrading library "phabricator" (master)...
[secure001]  PULL  Upgrading library "services" (master)...
[secure001]  PULL  Upgrading library "secure" (master)...
[secure004]  MySQL ID  Assigning MySQL server ID 4000 (secure004.phacility.net).
[secure003]  MySQL ID  Assigning MySQL server ID 3000 (secure003.phacility.net).
[secure002]  MySQL ID  Assigning MySQL server ID 2000 (secure002.phacility.net).
[secure001]  MySQL ID  Assigning MySQL server ID 1000 (secure001.phacility.net).
[secure004]  RESTART HTTP  Restarting HTTP servers...
[secure002]  RESTART HTTP  Restarting HTTP servers...
[secure003]  RESTART HTTP  Restarting HTTP servers...
[secure001]  RESTART HTTP  Restarting HTTP servers...
[secure004]  RESTART APHLICT  Restarting Aphlict servers...
[secure004]  DIRECT SSH  Synchronizing direct SSH keys.
[secure004]  DONE  Host application software upgraded; restarting instances.
[secure004]  UPGRADE STORAGE  Upgrading instance storage for instance "secure"...
[secure004]  SKIPPING ROTATION  No logfile directory "/core/log/secure/phd" exists.
[secure004]  STOPPING DAEMONS  Stopping all running daemons on this device...
[secure002]  RESTART APHLICT  Restarting Aphlict servers...
[secure003]  RESTART APHLICT  Restarting Aphlict servers...
[secure002]  DIRECT SSH  Synchronizing direct SSH keys.
[secure002]  DONE  Host application software upgraded; restarting instances.
[secure002]  UPGRADE STORAGE  Upgrading instance storage for instance "secure"...
[secure001]  RESTART APHLICT  Restarting Aphlict servers...
[secure002]  SKIPPING ROTATION  No logfile directory "/core/log/secure/phd" exists.
[secure002]  STOPPING DAEMONS  Stopping all running daemons on this device...
[secure003]  DIRECT SSH  Synchronizing direct SSH keys.
[secure003]  DONE  Host application software upgraded; restarting instances.
[secure003]  UPGRADE STORAGE  Upgrading instance storage for instance "secure"...
[secure003]  SKIPPING ROTATION  No logfile directory "/core/log/secure/phd" exists.
[secure003]  STOPPING DAEMONS  Stopping all running daemons on this device...
[secure001]  DIRECT SSH  Synchronizing direct SSH keys.
[secure001]  DONE  Host application software upgraded; restarting instances.
[secure002]  RESTARTING DAEMONS  Restarting daemons for instance "secure"...
[secure001]  UPGRADE STORAGE  Upgrading instance storage for instance "secure"...
[secure001]  SKIPPING ROTATION  No logfile directory "/core/log/secure/phd" exists.
[secure001]  STOPPING DAEMONS  Stopping all running daemons on this device...
[secure002]  DONE  Restarted instance "secure".
[secure001]  RESTARTING DAEMONS  Restarting daemons for instance "secure"...
[secure001]  DONE  Restarted instance "secure".
[secure004]  RESTARTING DAEMONS  Restarting daemons for instance "secure"...
[secure004]  DONE  Restarted instance "secure".
[secure003]  RESTARTING DAEMONS  Restarting daemons for instance "secure"...
[secure003]  DONE  Restarted instance "secure".

This doesn't do the fanout/spanning tree stuff or intermediate agents yet, but it bootstraps a local agent "over the wire" by piping the code into sh -c ... and then interacts with it using the agent wire protocol, so it's doing "real" Phage work, not just faking things using FutureIterator or a foreach loop.
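
For readers unfamiliar with what an "agent wire protocol" over a pipe involves, here is a minimal framing sketch. It is an assumption-laden illustration, not the actual Phage protocol: the 4-byte length prefix, the JSON bodies, the message fields (`type`, `command`, `text`, `code`) and the names `encode_message` and `MessageDecoder` are all invented for this example.

```python
import json
import struct


def encode_message(message):
    """Frame one message as a 4-byte big-endian length plus a JSON body."""
    body = json.dumps(message).encode("utf-8")
    return struct.pack(">I", len(body)) + body


class MessageDecoder:
    """Reassemble framed messages from arbitrarily sized chunks of bytes."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """Add bytes read from the pipe; return any complete messages."""
        self._buffer += data
        messages = []
        while len(self._buffer) >= 4:
            (size,) = struct.unpack(">I", self._buffer[:4])
            if len(self._buffer) < 4 + size:
                break  # Body not fully received yet; wait for more bytes.
            body = self._buffer[4:4 + size]
            self._buffer = self._buffer[4 + size:]
            messages.append(json.loads(body.decode("utf-8")))
        return messages


def demo():
    """Encode three hypothetical messages and decode them in small chunks."""
    sent = [
        {"type": "exec", "command": "uptime"},
        {"type": "output", "text": "Restarting HTTP servers...\n"},
        {"type": "exit", "code": 0},
    ]
    stream = b"".join(encode_message(m) for m in sent)
    decoder = MessageDecoder()
    received = []
    # Feed 5 bytes at a time to mimic partial reads from a pipe.
    for start in range(0, len(stream), 5):
        received.extend(decoder.feed(stream[start:start + 5]))
    return sent, received


if __name__ == "__main__":
    sent, received = demo()
    print(received == sent)
```

Explicit framing matters because reads from an SSH pipe can split or merge messages arbitrarily. Separate output and exit messages are one way an agent could stream output continuously while still reporting a per-command exit status.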

I don't think much of this will make it upstream for a while, but I plan to deploy the cluster with it tomorrow.

chad awarded a token. Feb 17 2017, 10:04 PM

epriestley added a revision: D17378: Make Phage agents stream output continuously. Feb 17 2017, 10:42 PM

epriestley added a revision: D17379: Put a Phage skeleton command on the Arcanist experimental branch. Feb 17 2017, 10:59 PM

epriestley added a revision: D17380: Add PhageActions to libphutil. Feb 17 2017, 11:14 PM

epriestley added a commit: rPHU9f66fbd018f1: In PhutilLogFileChannel, don't log empty messages. Feb 18 2017, 12:57 AM

epriestley added a commit: rPHUd09e94e4b94a: Make Phage agents stream output continuously.

epriestley added a commit: rPHUdbb46e76c829: Add PhageActions to libphutil.

epriestley added a commit: rARCdc65bfbe5434: Put a Phage skeleton command on the Arcanist experimental branch.

Cool stuff. I didn't see you already had made so much progress before I commented on T12218. You can safely disregard my clustershell suggestion.

Random minimally relevant observation: Hypershell is vaguely similar to ansible in that it copies itself to the target hosts and then runs the agent over ssh.

It looks like clustershell is doing something pretty similar to Hypershell/Phage in its "gateway" mode (but it looks like it can't bootstrap itself and doesn't have a web UI).

I'm not really familiar with ansible but it looks like its pipelining mode may be very similar to Phage's new-and-improved bootstrap mode (i.e., pipe all the agent code directly over the wire, instead of copying it). It doesn't appear to have a gateway/agent mode for scaling out, however, or at least not one I can find immediately? I think this use case is pretty rare, and even clustershell treats 5,000 nodes as a "large cluster" (realistically we aren't going to have a cluster larger than this, of course). The "plan + execute" approach I'm using (which is new in Phage) looks similar to Ansible's "playbook" approach.

A legitimate criticism of Phabricator in general is that we have a strong case of Not-Invented-Here / Reinvent-Literally-Every-Wheel. In many cases, we could probably use less first-party software to achieve similar results more quickly, and any existing approach -- clustershell, ansible, or even ssh-in-a-loop with dsh -- could give us tools to attack the immediate deployment scale issues in T12218 on a shorter timeline than Phage will (and we don't actually need any of the advanced features of any of these shells, and probably won't for a long time).

But I think Phage isn't a huge project (in this case, I previously wrote Hypershell at Facebook, so I have at least some vague idea of how much work it took) and may reasonably pay for itself in the long run (I think we can get a great deal of value eventually through policy, account, Almanac, Drydock, Passphrase, Conduit, etc., integrations), and I sleep better at night with fewer external pieces in cluster operations, and it's a project I personally like working on. But we're definitely not treading any new ground here, and this might reasonably not be the most valuable thing I could be building.

We don't reinvent every wheel, and use externals in many cases where the value is obviously very high -- for example: QR code generation, Porter stemming, JSON parsing, MIME parsing, CLDR and Emojione data, the Openwall password list, Pygments highlighting, Excel spreadsheet generation, SMTP handling, payment processing, etc. But the bar for reuse-vs-reinvent definitely skews heavily toward reinvent in this project. I think that skew is mostly a defensible one, but there's certainly room for argument that we should be more open to dependencies/externals, especially in cases like this.

I totally get the motivation to have a strong collection of highly integrated tools which all work together seamlessly and provide a value that is larger than the sum of its parts.

At Wikimedia we recently made yet another deployment tool (scap) so I'm very familiar with reinventing wheels.

Phage does look like a fun project and I hope it works out as planned. Will you be open-sourcing it or keeping it as a phacility proprietary tech?

We'll open source everything that's general-purpose. (Currently, the only piece that's closed is the "plan" for Phacility deployments specifically. Since we don't have a generic "SSH" style plan as a starting point yet that plan is currently important, but we'll write generic plans or a general plan-executor or something eventually.)

epriestley added a revision: D17385: Fix a couple bad fprintf() method calls in Phage. Feb 19 2017, 2:54 PM

epriestley added a revision: D17386: Record command exit status on Execute objects in Phage. Feb 19 2017, 2:58 PM

epriestley added a revision: D17387: In Phage, don't sit in a loop once we've read all messages from an agent. Feb 19 2017, 3:04 PM

cspeckmim added a subscriber: cspeckmim. Feb 19 2017, 4:51 PM

ftdysa added a subscriber: ftdysa. Feb 19 2017, 5:35 PM

epriestley added a commit: rPHU2e26268fb854: Fix a couple bad fprintf() method calls in Phage. Feb 19 2017, 6:54 PM

epriestley added a commit: rPHU567e7fb929cd: Record command exit status on Execute objects in Phage.

epriestley added a commit: rPHUef580c9ccc21: In Phage, don't sit in a loop once we've read all messages from an agent.

thoughtpolice awarded a token. Feb 21 2017, 8:57 PM

thoughtpolice added a subscriber: thoughtpolice.

epriestley added a revision: D17396: Add limit (maximum simultaneous commands) and throttle (delay between commands) to Phage. Feb 22 2017, 9:49 PM

epriestley added a commit: rPHUaf214c017801: Add limit (maximum simultaneous commands) and throttle (delay between commands)…. Feb 22 2017, 10:11 PM

epriestley added a revision: D17463: Support timeouts for Phage commands. Mar 4 2017, 6:11 PM

epriestley added a commit: rPHU13a200ca7621: Support timeouts for Phage commands. Mar 4 2017, 6:15 PM

I've been doing production deployments with this for a while without hitting any meaningful issues. It's not general-purpose yet but followups can live in Phage from here on out.

epriestley mentioned this in Phage (Phacility). Apr 10 2017, 3:23 PM

It seems to me like certain parts of phage are not published in a public repo. Are there plans to open up missing pieces?

For instance, I found P2107 which seems to be an older snapshot of one file.

The non-public parts of Phage are currently very specific to Phacility's cluster and probably not generally useful. The current version of PhageRemoteWorkflow is similar to P2107 and depends on particular Phacility services and hosts to enumerate valid remotes and negotiate a connection to them through a bastion pool. These service-listing and bastion-host components are not generalized and not trivially generalizable.

I had imagined generalizing these components and moving them into arcanist/, but never got that far. (Part of the "toolsets" change to Arcanist was to better support alternate toolsets, like phage and piledriver -- see T13630.)

It currently seems unlikely that I'll continue this work, so I don't expect Phage or Piledriver to become available as general-purpose tools which third parties could plausibly make use of.

Thank you for the answer, appreciate it, and your effort that goes into arcanist.

I would push back on not being generally useful, a remote workflow (plus execution with agents) should be considered a primary use case of phage in general.
