Move build step implementations to Arcanist and start work on build agents.
Abandoned (Public)

Authored by hach-que on Nov 6 2013, 10:43 AM.

Details

Reviewers

epriestley

Group Reviewers

Blessed Reviewers

Maniphest Tasks

T1049: Implement Harbormaster

Summary

This diff (and the relevant Phabricator diff) does three things:

Migrates BuildStepImplementation and SleepBuildStepImplementation into Arcanist.
Introduces arc agent; the intention is that this will be run on build agents to listen for and run builds.
Introduces the agent.nextbuild Conduit call, which currently just states that there are no builds to pick up.

I am not quite sure how we're going to reconcile the worker (which picks up builds immediately), and agents which have to poll (because Phabricator has no direct way of sending a message to the agent immediately).
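For illustration only, the polling side of this could look roughly like the sketch below. It is written in Python rather than PHP and is not the code from this diff: `poll_for_build`, its parameters, and the shape of the build dictionary are all assumptions, and `next_build` stands in for whatever would actually call agent.nextbuild.

```python
import time
from typing import Callable, List, Optional

NO_BUILD_MESSAGE = "No build to pick up."


def poll_for_build(
    next_build: Callable[[], Optional[dict]],
    run_build: Callable[[dict], None],
    max_polls: int,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll `next_build` up to `max_polls` times, running any build returned.

    `next_build` is a hypothetical stand-in for the agent.nextbuild call;
    it returns None when there is nothing to pick up.
    """
    log = []
    for attempt in range(max_polls):
        build = next_build()
        if build is None:
            log.append(NO_BUILD_MESSAGE)
        else:
            run_build(build)
            log.append("Ran build %s" % build["id"])
        # Wait between polls, but not after the final one.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return log
```

The delay between a build being queued and an agent noticing it is bounded by `interval_seconds`, which is the latency gap relative to a worker that picks up builds immediately.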

Test Plan

Applied the relevant changes to both Phabricator and Arcanist. Ran arc agent and it responded with 'No build to pick up.'

Verified in Phabricator that everything still works (adding build steps, etc.)

Diff Detail

Branch

move-build-step-impl

Lint

Lint Passed

Severity	Location	Code	Message
Advice	src/workflow/ArcanistBuildAgentWorkflow.php:66	XHP16	TODO Comment

Unit

No Test Coverage

Event Timeline

I'm not sure about this approach. In particular, I don't like the requirement that arc agent run on build hosts. One motivation is that I'd like the Phabricator ecosystem to require you to install PHP on as few machines as possible, since a lot of users really hate PHP. Another is that we can't fix arc agent if it dies, and we can't make it very easy to upgrade arc agent.

My intended approach for triggering command-based remote build steps was to have the workers just ssh to the build host and run the command via SSH. If the build step is an arc step, the ssh would then run arc unit --build or whatever. If not, it could run any other command. The machine wouldn't have to have PHP or have a daemon/agent running, or any credentials, and can be firewalled or otherwise isolated so that it can't make any interesting outbound connections and doesn't have any credentials on it. All of these attributes seem very desirable.
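A minimal sketch of the agent-less idea, assuming the worker simply shells out to the standard `ssh` client. The function names, the default `build` user, and the host name in the usage note are hypothetical, and this is Python for illustration rather than Harbormaster code.

```python
import shlex
import subprocess
from typing import List


def ssh_argv(host: str, command: List[str], user: str = "build") -> List[str]:
    """Build the argv for running `command` on `host` over SSH.

    Each part of the remote command is shell-quoted because ssh passes
    the command to the remote shell as a single string.
    """
    remote = " ".join(shlex.quote(part) for part in command)
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is what an unattended worker would want.
    return ["ssh", "-o", "BatchMode=yes", "%s@%s" % (user, host), remote]


def run_remote_step(host: str, command: List[str], user: str = "build") -> int:
    """Run the command on the build host and return its exit code."""
    return subprocess.run(ssh_argv(host, command, user)).returncode
```

With this shape, `run_remote_step("buildhost.example.com", ["make", "test"])` would run an arbitrary command, and an arc-based step would pass an arc command line in the same way. The build host needs only an SSH daemon, with no agent process and no outbound credentials.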

What's the driving force behind this?

We're running quite complex (cross-)builds on build hosts, and I agree that requiring Arcanist on those hosts doesn't make things any easier. Having Herald trigger the build is fine.

My intended approach for triggering command-based remote build steps was to have the workers just ssh to the build host and run the command via SSH.

Just as a data point here, we definitely need to be able to run builds on Windows systems as well.

This approach also seems like a really natural model for lots of native-code Open Source projects, where in a number of cases (this is common with buildbot, I don't know about other systems) it's users that donate the systems builds run on, hosted on their own premises - and in general it would be impossible for Phabricator to connect directly to these.

Maybe both models should be supported?

Is php much easier to install than sshd on Windows? I installed some sshd on my Windows machine some time ago and use it to debug arc stuff, and I don't remember it being very complicated, but maybe there are reasons this isn't viable.

Or is the issue more around making inbound connections to random users' hosts? There, is the issue that these users are not sophisticated enough to configure inbound NAT, or that they don't have control over the inbound network, or that they have dynamic IPs and can't reliably be connected to, or something else?

Is php much easier to install than sshd on Windows? I installed some sshd on my Windows machine some time ago and use it to debug arc stuff, and I don't remember it being very complicated, but maybe there are reasons this isn't viable.

Installing PHP on Windows (to be standalone rather than for a webserver) is pretty trivial and is well documented and supported (this draws strong parallels with running a buildbot slave on Windows, just PHP rather than Python). As for installing an SSH daemon, there seem to be a few common ways (OpenSSHd with Cygwin, or freeSSHd), but asking users to install a remote access solution and make it internet accessible seems slightly terrifying, although this would be tenable for ourselves, where we're currently only building on an internal VM.

MariaDB (just to pick a popular project doing this at random) provide information on running build commands on Windows via SSH (however in this case it's actually a pull client, the buildbot slave, running on another system in the internal network and "translating" the commands) as a fallback to being unable to run the buildbot slave directly on Windows, and have listed a number of caveats with the approach: https://mariadb.com/kb/en/Buildbot_Setup_for_Windows/

I may be sticking to what I know too much, and there may be some awesome other reason to have SSH be required, but "running a daemon" rather than "connect to random machine" seems quite a bit saner - but I guess like you've mentioned, it depends on organization topology (i.e. excluding the build machines from making external connections - in our case we're FTPing artifacts from the build slave as part of the build process, so it's moot).

Or is the issue more around making inbound connections to random users' hosts? There, is the issue that these users are not sophisticated enough to configure inbound NAT, or that they don't have control over the inbound network, or that they have dynamic IPs and can't reliably be connected to, or something else?

(As mentioned earlier we don't run this setup, we currently have enough of our own hardware to run the 3 build machines we need)

Dynamic IPs are definitely a huge problem with this, but with the IPv4 address space exhaustion, CGN is starting to be implemented (here in the UK, without even offering an IPv6 allocation) which would also preclude this.

Configuration wise, one of these looks more complicated than the other:

1. Get a static IP address or set up dynamic DNS; install an SSH daemon on the machine (which may entail installing Cygwin or disabling various security measures such as UAC); configure the internal network to forward a port to that machine and make sure it's open in the local firewall; give us the address and login details. Keep us updated if anything changes.

2. Install PHP and Arcanist (both of which have decent documentation for Windows); put this arcrc, with the credentials of the System Agent we configured for you, here; make sure this command gets run on boot. We'll let you know if anything needs to change.

Hell, even with our internal VMs I'd trust the 2nd process to go better.

Windows is definitely one of the major driving forces here; remoting to Windows clients is pretty impractical (even using PowerShell to automate Windows remotely is not very nice). The "push PHP to some random server" approach also doesn't work as well when you have pre-allocated Linux agents.

Of all of the build systems I've seen and used (BuildBot, Jenkins and Bamboo), they all get you to install the build agent on the box and then it connects to the build master. For Elastic Bamboo (where Bamboo runs agents on EC2), they get you to install the build agent, take an AMI and then Bamboo uses the AMI to spin up new instances. This approach seems to work generally well and will work across all systems (Linux and Windows).

The other advantage of putting the step implementation logic into Arcanist is that it means the steps don't have to be aware of remoting over to another machine. I thought about this for a while and if the step implementations had to explicitly do "run this remote command", then very quickly the step implementations are just all going to end up being giant piles of "run this remote command", instead of focusing on the actual task at hand (running unit tests or running build tools).

I don't think the upgrade issue is really there either; if we need to make an incompatible change to, say, agent.nextbuild, then we could pass a version parameter to the Conduit call. If the Arcanist version is older than we expect, we tell it to run a build whose only step is "arc upgrade and restart", which, as it says, runs "arc upgrade" and gets the build agent to restart itself.
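As a rough sketch of that version check (illustrative Python, not the actual Conduit method): the integer version scheme, `MINIMUM_AGENT_VERSION`, `UPGRADE_BUILD`, and `next_build_for` are all assumed names, and the build dictionaries are made up for the example.

```python
from typing import List, Optional

# Hypothetical: the oldest agent protocol version the server still accepts.
MINIMUM_AGENT_VERSION = 2

# Hypothetical synthetic build whose only step upgrades and restarts the agent.
UPGRADE_BUILD = {"id": "upgrade", "steps": ["arc upgrade and restart"]}


def next_build_for(agent_version: int, pending: List[dict]) -> Optional[dict]:
    """Decide what an agent reporting `agent_version` should run next.

    Outdated agents always receive the upgrade build; up-to-date agents
    receive the oldest pending build, or None if nothing is queued.
    """
    if agent_version < MINIMUM_AGENT_VERSION:
        return UPGRADE_BUILD
    if pending:
        return pending.pop(0)
    return None
```

Real builds stay queued while an outdated agent is being upgraded, so no work is handed to an agent that might not understand it.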

I'd like us to support some kind of "master to agent push" if the agent has a public IP address and some random port is open; in that case we should be serving up a simple HTTP server from within arc agent so that Phabricator can push builds immediately to a box, but we should also support a pull / polling method in case that isn't practical.
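One way to picture "push when possible, poll otherwise" is the sketch below. It is illustrative Python under assumptions: `dispatch`, the `push` callable, and the queue that polling agents would drain are hypothetical names, not part of Phabricator or Arcanist.

```python
from collections import deque
from typing import Callable, Deque, Optional


def dispatch(
    build: dict,
    push_url: Optional[str],
    push: Callable[[str, dict], bool],
    poll_queue: Deque[dict],
) -> str:
    """Try to push `build` to the agent; fall back to the polling queue.

    `push_url` is None for agents that are not reachable from the master.
    `push` returns True on success and may raise OSError on network failure.
    """
    if push_url is not None:
        try:
            if push(push_url, build):
                return "pushed"
        except OSError:
            # Agent unreachable right now; leave the build for polling.
            pass
    poll_queue.append(build)
    return "queued"
```

An agent behind NAT would register without a push URL and rely on polling, while a directly reachable agent gets builds immediately and still picks them up by polling if a push attempt fails.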

With build boxes behind NAT and firewalls, you will always have problems connecting to them. Is there any option to use a bidirectional protocol here? For example, XMPP (or many other protocols) can be used to establish a connection from the builder to the master, which is very NAT-friendly, and the master can then push commands to the builder over that connection. HTTP is not always the best choice for such scenarios...

@epriestley and I discussed this on IRC and we decided that it'd be better to try agent-less builds first and see how that goes; if users are encountering major issues using agent-less builds then we'll look into doing build agents.