
Design the Drydock Blueprint selection mechanism
Closed, Resolved · Public

Assigned To
None
Authored By
epriestley
Oct 6 2015, 11:24 AM

Description

Some discussion in T9252 and T182. When some other application tries to acquire a lease on a Drydock resource (like Harbormaster performing a build, or Diffusion performing a merge), we need a mechanism to decide which blueprints the application may use.

In particular, there are security/trust concerns in the T182 use case. We can't let merges and builds share the same resources, because builds are less-trusted and may compromise merges. In general, these properties should hold:

  • Merges must only acquire high-trust resources.
  • Builds must never acquire high-trust resources.

But I think it goes further than this: I probably should not be able to compromise merges by adding a new blueprint that I claim is "high trust". The broader version of this is that it's surprising if you spin up a new tier for Android builds and the whole thing is instantly saturated by iPhone builds just because the hosts are technically compatible. This is bad on both sides: it's not what you wanted, and it's not what whoever owns the iPhone builds wanted.

So it's bad if jobs use surprising blueprints, and it's bad if blueprints are used by surprising jobs.

This makes me lean toward requiring explicit, opt-in authorization on both ends of the relationship: build plan X would explicitly say that it's OK to use blueprint Y, and blueprint Y would explicitly agree that it is usable by build plan X.
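
For concreteness, here is a minimal sketch of that two-sided handshake as data (a hypothetical Python sketch with illustrative names, not an actual Phabricator schema):

```
# Hypothetical sketch: a lease is only permitted when BOTH ends have
# explicitly opted in.

from dataclasses import dataclass

@dataclass
class Authorization:
    object_phid: str          # e.g., a Harbormaster build plan
    blueprint_phid: str       # the Drydock blueprint
    object_approved: bool     # plan X says "it's OK to use blueprint Y"
    blueprint_approved: bool  # blueprint Y says "plan X may use me"

def is_authorized(auth: Authorization) -> bool:
    # Anything short of an explicit yes on both sides denies the lease.
    return auth.object_approved and auth.blueprint_approved
```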

I think this is good in terms of preventing abuse (and preventing surprises). The obvious drawback is that it is very inflexible. However, I think this selection mechanism may not need to be very flexible. Here are some use cases I can come up with:

  • You want to add or remove hosts from a pool. You can already do this through Almanac. With virtualized blueprints in the future, this is automatic.
  • You want to move a lot of build plans from static allocation (Almanac) to dynamic allocation (EC2). It seems OK/reasonable to do these somewhat individually so you can test them? You get a list of the plans you have to move anyway by seeing all the authorized plans. You probably shouldn't have hundreds of plans (parameterize a smaller number of plans instead?), and this doesn't seem terrible overall?
  • You want to write a build plan today and automatically add more host types later without re-authorizing (you do Linux builds today, but know you'll do Windows builds in the future -- or use only EC2 today, but know you'll use a mixture of EC2 and GCE in the future). We could write a ProxyBlueprint which just acts as a collection of other blueprints of the same type.
  • You want to change where merges happen for a lot of repositories, and you have like 300 repositories, and merge config is still per-repository. This one seems somewhat legitimate (more reasonable to have 300 repos than 300 build plans) and a bit annoying. Maybe merge config shouldn't be per-repository (e.g., default application-level merge config)? Maybe CLI tools? Maybe batch edit? Seems tractable and rare.

Some of these are a little cumbersome but none seem unmanageably bad to me.

@hach-que, are there cases I'm missing where double approval would pose an unreasonably high maintenance burden in your use cases?

(This mechanism would not replace a future ability to say "give me a host with attributes X, Y, and Z", it would just control which blueprints are eligible to be used to satisfy the request.)
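
To make that parenthetical concrete, a sketch of how authorization might compose with attribute matching (hypothetical; it reuses is_authorized() from the sketch above and assumes blueprints expose a phid and an attribute set):

```
# Authorization only narrows the candidate set; attribute matching
# ("give me a host with attributes X, Y, and Z") still applies after.

def eligible_blueprints(plan_phid, blueprints, authorizations, required_attrs):
    approved = {
        a.blueprint_phid
        for a in authorizations
        if a.object_phid == plan_phid and is_authorized(a)
    }
    return [
        b for b in blueprints
        if b.phid in approved and required_attrs <= b.attributes
    ]
```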

Event Timeline


@hach-que, are there cases I'm missing where double approval would pose an unreasonably high maintenance burden in your use cases?

We've got about 40 build plans at work, even with parameterization? I have 77 build plans with parameterization on my personal instance. Editing the Lease Drydock build steps on all of those plans is a significant amount of maintenance.

I'd much rather have some sort of "Blueprint Group" which Harbormaster build steps can opt into. Then we can just change what blueprints are in what blueprint groups rather than having to reconfigure all of the build plans.

Can you walk me through why you have so many build plans? I understand that you might need one per, say, technology stack or team, but I don't understand why you'd ever have 100 unless you had a very large number of teams/projects, in which case I'd imagine it would be very rare to want to reconfigure any meaningful fraction of them at once.

I'd much rather have some sort of "Blueprint Group" which Harbormaster build steps can opt into. Then we can just change what blueprints are in what blueprint groups rather than having to reconfigure all of the build plans.

Yeah, this is the "ProxyBlueprint", although "BlueprintGroup" is probably a better name.
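
As a rough sketch of the idea (a hypothetical class, not real Drydock code): the group is authorized once, and forwards allocation to whichever member can currently serve the lease.

```
# Hypothetical BlueprintGroup: plans authorize the group; ops changes
# the membership without touching any build plan.

class BlueprintGroup:
    def __init__(self, members):
        self.members = list(members)  # blueprints of the same type

    def can_allocate(self, spec):
        return any(m.can_allocate(spec) for m in self.members)

    def allocate(self, spec):
        for member in self.members:
            if member.can_allocate(spec):
                return member.allocate(spec)
        raise RuntimeError("no member blueprint can satisfy this lease")
```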

So here's like a snapshot of a few build plans:

pasted_file (821×501 px, 112 KB)

A few of these (All Platforms) are build plans that group other build plans (they call "Run and Wait for Build Plan"), but most of them have Lease Working Copy in them.

Each platform requires different build steps:

  • Mac hosts build with "/Applications/Xamarin Studio.app/Contents/MacOS/mdtool" build Protogame.<Platform>.sln.
  • Windows hosts build with C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe /m.
  • Linux hosts build with xbuild.

The publish variants of those platforms perform additional build steps, including packing and publishing the artifacts. Those commands are also host- and target-platform-specific (e.g., mono Protobuild.exe --pack . Linux.tar.lzma Linux ${?publish.filter} and mono Protobuild.exe --push /srv/api_key.txt Linux.tar.lzma ${publish.url} ${buildable.commit} Linux ${publish.branch}).

The Protogame build plans there are very similar, except that they also run tests before they publish. There are also build plans for the "templates" that people can use to start a new Protogame project.

Here's another snapshot:

pasted_file (376×353 px, 42 KB)

Again, these all build differently for the different platform targets.

If you were telling a human how to do the MonoGame builds, you'd pretty much point them at a wiki document with like 10 different sections, each of which had different steps and build commands on it and discussion about platform specifics? And they'd go through 10 similar-looking but ultimately separate build processes?

Most humans would load up MonoGame in their installed IDE, and wouldn't need to worry about the command-line tools to do that build. For humans it's "Protobuild.exe --generate" (it assumes current host platform, but users can pass "iOS" as an argument to target mobile platforms for example), and then opening the .sln file in whatever the default IDE is.

So both the ".sln" file and Harbormaster have essentially separate copies of the same platform-to-build-steps ruleset? Why isn't it "Protobuild.exe --generate" for Harbormaster, too?

Harbormaster also does Protobuild.exe --generate, but it has to run a different command line build tool after that depending on the host platform. Normally the graphical IDE just hides away that command invocation from a regular user.

The publish steps also have platform-specific commands that a user won't run, because users don't publish the resulting builds to the package repository (they don't have access for a start).

Like if I wanted to build MonoGame on Windows as a user, from the command-line (which you would almost never do, you'd use the IDE), I would run the same steps as Harbormaster.

Broadly, since I'd expect humans to sometimes run all the things that Harbormaster runs (to test, debug, profile, extend, etc., them), I'm assuming that this per-platform complexity should usually be pushed into projects so you can give humans one build process instead of separate per-platform build processes.

That is, you could theoretically choose to put this complexity in various places: in Harbormaster, in an external build system like Jenkins, or in the project itself. Putting it in the project itself seems most desirable to me: it gives humans a single command to run and it puts all the build logic in one place under version control. It also gives you one encoding of the ruleset, instead of one in Harbormaster plus one on a wiki, or one in Jenkins plus one in an XML file or whatever, or one unobservable implicit one in your IDE plus an attempt to do all the same stuff elsewhere. I would also expect explicitly encoding the rules in the project to reduce the number of false-negative build failures, by preventing version mismatches (build configuration vs. project state) and DRY issues (slight differences in how the different copies of the build rules are encoded, or forgetting to update the wiki when you update Harbormaster).

Maybe this assumption is mistaken in practice, or it's desirable in theory but rarely achieved in practice. I know at least two other projects have a high apparent level of build system complexity (T9352, T9478). I don't understand these cases in detail, but it's possible that many installs will realistically want to encode a very high-complexity ruleset into Harbormaster.

Providing some level of reassurance, I think both Travis and CircleCI have relatively little ability to express complexity in their build configuration, but are widely used. This suggests that projects with little build system complexity exist and that low-complexity build tools are sufficient to accomplish meaningful work.

Anyway, even with 100 build plans I think this is probably manageable with a BlueprintGroup / BlueprintProxy sort of thing, and those should be reasonably straightforward to implement and aren't problematic from a policy/permission perspective.

Also, if we'd had this from day one, even if you didn't have blueprint groups, those 77 build plans would all point at the right blueprint and it would be easy to have a CLI tool say "replace all authorizations for HostBlueprint X with authorizations for BlueprintGroup Y".
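
A sketch of that CLI tool's core (hypothetical; no such tool exists today, which is the point above), using the Authorization records sketched in the description:

```
def migrate_authorizations(authorizations, host_blueprint_phid, group_phid):
    # Repoint every authorization for HostBlueprint X at BlueprintGroup Y.
    moved = 0
    for auth in authorizations:
        if auth.blueprint_phid == host_blueprint_phid:
            auth.blueprint_phid = group_phid
            moved += 1
    return moved
```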

I think largely the issue with that is that in this case the build system isn't controlled by me or open source. On Windows, the Microsoft-controlled MSBuild tool is used to build code, and on Linux / Mac the open source tools are used to build code. We can't replace those tools with something universal without also sacrificing the ability to open projects in the IDE. Protobuild (the project generator) goes a long way towards generating different project formats for different platforms, but it doesn't abstract the different build tool paths or command invocations.

Yeah, that makes sense. Neither Travis nor CircleCI appears to support Windows at all, nor does Codeship, nor did ship.io before shutting down. And all of them just build off GitHub. So I think these are all very low-complexity build systems from the Harbormaster / Drydock / Jenkins / Bamboo perspective, but it makes sense that a lot of consumer web software doesn't need more than "build master off GitHub on Linux".

Just a question from an observer of this conversation; @hach-que - wouldn't it be possible for Protobuild itself to use a script written in a widely available scripting language, like PHP/Ruby/Python/Bash, which does the platform detection, thus allowing it to have pretty much the same build command for all platforms? Or perhaps even wrapping it in Make and requiring MinGW/Cygwin on Windows? This is touching on what Google is doing with Bazel and what Facebook is doing with Buck, but taking it even further and having the same tool for all teams and technologies as well.

We're sort of looking at the same problems, sans Windows, and for us it does seem more attractive to have the build logic encoded, as @epriestley says, in the repository itself in a sort of wrapper script. Though, as you mention, each IDE on each platform would still be using a different build than the scripted one, but that seems to be something we've never been able to escape. IDEs add a lot of funky command-line parameters to the build commands, and even tools like Jenkins add stuff to the Maven command that you don't ask for when building Java. (Another reason we like Harbormaster.)
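
To illustrate the wrapper being suggested above, here is a hypothetical script using the platform build commands quoted earlier in this thread (this is the observer's proposal sketched out, not something Protobuild actually does):

```
#!/usr/bin/env python
# Detect the host platform and dispatch to its native build tool, so
# humans and Harbormaster share a single entry point.

import platform
import subprocess
import sys

COMMANDS = {
    "Darwin": ["/Applications/Xamarin Studio.app/Contents/MacOS/mdtool",
               "build"],
    "Windows": [r"C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe",
                "/m"],
    "Linux": ["xbuild"],
}

def main():
    command = COMMANDS.get(platform.system())
    if command is None:
        sys.exit("unsupported platform: " + platform.system())
    # Forward any extra arguments (e.g., the solution file) unchanged.
    sys.exit(subprocess.call(command + sys.argv[1:]))

if __name__ == "__main__":
    main()
```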

Protobuild is written in C#; it could abstract the build command to call, but I've avoided doing that because I don't want to hide this invocation. Too many build systems wrap commands which then wrap other commands and I always find it painful to pull them apart again to find out what's going on. So instead, Protobuild just generates the input solution files which can then be opened in any IDE or any build tool on the given platform.

After giving this some more thought, the "I have to update lots of build plans" issue might be negated when Harbormaster supports embedding build plans (not just calling them). When that happens I could define a Lease Windows Host build plan with a Lease Host step with the right configuration, and then just embed that Lease Windows Host build plan where needed.

Also, I have been running into the "build ran on the wrong machine" issue more often lately, because the custom attributes support I have is only a "does this blueprint have all my requirements" check; older build plans that don't have a requirement configured will match any blueprint, which isn't what I want. So I think some kind of whitelist mechanism for leasing hosts from the Harbormaster side is desirable; we just need to make sure it doesn't become a maintenance burden (and I think embedding build plans would solve this, among other things).
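
To illustrate the hole (a hypothetical Python sketch): a pure "has all my requirements" subset check means a plan with no configured requirements matches every blueprint, while a whitelist closes it.

```
def matches(plan_requirements: set, blueprint_attributes: set) -> bool:
    # An empty requirement set is a subset of everything, so an
    # unconfigured plan matches ANY blueprint -- the bug described above.
    return plan_requirements <= blueprint_attributes

def may_lease(plan, blueprint, authorized_pairs) -> bool:
    # Adding the whitelist (the authorization model in this task).
    return ((plan.phid, blueprint.phid) in authorized_pairs
            and matches(plan.requirements, blueprint.attributes))
```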

Yeah, this is still feeling pretty reasonable to me after thinking about it for a day or two.

It will add a bit more work upfront, but I don't think we're backing ourselves into a corner: depending on what we hit, we can make Harbormaster more flexible (as you describe), Drydock more flexible (with BlueprintGroups) and/or these authorizations themselves more flexible (I think this is least promising, but it's technically fine to support automatic authorizations or authorization rules like "any operation of types X or Y can use this blueprint without explicit approval" -- just potentially a lot of complexity to manage in the UI). This feels like a lot of room to find good attacks on problems as they arise.
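
For instance, such an authorization rule might look roughly like this (a hypothetical sketch; as noted above, the cost is mostly UI complexity):

```
def blueprint_approves(blueprint, request) -> bool:
    # Auto-approve whole classes of operations ("any operation of types
    # X or Y"), falling back to explicit per-object approvals.
    if request.operation_type in blueprint.auto_approved_types:
        return True
    return (request.object_phid, blueprint.phid) in blueprint.approved_pairs
```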

I think it would even be OK to let you migrate authorizations between blueprints as long as you can edit both blueprints. That is, suppose:

  • User A creates Build Plan B and says that it should use Blueprint C to obtain leases.
  • User A can't directly edit Blueprint C: it's maintained by someone else (say, it is "buildpool1.mycompany.com" and maintained by the Ops team).
  • Ops approves the plan to run on the tier after verifying that the use case is suitable.
  • Some time later, Ops wants to split the pool apart.

I think it's OK to let them move the approval from "Blueprint C" to "Blueprint D" as long as they can edit both blueprints. In the general case, they can already change "Blueprint C" to have a different effect (or break it, or disable it), and "User A" has already trusted them to take care of it. An attacker who compromises an Ops account can't (as far as I can imagine) do anything by moving authorizations that they can't already do by editing blueprints: there's no obvious difference between moving the authorization from "Blueprint C" to "Blueprint D" vs editing "Blueprint C" to just use the same Almanac host pool that "Blueprint D" does.
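
A sketch of the corresponding guard (hypothetical):

```
def can_move_authorization(actor, source_blueprint, dest_blueprint) -> bool:
    # Moving an approval is allowed only when the actor could already
    # edit both blueprints, since editing either is at least as powerful
    # as moving the authorization between them.
    return actor.can_edit(source_blueprint) and actor.can_edit(dest_blueprint)
```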

I'd want to wait to actually build this until we hit a good use case for it, but it seems viable to me at first glance and makes me more confident that we have enough room to work with to find solutions to problems as they arise.

The biggest technical issue is that I'm not sure how to write a v0 of this control that isn't either like 99% unusable or 99% copy/paste, but I think the "Authorizations" side in Drydock is straightforward, so I'll probably start there.

I'm broadly pretty happy with this after using it for a bit, at least for use cases we have today. It does feel a little bit labor intensive, but not overwhelmingly so, and all the stuff is pretty well linked up so once you get the hang of it you're almost always only a click away from wherever you want to be to do the next thing. It's also reassuring to be able to go to a blueprint and check what might be using it at a glance.

I think the changes I still want to make before closing this are:

  • Harbormaster build steps don't have a "view" screen, so there's no visual warning that you have unapproved authorizations when creating a build plan. I'd like to either give them a view screen (leaning this way), or give them a separate callout, or do both.
  • Putting a stronger callout on the element itself when requests need approval is probably desirable, particularly for onboarding.

Then I expect to make these changes eventually, but not in v1:

  • Write some kind of BlueprintGroup.
  • Generally expand tools for dealing with a lot of blueprints and objects that use blueprints (e.g., CLI tools or "move authorizations").

Can I make it look better, or something?

It's just this thing, basically:

Screen Shot 2015-10-14 at 12.48.48 PM.png (69×374 px, 10 KB)

Sometimes it looks like this:

Use Blueprints: (!) Blueprint 1: Secure Build Hosts

...with a purple icon, which means "You need to go authorize this blueprint." You get a little hint on mouseover, but it might be nice to give it an explicit callout:

Use Blueprints: /!\ Some of these blueprints are not authorized yet!
                (!) Blueprint 1: Secure Build Hosts

...just to make it harder to miss. That's the extent of my grand ambition here, though.

I think the only part of this that's particularly designable is that little "Operations in Progress" header UI when a revision is landing, but I think I need to get the actual information populating correctly first and then we can figure out what it should look like. I suspect I can get 90% of the way there with a standard ObjectListView or something like that, though.

FYI, I ended up implementing Protobuild.exe --build and a bunch of other functionality (like support for build scripts), which negated the need for separate build plans for each platform.