
Release Server / Workflow app / Future of Releeph
Open, Needs Triage, Public

Assigned To
None
Authored By
avivey
Oct 7 2015, 10:19 PM

Description

This text is written after much discussion in this task, so some terms may have changed.

We're looking for a Release system, and hope to fit it into Phabricator.

We have 3-4 different Release Flows, and the whole process is manual. The "Apps" release flow has about 40 Android apps.

Primary use-cases for Release Server:

  • Codify and trace release procedures.
  • Answer: When did Commit X hit production / some deployment group? Is commit Y in production now?
  • Make Release Notes easy to build (List changes between Releases)
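
For reference, the second question is roughly what we answer by hand with git today; the release tool would codify and record it. A minimal sketch, assuming a local clone whose production ref tracks whatever is deployed:

```
# Minimal sketch: is commit Y already contained in what "production" points at?
# Assumes a local clone with an up-to-date "origin/production" ref.
import subprocess

def is_in_production(commit, production_ref="origin/production"):
    # `git merge-base --is-ancestor A B` exits 0 iff A is an ancestor of B.
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, production_ref],
        capture_output=True)
    return result.returncode == 0

print(is_in_production("abc1234"))
```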

Example Release Flows:

Pull Requests: I love GitHub and hate Phabricator, but I work at a company that has forced me to use Phabricator. I curse the Phabricator upstream daily. I work on a team (the "Backend Server" team) where everyone rightly feels the same way I do. We only want to use Pull Requests. These are the One True Way to build software.

  • I create a Product called "Backend Server".
  • I create one Release called "master".
  • Everyone pushes their local feature branches into the remote, then makes a "Release Change Request" (aka "Pull Request") to have their changes pulled to master.
  • When a request is accepted, Phabricator automatically merges it to master. Just like GitHub!
  • We reconfigure the header to say "GitHub (Bad Heathen Version)". For a time, there is peace.
  • After a while we add a "Run Tests" step to the Product, triggered "When a merge is requested". This is better than triggering on commits being pushed because we love pushing our local code into the remote in many different states of brokenness, retaining checkpoint commits, etc. But this is acceptable and we stop merging stuff that breaks the build.
  • A while later we add a "Deploy to Production" step to the product, triggered "When Deploy is Clicked".
  • Eventually we move that to "When Release is Updated" so that the thing deploys after every pull request is merged.

Facebook-Style Cherry-Picks / Phabricator-Style Stable / Backporting: I am the Phabricator upstream and have a long-lived stable branch.

  • I create a "Phabricator" Product, and an "Arcanist" product and a "libphutil" product.
  • During the week, after fixing a bad bug I merge it into "stable" using the release tool ("Release Change Request" for a single commit). This helps keep track of what got merged.
  • Maybe "stable" is a single Release or maybe we cut a new one every week. Probably the former at first and then the latter once we get bigger.
  • Every time "stable" gets updated, Harbormaster starts rolling it across all servers

Binary Release: My Release is sent to users in a box via the Post [Installed on my Enterprise servers]

  • I create a new numbered Release every Wednesday, cut from master.
  • Harbormaster compiles the whole thing as "3.11 RC1" and installs it on some Test environment
  • QA runs some tests, finds some bugs
  • I make a new "Release Change Request" to cherry-pick single commits to fix the bugs. Rebuild and deploy.
  • I Freeze the Release, not allowing any more changes. If we find more issues, I create a new Release.
  • HM Builds "3.11 Final"
  • HM asks for QA to sign off the release, and then automagically sends it to the movable-press-company [Starts rolling servers].

Details / Plans

We'll be taking some elements from Pipelines (https://github.com/PageUpPeopleOrg/phabricator-pipeline), and mostly use Releeph and Harbormaster.

  • A "Release" is the object currently known as Releeph Branch. We'll rename it, augment it with some more information, and maybe detach it from vcs branches.
    • A Release is a HM Buildable
  • "Release Plan"/"Product Line" is the template for a Release. It will define:
    • Several HM Build plans, to run on different occasions (Similar to Pipeline): Triggers for Cut, New Change, Update, Deploy, etc.
    • Maybe instructions about "how to cut", "how to version"
    • In my example, all 40 apps will use a single "Product Line" template for their releases
  • A Release may have several Artifacts - these will be either Files or Phragments or HM Artifacts.
  • A "Workflow" is just a Harbormaster Build.
    • We'll add a "Wait For User Approval" step, to allow tracking manual steps (Probably end up using Quorum T9515).
  • Release Change Request object is essentially Releeph Pull Requests, but with some pending changes.
  • Probably rename "Releeph" to something else, or maybe write a new thing (Depending on Releeph's code state).
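
To make the relationships above a little more concrete, here is a rough data-model sketch. The class and field names are hypothetical illustrations only, not actual Phabricator or Releeph classes:

```
# Hypothetical sketch of the objects described above; none of these names
# correspond to real Phabricator classes.
from dataclasses import dataclass, field

@dataclass
class ReleasePlan:             # aka "Product Line": the template for a Release
    name: str                  # e.g. "Android Apps"
    repository: str            # parameterizable, so 40 apps can share one plan
    cut_instructions: str      # "how to cut", "how to version"
    build_plans: dict = field(default_factory=dict)
    # e.g. {"on_cut": "plan-1", "on_change": "plan-2", "on_deploy": "plan-3"}

@dataclass
class Release:                 # roughly today's "Releeph Branch", renamed
    plan: ReleasePlan
    version: str               # e.g. "3.11 RC1"
    cut_commit: str            # may eventually be detached from vcs branches
    artifacts: list = field(default_factory=list)  # Files / Phragments / HM Artifacts
    frozen: bool = False       # no further Release Change Requests once frozen
```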

Revisions and Commits

rP Phabricator: D21792 (Needs Review)

Event Timeline


Is there an unavoidable cost to keeping QA environments up for multiple days?

We had a feature like that at Facebook ("Sandcastle") but the cost was negligible (approximately 1GB of code on disk, except that it used hardlinking to reduce that to nearly nothing) so it just spun them up automatically and kept them around for a long time. I'd imagine this is desirable in general and I presume it is usually realistic, although it sounds like that's not how things work now. Is there a straightforward technical path available there which is maybe just not worthwhile at the current scale, or are there factors which make it very difficult?

Sorry, I don't quite follow... (It's early and I'm still somewhat jetlagged... I'm in Vegas for AWS re:Invent at the moment).

Haha, no problem. I just mean:

Currently we have developers do this manually, which means that the instances sit dormant for hours to days until the QA person is able to test it.

Why is it a problem that the instance sits dormant for days? That is, I'd expect that a QA environment should not need a whole machine to itself, so one VM can hold thousands of QA environments and the cost per environment should be very small (a few cents per day?). Is there a strong technical reason that it's impossible to put 1K+ QA environments on a single host?

The cost is probably small at the moment because we are only talking about an application server, but in the future I'd expect a QA environment to consist of:

  1. Application server
  2. Database
  3. Redis
  4. Memcache
  5. Elasticsearch
  6. RabbitMQ
  7. Various internal microservices

It's possible that many of these pieces won't exist in the QA environment, or will exist as shared services, but I expect the cost to add up quickly.

Unfortunately, I don't think it's easy to just run thousands of environments on a single host. In the general case, the application assumes that it isn't sharing resources with any other host. I expect that bad things might happen if we tried to deploy multiple instances side by side. It's also not possible in the general case because a QA environment might require changes to multiple repositories (there may be a dependent diff which changes the puppet code, for example).

Generally though, I'd rather keep these environments close to production where possible and, as such, I'd prefer to deploy a greater number of smaller instances than larger shared instances.

My concern is not primarily with cost but rather with process. The current process requires developers to perform build steps on their local machine. Our build process isn't overly complex, but I'd rather that staging environments are deployed with the same artifacts (or similar) that would be deployed to production rather than relying on developers. This would also help enforce that what is being tested is exactly what is being reviewed (there are no local changes which were made by the developer after submitting the diff).

To clarify on my terminology, my eventual goal is to have the following environments:

  1. Dev
  2. Test / QA
  3. Staging
  4. Production

Each environment progressively approaches production. The Dev environment would essentially be an (ideally local) environment with no external dependencies. The database would be localized and most external services would be disabled or mocked out. The QA / Test environment would have some extra pieces, specifically external services. The staging environment would basically be a replica of production.

@joshuaspence: I have practically the same desire, but I think HM and Drydock are moving to answer this need ("Build me a complex environment based on this diff").
I'm not sure how this fits into the "Release" and "Workflow" use-cases though?

Well, it is essentially releasing into a non-production environment. We may then want to promote the artifact from the test environment to production.

I would want to use the same (or a very similar) workflow for deploying across environments.

Releases would be explicitly marked as being non-production.

Keep in mind those test releases will be based off diffs, not commits in that model, so we almost certainly can't re-use any of the built artifacts since they aren't integrated.

In this case I'd almost advocate for a "non-production" and "production" set of environments, so a non-production release can never be accidentally pushed to production.

The name "Pipeline" brings "Data Pipeline" to mind for me, possibly because AWS has a product called "AWS Data Pipeline", although it looks like no one else particularly likes "Releeph" either.

Maybe "Conveyor"?

I'd be OK with "Release" too, but that would somewhat preclude us from having an object within the application called a "Release", and I suspect we might want to rename "Branches" to "Releases".

Maybe "Culvert", although that's sort of an ugly, odd word.

maybe "Produce"?

("Pipeline" and "Conveyor" both sound like ETL to me. didn't get "Culvert" and "Chuckr").

A culvert is just a big drainpipe.

chuckr is Chuck Rossi, the Facebook release engineering lead of fame, legend and renown.

As an engineer at Facebook, if you saw "chuckr mentioned you in IRC", pants were ruined.

So, here's what I understand:

  • Harbormaster will learn the "Wait for user approval" step. Eventually it will use the Quorum UI (Assuming it's coming), so a quick fix can be based on Policy. This should handle more-or-less all the use-cases of "workflow", so we don't need that name any more.
    • This will also answer "Deploy Workflow", at least for now.
  • Release object will be mostly the existing "Releeph Branch" object, with some UI parts from Pipeline.
    • Release will be as immutable as possible
    • Release will be Buildable, and based on either a Commit or a Diff (Revisions are mutable)
    • Release will have (a single?) HM Build
    • Release will have "artifacts" (HM Artifacts? Files? Phragmants? TBD)
  • We'll need a Release Plan object, similar to how Build Plan relates to a Build
    • It will hold some parameters, to allow me to release 40 apps in one go.
    • It will reference the Build Plan for the Builds
  • Releeph might be renamed or re-written, depending on how brave we are.

ps: I obviously can't vote against "chuckr".

Artifacts: Either as Files or Phragments, they can be:

  • Attached to a Release with some flag / slot (A Release expects some specific artifacts)
  • Trigger a build for "validation" (if uploaded from outside the system)
  • We can feed them via the HM Build, as in "expect the file to be somewhere, then create an HM Artifact". This sounds a bit convoluted.

Release will be as immutable as possible

I still want to support a pretty-much-exactly-like-GitHub pull request workflow (where the release is mutable and never closes, e.g. master) and a Facebook-style mutable merge model (where the release is mutable and closes after a period of time, e.g. production-20151101) in this tool, so I'd expect Releases to retain full mutability, just not actually be mutated in your environment.

Release will have (a single?) HM Build

I'd expect there to potentially be a bunch of builds triggered per release in the long run.

with some UI parts from Pipeline.
We'll need a Release Plan object, similar to how Build Plan relates to a Build

Minor technical distinction, but I'd expect Products to pick up the "run builds" parts of Pipeline, rather than Branches/Releases directly. So you'd configure a Product like:

  • When a new Release is cut or updated, run plans: [build artifacts]
  • If this is a mutable release, when a merge is requested, run plans: [(none)]
  • When (hand waving here) Ops clicks the "Deploy" button, run plans: [deploy to staging]
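
(To illustrate the shape of that configuration, here is a minimal sketch; the event names and plan identifiers are made up, not real Harbormaster triggers:)

```
# Hypothetical Product configuration: which build plans run on which events.
PRODUCT_CONFIG = {
    "Backend Server": {
        "on_release_cut_or_update": ["build-artifacts"],
        "on_merge_requested": [],                # only meaningful for mutable releases
        "on_deploy_clicked": ["deploy-to-staging"],
    },
}

def plans_for_event(product, event):
    """Return the build plans a Product wants to run for a given trigger."""
    return PRODUCT_CONFIG.get(product, {}).get(event, [])

print(plans_for_event("Backend Server", "on_deploy_clicked"))  # ['deploy-to-staging']
```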

So the "Release Plan" would probably just be a batch way to say "Click the 'deploy' button on these 40 releases"?

I think they need more than one "deploy" button, and maybe we need to introduce the idea of a "Target Environment" or something, so the actual button is "Click the 'deploy to staging' button on all these releases", and then the next screen says "3 of those releases have no way to deploy to staging, deploy the other 37?".

I'm thinking of "Release Plan" as a more generic Product:

  • Instructions on how to cut
  • Instructions on which HM build(s) to run when
  • List the expected artifacts

For my 40 apps, I'd like to have a single Release Plan, which is parameterized over "name" and "repository", and somehow invent "version name/number" and "cut commit hash". I can do 40 calls to "make Release from Release Plan", but I don't want to have 40 copies of essentially the same "Use master and run build X" information.
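
(Roughly, the batch case would look like the sketch below; `cut_release` is a made-up stand-in for whatever API or script actually creates a Release from a Release Plan:)

```
# Hypothetical batch cut: one parameterized Release Plan, forty Releases.
APPS = [
    ("app-alpha", "rALPHA"),
    ("app-beta", "rBETA"),
    # ... the other 38 apps
]

def cut_release(plan, name, repository, version):
    """Stand-in for the real "make Release from Release Plan" call."""
    print(f"cutting {name} {version} from {repository} using plan {plan}")

for name, repository in APPS:
    cut_release("android-apps", name, repository, version="2015.41")
```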

The "sort-of-like-github-pr" flow of master is essentially a hook for master being updated? Or more "pull request" where "deploy" means "merge to master and close this"?

Specifically, here are three workflows with different levels of mutability that I'd like to support:

Pull Requests: I love GitHub and hate Phabricator, but I work at a company that has forced me to use Phabricator. I curse the Phabricator upstream daily. I work on a team (the "Backend Server" team) where everyone rightly feels the same way I do. We only want to use Pull Requests. These are the One True Way to build software.

  • I create a Product called "Backend Server".
  • I create one Release called "master".
  • Everyone pushes their local feature branches into the remote, then makes Pull Requests to have their changes pulled to master.
  • When a request is accepted, Phabricator automatically merges it to master. Just like GitHub!
  • We reconfigure the header to say "GitHub (Bad Heathen Version)". For a time, there is peace.
  • After a while we add a "Run Tests" step to the Product, triggered "When a merge is requested". This is better than triggering on commits being pushed because we love pushing our local code into the remote in many different states of brokenness, retaining checkpoint commits, etc. But this is acceptable and we stop merging stuff that breaks the build.
  • A while later we add a "Deploy to Production" step to the product, triggered "When Deploy is Clicked".
  • Eventually we move that to "When Release is Updated" so that the thing deploys after every pull request is merged.

Facebook-Style Cherry-Picks / Phabricator-Style Stable / Backporting: I am the Phabricator upstream and have a long-lived stable branch.

  • I create a "Phabricator" Product, and an "Arcanist" product and a "libphutil" product.
  • During the week, after fixing a bad bug I merge it into "stable" using the release tool. This helps keep track of what got merged.
  • Maybe "stable" is a single Release or maybe we cut a new one every week. Probably the former at first and then the latter once we get bigger.

Binary/Build-style Releases: Releases are versioned, build real binaries, and are immutable. All the same stuff above, except there are never any pull requests or "on pull request" actions. Maybe there's just an option to disable them in the Product.

"Pull Requests" are the existing "Pull Requests" in Releeph. They're literally just pull requests. Releeph today is like 90% about implementing pull requests and then 10% about surfacing pertinent details about those requests prominently so chuckr and peers can bulk process hundreds of them per day.

(The pull requests are just useless outside of the Facebook workflow because they can't merge and the "you can do hundreds of them really quickly" aspect isn't useful at less-than-Facebook scales.)

Specifically, you:

  • Go to a Branch/Release page.
  • Click the "New Pull Request" button.
  • That goes into the queue for the Branch/Release.
  • Whoever owns the Branch/Release can approve/reject it.
  • That's the end of the workflow today since Harbormaster didn't exist and none of T182 was planned. Last I knew, Facebook completed the rest of the workflow with custom arc do-a-bunch-of-git-stuff extensions.

I think this is totally compatible with immutable Branch/Releases, we just might need a way to hide/disable the workflow and hide/disable any configuration options that are specific to it ("On Pull Request", etc).

"Binary" style releases might actually not be as immutable as I'm hoping; A Release Candidate might start it's life as a cherry-pick style release, and then be frozen at some point. If we're building binaries each time a new cherry-pick is picked (To test in Staging, e.g.), we might call them all "3.11 RC3", and when finalizing, build a new one as "3.11".

If we find a bug after freezing, I'd like to think that we'll start on a "3.12 RC1".

This is setup-specific, so "frozen" might just be a state on the Release object (And HM plan will be allowed to "Freeze Release"?)

Yeah, that seems reasonable to me. You can already "Close Branch" today which is effectively the same action as "Freeze Release". I'm broadly comfortable with moving Harbormaster in the direction of having richer application awareness and interactions, although we'll have to think a bit about what happens when you "Freeze Release" in a build plan and run it on a commit (does it fail? get skipped? configurable?).

For the "40 applications with similar plans" case, I don't think all the Releases under a Product necessarily need to have the same repository, but then you're still looking at some sort of API/script action to do the actual creation of releases (a Product could be more like a "Product Line" in that case). But maybe that's fine, at least for now.

For example, I think Phabricator, Arcanist and libphutil probably have identical Product rules except for which repository they come from, so there might be a use case for that even in the upstream.

@hach-que - I'm getting ready to publicize this task (after updating the description).

avivey renamed this task from "RFC: Release Server / Workflow app" to "Release Server / Workflow app / Future of Releeph". Oct 12 2015, 8:20 PM
avivey removed avivey as the assignee of this task.
avivey updated the task description. (Show Details)
avivey added a project: Harbormaster.

Not sure if any of this is relevant to T8297, but everyone loves walls of text!

avivey changed the visibility from "Subscribers" to "Public (No Login Required)". Oct 13 2015, 3:53 PM
avivey added a project: Restricted Project. Dec 23 2015, 1:15 AM

FWIW, at the WMF we make releases that involve hundreds of repositories: one for mediawiki, plus one for each mediawiki extension that we host. I don't think it is an ideal situation, and it causes me all kinds of grief, but it's the current state of affairs. So there is at least one potential use case for a release that encompasses a snapshot of multiple repos.

So I was thinking about software components recently and one of the issues I've had with both Jenkins and Harbormaster is that builds of one repository aren't aware of the builds from another repository. This is a common scenario in the software I build:

  • Repository A is some software in source form
    • Each commit of repository A gets built and published into an external package repository. The package is literally versioned by the Git commit hash, and the external package repository also has a copy of the branch pointers.
  • Repository B is some software in source form
    • It depends on https://packagerepo/SoftwareA or w/e, and it tracks the master branch of that software.
    • Each commit of repository B gets built and published into an external package repository, etc.

The problem is when someone makes changes to both A and B, and commits them one after another. In this scenario, one of two things happens:

  • Repository B incorrectly uses an older binary version because Repository A hasn't finished building yet, or
  • Repository B clones the source of Repository A and builds it from source form because the binary isn't ready yet

Both of these options are undesirable, and I'd rather have repository B wait until the dependency from A's build is available.
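
(A sketch of the desired behaviour, assuming the package repository serves each version at a predictable URL; the URL layout and helper names here are invented:)

```
# Hypothetical "wait for dependency" step in repository B's build: block until
# repository A's package for the pinned commit shows up in the external package
# repository, rather than falling back to an older binary or building A from source.
import time
import urllib.request

def package_exists(package_url, version):
    # Assumption: the repo serves e.g. https://packagerepo/SoftwareA/<git-hash>
    try:
        urllib.request.urlopen(f"{package_url}/{version}", timeout=10)
        return True
    except Exception:
        return False

def wait_for_dependency(package_url, version, timeout_s=1800, poll_s=30):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if package_exists(package_url, version):
            return True
        time.sleep(poll_s)
    return False  # fail loudly rather than use a stale binary
```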

I thought of some sort of componentized-build server that layers on top of Harbormaster. Instead of building arbitrary buildables, however, you set up "components" in this system.

Each component tracks one or more repositories and creates a Harbormaster buildable when one of those repositories changes. Components also have identifier URIs. In addition, Harbormaster build steps can push dependencies back to the component, to make it aware of other related components and dependencies.

So in this case you'd have component A tracking repository A and component B tracking repository B. When something from B starts building, it has something like a "Scan Dependencies" build step which looks at the contents of the working copy and picks up dependencies from the package management files there (we'll need to make this extensible to support different package formats or something? maybe we can just make it run a command with the expectation that the command returns a JSON blob?). So this step from component B would post back something like "component A's package URL at version XYZ" or "component A's Git source URL at version XYZ". Then the "Scan Dependencies" step would wait until there's a Component A built with that version, or wait until Component A's build stabilizes (in the scenario where it's tracking a mutable pointer like master). In order to resolve the ambiguity around "is master up-to-date according to Phabricator", we can make the stabilization check request that the repository is updated now (in the case of imported repositories), and wait until all commits are imported.

As per the other things in this task, you'd be able to have multiple phases / stages (manual or automatic) that trigger separate Harbormaster builds, and you'd just flag one of these stages as "the component version is now considered published". We could extend this and have like "the component version is now considered published in XYZ environment" or something, and then "Scan Dependencies" could wait until the component is available in a certain environment too? That would allow us to have like "this package is available in the NuGet repository" or "this AMI is now available in the AMI registry" as different phases / environments?
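
(If the "Scan Dependencies" command did return a JSON blob, it might look something like this; the format is invented purely for illustration:)

```
# Hypothetical output of a "Scan Dependencies" command run in component B's
# working copy. The format is made up; a real step would need to be extensible
# per package manager.
import json

scan_result = {
    "component": "component-b",
    "dependencies": [
        {
            "component": "component-a",
            "package_url": "https://packagerepo/SoftwareA",
            "source_url": "https://git.example.com/repository-a",
            "version": "1a2b3c4d",   # the Git commit hash the package is versioned by
            "track": "master",       # mutable pointer: wait for the build to stabilize
        }
    ],
}
print(json.dumps(scan_result, indent=2))
```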

I don't know whether this architecture sounds useful to anyone else?

[/end ramblings?]

One thing that happened locally wrt "Product Lines" and 40 apps:

  • RelMan wants to think about the 40 apps going out at the same time as a single thing, even if they have slightly different code changes.

For instance, they want to cut them all at the same time, and deploy them all to the staging environment/prod in one go, etc.
ATM, that means that we have a single Release with lots of repos in it, but that might one day grow into a "Meta-Release"/"Release Bag"/"Train", which is basically a collection of releases that are managed together.

@hach-que what you describe is being used by OpenStack (a cloud management system) for their CI. They wrote an ad hoc tool named Zuul, and the feature you describe matches the description at http://docs.openstack.org/infra/zuul/gating.html. It uses Gerrit (a code review tool by Google for Android) as a source. So you are not alone :-]

Reusing your example, with A being (for example) a library and B depending on it: when you approve diff 1 on repo A and immediately after, diff 2 of repo B, you want to pass both diffs to the buildable of B so it knows about the diff in A that is about to land. So you get:

  1. (test A + diff 1)
  2. (test B + diff 2) with (A + diff 1)

If you want to speed up the process by having the build run in parallel you will need to build (A + diff 1) twice, though in the second build you can probably skip the tests of A.

There are some gotchas:

  1. If the first build fails because (test A + diff 1) has some fault, the second build has to be retriggered/updated to build against A (since diff 1 did not merge).
  2. If the first build passes, it lands. The second build, if run in parallel, will also land, assuming (test B + diff 2) passes its tests.

Depending on your project, when there are a lot of cross-project dependencies and the tests are long, it might be worth parallelizing. Otherwise, throttle and update changes depending on the outcome of the changes ahead of them in the queue.
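
(A toy sketch of the throttled/serialized variant described above; this is a simplification for illustration, not how Zuul actually implements its gating queues:)

```
# Toy model of a gated queue: each change is tested on top of the changes ahead
# of it that have passed so far; a failed change is simply excluded from the
# state that later changes are tested against.
def gate(queue, run_tests):
    merged = []
    for change in queue:
        if run_tests(merged + [change]):  # e.g. test (B + diff 2) with (A + diff 1)
            merged.append(change)         # it lands
    return merged

# Usage: gate([("A", "diff 1"), ("B", "diff 2")], run_tests=lambda changes: True)
```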

avivey added a revision: Restricted Differential Revision. Dec 9 2016, 10:10 PM
avivey removed a project: Abuse.

So I was thinking about software components recently and one of the issues I've had with both Jenkins and Harbormaster is that builds of one repository aren't aware of the builds from another repository. This is a common scenario in the software I build:

This is one reason it makes a lot of sense (at least for some teams / orgs) to use a "monorepo".

I've made some progress in the direction of this task in D16981 and D17020, but I've since then more-or-less lost the external pressure to implement this. I might get around to completing this eventually, but I might not.
The code in those diffs is mostly usable, but it does require some amount of local extensions to be implemented. If anyone is interested in trying it out (Or even taking over the changes), I can instruct you on how to do it locally and what's missing.

Those diffs are based on a local implementation of the full system, so it should be in working order.

@avivey: I'm somewhat interested in this. If you have any tips for getting it working locally, I would like to try it out and see if I can contribute anything towards a finished extension.

To those still interested, Spinnaker (OSS from Netflix + Google) is targeting the Deployment part of this flow (everything after a Release Candidate is created).
It only handles "cloud" deployments, and offers as much complexity as you could ever want, including "Wait for human approval", Canary deployments, blue-green, etc.

It ignores what I'd call the Release part of the flow - picking which code goes into a release; their approach is basically "your master branch should go to prod ASAP", which I guess is what "most" people want these days anyway.