
Allow individual Harbormaster targets to be restarted
Closed, Invalid · Public

Description

This would assist in scenarios where a build plan might take 4-5 hours to complete, and a step towards the end fails. We need the ability to restart an individual step so that we don't have to re-run the first 4 hours of the build.

This really means that in addition to the "failure steps", "success steps", "final steps" we discussed in IRC, we probably also need "start steps"; that is, steps that run when the build is started, and when any build step is restarted. This is where you would place "Allocate Host" build steps, so that when you run the steps that failed, they have the required host resources.
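A rough sketch of how those phases might fit together (the phase names come from the discussion above; the structure and step names are hypothetical, not real Harbormaster configuration):

```
from enum import Enum

class StepPhase(Enum):
    # Hypothetical phases following the discussion above.
    START = "start"      # runs at build start and whenever any step is restarted
    NORMAL = "normal"    # ordinary build steps
    SUCCESS = "success"  # runs only when the build succeeds
    FAILURE = "failure"  # runs only when the build fails
    FINAL = "final"      # always runs at the end

# "Allocate Host" belongs in the START phase, so a restarted step
# always has the host resources it needs.
plan = [
    ("Allocate Host", StepPhase.START),
    ("Run make", StepPhase.NORMAL),
    ("Notify on failure", StepPhase.FAILURE),
    ("Release Host", StepPhase.FINAL),
]
```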

Event Timeline

hach-que raised the priority of this task to Needs Triage.
hach-que updated the task description.
hach-que added a project: Harbormaster.
hach-que added subscribers: hach-que, epriestley.

I don't think we can support this, per se, since there's no way to restore the state of the world to how it was when the target started. Resources may already be freed, etc. In some cases this could be mitigated, but the base case of "every failed target holds its resources forever, just in case someone wants to restart it" seems clearly unworkable.

I think we can accomplish this with sub-builds or otherwise through build steps that trigger additional builds, although if they're sub-builds then resolving result actions might be messy.

Another attack on this would be to build things that let targets cache, so the whole build can be re-run but all the successful targets can just hit their caches and complete quickly. Harbormaster would only need to know enough about this to have some kind of "destroy caches" button or know which steps to skip when running in "really restart everything" mode.
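A rough sketch of what that caching could look like, assuming a cache keyed on the step's configuration and input artifacts (all names here are hypothetical, not real Harbormaster API):

```
import hashlib
import json

def cache_key(step_config: dict, input_artifacts: dict) -> str:
    """Stable key derived from a target's configuration and its inputs."""
    payload = json.dumps({"config": step_config, "inputs": input_artifacts},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_target(step_config, input_artifacts, cache, execute, really_restart=False):
    """Skip a target whose identical run already succeeded, unless the user
    chose "really restart everything" (the "destroy caches" case)."""
    key = cache_key(step_config, input_artifacts)
    if not really_restart and key in cache:
        return cache[key]  # successful target: hit the cache, complete quickly
    result = execute(step_config, input_artifacts)
    if result is not None:  # cache successes only; None means failure here
        cache[key] = result
    return result
```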

Also, a 4-5 hour build seems extraordinarily long. Is this just 200M lines of code that take 5 hours to compile, or a 15-minute compile plus four hours of tests?

I think this is best solved with D9807, where you move the parts of the build you want to be able to restart into a separate build plan, and then you invoke that build plan from the parent. This allows each build plan to encapsulate its own resources and artifacts (like Allocate Host), and we don't have to worry about any resource-holding nonsense like that.

Realistically, our build is more like 10 hours, covering code build, tests, packaging and deployment. There are a number of places where we should be able to parallelise the build, which is one reason for building D9806. There are a few processes we could run in parallel (which Bamboo can't accurately represent in its model), which should shave another few hours off the build time.

I tend to agree, mostly, but I think D9807's global-namespace-on-artifacts thing isn't the best approach.

I'd prefer to see these things expressed as:

  • Build "A"
  • Retention Policy:
    • After a failure, retain artifacts for 24 hours.
  • Steps:
    • Run "make", producing an artifact called "B" that is visible to this build and its children
    • (1.1) Run build X (on artifacts = B)
    • (1.2) Run build Y (on artifacts = B)

This says: build a binary, then run several things with it in parallel, providing them data about it. If this build fails, retain its artifacts for 24 hours before destroying them.

You could then individually restart 1.1 or 1.2 for up to 24 hours. After that, the artifacts might get GC'd and you'd have to restart the whole build.
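To make the shape concrete, that plan might be modeled roughly like this (every field name here is invented for illustration, not real configuration):

```
# Hypothetical encoding of the plan sketched above.
build_a = {
    "name": "Build A",
    "retention_policy": {
        "after_failure_retain_artifacts_hours": 24,
    },
    "steps": [
        {"run": "make", "produces": "B", "visible_to": ["self", "children"]},
        # 1.1 and 1.2 run in parallel; each can be restarted on its own
        # for as long as artifact B is retained.
        {"run_build": "X", "artifacts": ["B"]},
        {"run_build": "Y", "artifacts": ["B"]},
    ],
}
```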

We could probably put a retention policy on target state and actually let you restart targets, but that feels messy and complicated and hard to understand, and probably unnecessary.

This probably solves T5803 too.

My main concern with retention policies on build failures is that it will increase costs when there are Drydock resources being held open by artifact leases on services which cost money (such as EC2).

If the artifact is literally a binary on disk on that server, and you want to be able to restart part of the build from after that binary is built, I think we have no choice?

By having artifacts only exist for the life of the build, you end up designing build processes where any "artifact" that needs to outlive the build is pushed to, and later pulled from, an external storage system such as S3. The storage costs of S3 are much cheaper than keeping an EC2 instance online.
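A minimal sketch of that push/pull pattern using boto3 (the bucket and key names are made up):

```
import boto3

s3 = boto3.client("s3")

# At the end of the producing build: push the artifact to durable storage,
# so no EC2 instance has to stay alive just to hold it.
s3.upload_file("build/output.tar.gz", "my-build-artifacts",
               "builds/1234/output.tar.gz")

# At the start of a dependent (or restarted) build: pull it back down.
s3.download_file("my-build-artifacts", "builds/1234/output.tar.gz",
                 "build/output.tar.gz")
```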

Keep in mind you also can't rely on resources existing until Drydock closes them; systems such as EC2 provide no such guarantees for machines (and with spot instances you should expect this to happen).

Any system designed here needs to be able to gracefully handle the scenario where a resource has been closed before the expiry period; unfortunately for EC2 there's no notification, so you won't know the machine was terminated until you explicitly check its status through the API.
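Concretely, that means polling the instance state yourself before trusting a held resource; a sketch with boto3 (the instance ID is a placeholder):

```
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def instance_is_alive(instance_id: str) -> bool:
    """EC2 sends no termination notification, so check the state explicitly."""
    try:
        resp = ec2.describe_instances(InstanceIds=[instance_id])
    except ClientError:
        return False  # e.g. InvalidInstanceID.NotFound once it has aged out
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            return instance["State"]["Name"] == "running"
    return False

# Before restarting a target against a held resource:
if not instance_is_alive("i-0123456789abcdef0"):
    pass  # resource vanished before the expiry period; re-allocate instead
```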

Realistically, your use case is just way outside what Harbormaster is targeting or what I consider reasonable for a general-purpose build system to have in its core set of capabilities. It is not going to support these sorts of things for a very long time (years and years).

Paying an engineer to wait for a build is about 1000x more expensive than running an on-demand instance. It is difficult for me to imagine that any but a handful of extreme edge case builds actually make economic sense to run on spot instances. For any normal build process, you're saving a few dollars on computing resources and torching a huge pile of cash on salaries. I don't plan to ever accommodate this case gracefully.

EC2 spot instances are only terminated if the market price exceeds the bid price. If your bid price is the price of a normal non-spot EC2 instance, then it's rare to have instances terminated, as the only time that occurs is if someone is paying more for spot instances than non-spot instances, or if someone consumes all of the available EC2 instances of that instance type.

As long as the bid price is higher than the market price, you only pay the market price. This means savings of up to 75%, and frequently at least around 50%.

In the scenario where resources are kept open for 24 hours using normal non-spot instances, we'd be paying $9.264 per resource per day. We normally have around 20 resources open, so that translates to around $185 a day. In the scenario where resources are not kept open and spot instances are used, we'd instead be paying around $3.86 a day for all of those resources (per-hour cost × 20 × 50%). Over a month it would be $5550 vs. $115.

The cost of doing this in terms of engineer time is that when a build fails, you have to wait an additional 5-10 minutes for a new instance to start. At most, this is around $10 of engineering salary for each failed build, and that isn't considering the fact that an engineer can be doing something else while the new resource is being allocated. (They still have to wait for the build itself to complete, since they're restarting it in either scenario, so I don't factor that cost into the comparison here.)
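For reference, the arithmetic behind those figures, using the per-day rate quoted above, a 30-day month, and the ~50% spot discount and roughly one billed hour per resource per day implied by the formula:

```
per_day = 9.264          # on-demand cost per resource per day (quoted above)
per_hour = per_day / 24  # ~= $0.386
resources = 20
spot_discount = 0.5      # spot typically ~50% of the on-demand price

held_open_daily = per_day * resources                        # ~= $185.28/day
spot_on_demand_daily = per_hour * resources * spot_discount  # ~= $3.86/day

print(held_open_daily * 30)       # ~= $5558, quoted as ~$5550/month
print(spot_on_demand_daily * 30)  # ~= $115.80, quoted as ~$115/month
```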

The goals here don't align with upstream, so I've consolidated the general feedback from @epriestley into T10870.

Since the original discussion on this task, we've moved away from dynamically scaling agents to a single i2.2xlarge agent which handles all builds. Thus the design outlined by the upstream here is now compatible with our build infrastructure, and we don't need to worry about Drydock holding open resources that incur a cost.