
"Wait for Previous Commits" should allow you to specify build plans to wait on
Closed, Wontfix (Public)

Description

Currently "Wait for Previous Commits" waits for all builds on a buildable to finish.

In my scenario, I have a set of regression tests which are expected to take hours. This is in comparison to the normal build + functional tests which take around 5 minutes. I want the normal build + functional test to wait for previous commits, because this is the build plan that actually publishes things, but not wait for the regression tests (because that will mean each commit takes hours to build).

So I'd like the ability to set specific build plans on the "Wait for Previous Commits" build step and only have it wait for those build plans to finish on previous buildables before continuing.
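To make the request concrete, here is a minimal sketch of the semantics being asked for. It is not Harbormaster code: the `Build` record, the `must_wait` function, the `wait_for_plans` parameter, and the plan names are all hypothetical and exist only for illustration. Passing no plan filter models the current behavior (wait on every unfinished build of an earlier commit); passing a set of plans models the proposed behavior.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Build:
    """Hypothetical record of one build; not a Harbormaster class."""
    commit_index: int  # position of the commit in history (assumed ordering)
    plan: str          # name of the build plan that produced this build
    finished: bool     # True once the build has completed


def must_wait(
    current_commit: int,
    builds: Iterable[Build],
    wait_for_plans: Optional[Set[str]] = None,
) -> bool:
    """Return True if an unfinished build on an earlier commit blocks us.

    wait_for_plans=None models the current behavior: any unfinished build
    on a previous commit blocks. A set of plan names models the proposed
    behavior: only unfinished builds of those plans block.
    """
    for build in builds:
        if build.commit_index >= current_commit:
            continue  # only earlier commits matter
        if build.finished:
            continue  # finished builds never block
        if wait_for_plans is None or build.plan in wait_for_plans:
            return True
    return False


# Commit 1 has published already, but its regression run is still going.
builds = [
    Build(commit_index=1, plan="build-and-publish", finished=True),
    Build(commit_index=1, plan="regression", finished=False),
]

blocked_today = must_wait(2, builds)
blocked_with_filter = must_wait(2, builds, {"build-and-publish"})
```

With the filter, commit 2 can proceed as soon as the earlier publishing build is done, while the hours-long regression run keeps going in the background.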

Event Timeline

Why do these plans need to run sequentially in the first place?

The expectation is that you can run regression tests without "Wait for Previous Commits".

My concern is that the regression tests will consume all of the Drydock leases, so other builds that need working copies on the host will stall until all of the regression tests have finished running.

Right now all of my build plans have "Wait for Previous Commits" so that builds for one project don't block builds for other projects due to Drydock limits being hit.

epriestley claimed this task.

See https://secure.phabricator.com/book/phabricator/article/drydock/

Drydock generally prioritizes responding to requests quickly over other concerns, like minimizing waste or performing complex scheduling. Although you can make adjustments to some of these behaviors, it generally assumes that resources are cheap compared to the cost of waiting for resource construction. ... Drydock may be a weak fit for a problem if it is bounded by resource availability and using resources as efficiently as possible is very important.

I don't know of any scenario in which physical hardware or renting virtual machines is cheap? That seems like it excludes Drydock from being suitable for literally any system where you need to run commands on machines :/

Anyway, if the intent is that the limit property shouldn't be used and its usage isn't supported, then perhaps it should be removed?

This task seems to describe expected, correct behavior of limit ("builds for one project ... block builds for other projects due to Drydock limits being hit").

Hmm, are you expecting that Harbormaster builds will only ever run for a few minutes then? The text in the Drydock documentation seems to indicate that you don't expect people to run builds that take hours.

I don't know of any scenario in which physical hardware or renting virtual machines is cheap?

I am building Harbormaster and Drydock for commercial companies, where the cost of hardware is small compared to the cost of personnel. I believe this to be true for essentially every commercial company. It is true even for us, and we don't pay salaries! We pay more for health insurance than for all development and production hardware combined. Hardware is incredibly cheap compared to personnel -- even personnel who work for free.

I am generally eager to solve problems by adding more hardware, and thrilled when I can reduce a problem to "add more hardware".

Reducing a problem to "add more hardware" is difficult. Harbormaster and Drydock are designed to be good at making this reduction.

Hmm, are you expecting that Harbormaster builds will only ever run for a few minutes then?

No. I am expecting Harbormaster and Drydock will eventually handle a very large amount of throughput, including long-running, complex jobs. But I am also expecting that hardware will generally be available to complete those jobs, and that it is more important for the system to be able to achieve huge throughput in a hardware surplus than resource-efficient throughput in a hardware deficit. When constraints are at odds, I choose better behavior in a surplus environment over better behavior in a deficit environment.

If you are very concerned about keeping costs to a minimum, Drydock's design is not aligned with your goals. By design, it has many features which consume resources inefficiently to achieve greater throughput, and will continue to gain these features in the future (for example, pre-allocation of resources it anticipates it may need). I expect these features to be desirable for essentially all commercial companies, because they translate into a large net savings by letting you spend a tiny amount of money on hardware instead of a large amount of money on personnel and make your personnel happier and more productive because builds complete faster.

(It is still possible that Drydock is the best system to choose even with these constraints, because I suspect no build systems focus on efficient use of small resource pools, as I believe no commercial companies have this problem. You might be able to build a build system on top of a system which does focus on efficient resource use, like Apache Mesos. Changes by Dropbox is a build system on Mesos, and might be worth looking at, although I do not know that it is specifically focusing on efficiency.)

It sounds like you found the Drydock article unclear on these points. How could I rewrite it to be more clear?


Constraints are not actually at odds here and there's no reason we can't add priorities to Harbormaster to improve deficit behavior without impacting surplus behavior negatively, but I plan to wait for more maturity and wider adoption of these systems before planning deficit-behavior features like this, and seek feedback from installs which have more alignment with the design goals of Harbormaster and Drydock.

It is entirely possible that I am in the wrong and many or most commercial companies are more concerned about hardware costs than build throughput, but I believe my assessment of needs is broadly reasonable based on my technical judgement and professional experience. To buttress this assessment, I am unable to find commercial build servers that advertise "low cost" or "efficient" execution of builds, while claims of "scale" and "throughput" or "parallelism" are common.

I don't know of any scenario in which physical hardware or renting virtual machines is cheap?

I am building Harbormaster and Drydock for commercial companies, where the cost of hardware is small compared to the cost of personnel. I believe this to be true for essentially every commercial company. It is true even for us, and we don't pay salaries! We pay more for health insurance than for all development and production hardware combined. Hardware is incredibly cheap compared to personnel -- even personnel who work for free.

The issue here is that making the cost comparison assumes that you can remove the cost of personnel as easily as removing the cost of hardware. It is far easier to "hire and fire" hardware on the fly and remove the hourly cost when it's not used than it is to remove the hourly cost of personnel. Or to put it another way, you're paying for engineers regardless of what you do; you can't reduce this cost without significant overhead. But you can reduce the cost of hardware easily: you just terminate instances when you don't need them.

In places that I work, we aim to keep AWS costs as low as possible, because they're something we can easily reduce by having software smart enough to reduce the usage of them. Regardless of how smart your software is, it can't reduce the cost of engineers.

It sounds like you found the Drydock article unclear on these points. How could I rewrite it to be more clear?

The documentation states:

This isn't to say that Drydock is grossly wasteful or has a terrible scheduler, just that efficient utilization and efficient scheduling aren't the primary problems the design focuses on.

But I personally consider paying for resources that you aren't using (and when it's trivial to remove the cost) wasteful. I'd suggest rewording it to explicitly emphasise that Drydock is intended to keep unused resources on standby and to potentially spend all your money on hardware doing so.

All the other build systems I know of are capable of shutting down instances during unused periods. Bamboo gives you the option to shut down and start instances on schedules; Jenkins by default shuts down instances after an idle period and can scale down. Both of these build systems also allow you to use spot instances, and gracefully handle scenarios when spot instances are shut down by rescheduling builds.

These features don't exist in Bamboo or Jenkins because people don't use them; they exist because there's demand for them and because it's trivial to save hundreds or thousands of dollars on instances by turning them off over weekends, or at nights or when they're otherwise not needed by demand.

From what you've described, you intend for Drydock explicitly not to handle these scenarios, because it prioritizes having resources available and ready just in case they might be needed at any moment, even if doing so incurs high costs.

This task does not describe a problem that is in any way related to reclaiming unused resources or autoscaling hardware pools.

It describes prioritization behavior in a resource deficit, and how your attempt to improve the behavior by using "Wait for Previous Commits" as a prioritization mechanism was not successful.

Improving this behavior is not a priority for the upstream and not a design priority for Drydock, and it is expected that "Wait for Previous Commits" cannot be used effectively as a prioritization mechanism, because it is not a prioritization mechanism.

If you believe Drydock should support autoscaling hardware resources, I agree. See T5544, which you filed.

If you believe Drydock should clean up unused resources, I agree. See T9994, which you argued with me at length in.

If you believe Harbormaster should support prioritizing builds, I tentatively agree but believe it is premature to plan this feature.

If you believe Harbormaster should support use of "Wait for Previous Commits" as a prioritization mechanism, as per this task, I disagree.

This task describes a deficit in being able to fairly schedule builds; that is, in preventing builds of one project from consuming all of the available Drydock resources. The response here (and, it feels like, to most Drydock-related tasks in general) was roughly "add more resources". I don't believe "add more resources" is the appropriate or necessarily viable solution to every problem here.

If you believe Harbormaster should support use of "Wait for Previous Commits" as a prioritization mechanism, as per this task, I disagree.

I don't necessarily believe that "Wait for Previous Commits" is the best solution; I'm describing a problem that I have and what I believe is a mechanism that could be used to achieve a solution to the problem that I specifically face. However, I don't know what, if any, other related problems have been reported in this area. Only you have the wider picture on the best approach for solving the problem. That's why I'm filing tasks that describe the problems I encounter i.e. "regression tests will consume all of the Drydock leases" and not just sending diffs through.

But the response and accompanying Wontfix does not come across as:

If you believe Harbormaster should support prioritizing builds, I tentatively agree but believe it is premature to plan this feature.

It comes across as, "I believe this problem can be solved by throwing more hardware at it".