Resource allocator does not create new host resources when one is already active
Open, Needs Triage · Public

Description

Situation: We want to run exactly 1 build on 1 machine. To do this, we set the limit of a working copy blueprint to the number of machines in the relevant Almanac service. When the working copy tries to lease a new host, every build is leased to the same host rather than spreading the load across all available hosts.

Expected Behaviour: When at capacity k < n and a new working copy resource is requested by a lease, a new host resource should be created and a new working copy resource created on the new machine.

Actual Behaviour: All working copies are created on the first (and only) host.

Reproduction Steps:

  • Set up an installation with 2 devices, ec1 and ec2.
  • Create a new service (linux) and add both devices as bindings.
  • In Drydock, create an Almanac Hosts blueprint for the linux service.
  • On the command line, run ./drydock lease --type host twice and observe that two leases for the same host are created rather than one for each host.

Versions

phabricator b371e90364de0f8214567f3f820a3c26b7ea627c (Dec 20 2016)
arcanist f1c45a3323ae20eefe29c0a22c7923fe8b151bbf (Jul 2 2016)
phutil 5fd2cf9d5ddd38424a54a8fba02398d527639970 (Jul 9 2016)
Local Version: Bitnami Phabricator 20160725-0

Analysis: Looking at the code this seems to be the expected behaviour.

Specifically, in DrydockLeaseUpdateWorker.executeAllocator, $resources is initially set to the result of loadResourcesForAllocatingLease, which queries the suitable resources that are available and then filters them according to how the canAcquireLeaseOnResource method is defined for that resource type. For hosts (in DrydockAlmanacServiceHostBlueprint) the method always returns true. Thus, if there is one active host, the whole body of code that initialises a new resource is skipped.
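To make the control flow concrete, here is a hypothetical Python sketch of the allocator behavior described above (the real code is PHP in Phabricator; these function names only loosely mirror the PHP methods):

```python
# Hypothetical sketch (not the actual Phabricator source) of the allocator
# logic described above.

def can_acquire_lease_on_resource(resource):
    # For Almanac host blueprints, the real method always returns true,
    # so any active host is considered acquirable.
    return True

def execute_allocator(active_resources, lease):
    # Filter active resources down to those the lease could acquire.
    usable = [r for r in active_resources
              if can_acquire_lease_on_resource(r)]
    if usable:
        # At least one acquirable resource exists, so the branch that
        # would create a new host resource is skipped entirely.
        return ("reuse", usable[0])
    return ("allocate-new", None)

# With one active host, every lease reuses it; a second host is never created.
print(execute_allocator(["host-A"], lease="build-1"))
```

This is why, once a single host resource is active, the "create a new resource" path is unreachable for this blueprint type.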

mpickering added a subscriber: bgamari.
chad added a subscriber: chad. (Edited Jan 23 2017, 3:30 PM)

Offhand, these don't look like current / valid versions of Phabricator. They are out of date and installed from somewhere other than our source. Please see Contributing Bug Reports for what we require in a bug report.

I will use a completely clean build tonight and reproduce this but the code hasn't changed since 2015.

OK -- I made a new EC2 instance with the Phabricator Marketplace image to make the install easier, updated to the latest stable, and could still reproduce this. I don't know why the version report says the branches are branched, because I ran git reset --hard origin/stable on all of them. I repeat -- this is a 100% fresh machine with the latest stable channel fetched from the Phacility repos.

phabricator 23c54262caf84abdad82974212170ff24d138f4d (Sun, Jan 22) (branched from ddf82a815b9a07e901870c2f4d5b7582af7b4d82 on origin)
arcanist 9503b941cc02be637d967bb50cfb25f852e071e4 (Fri, Jan 6) (branched from ade25facfdf22aed1c1e20fed3e58e60c0be3c2b on origin)
phutil 10963f771f118baa338aacd3172aaede695cde62 (Fri, Jan 13) (branched from 9d85dfab0f532d50c2343719e92d574a4827341b on origin)
Local Version: Bitnami Phabricator 2016.30-2

Upstream doesn't support Phabricator installed via third-party images (including Bitnami). You need to reproduce the issue either on a blank Phacility test instance, or by following the Installation Guide and reproducing it there, provided it's not an environment issue.

(I preemptively set the project back to Needs Information since the followup didn't involve reproduction of the issue on a valid install)

epriestley added a subscriber: epriestley. (Edited Jan 25 2017, 3:42 PM)

Without actually reproducing this or looking at the code, I believe this isn't a bug. You're expecting Drydock to choose particular allocation strategies:

  • Greedy allocation: prefer to allocate a new resource rather than reuse an existing resource.
  • Load balancing: when several resources could be used, distribute work among them evenly, for some definition of "evenly". (You don't discuss this explicitly, but presumably would consider Drydock putting work on hosts A, B, A, A, A, A, A, A, A... to be undesirable, even though that would technically satisfy the behavior you request during the initial condition.)
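The difference between the expected (greedy) behavior and Drydock's current (ungreedy) behavior can be sketched as follows. This is a hypothetical Python illustration, not Drydock's actual code; the strategy names and the `choose` function are invented for the example:

```python
# Hypothetical sketch contrasting the two allocation strategies discussed
# above. "GREEDY" prefers creating new resources up to the blueprint limit;
# "UNGREEDY" (the current behavior) reuses any existing resource first.

def choose(active_hosts, limit, strategy):
    if strategy == "GREEDY" and len(active_hosts) < limit:
        return "allocate-new-host"
    if active_hosts:
        return "reuse:" + active_hosts[0]
    return "allocate-new-host"

# Current (ungreedy) behavior: one active host absorbs every lease.
print(choose(["A"], limit=2, strategy="UNGREEDY"))  # reuse:A

# The behavior the reporter expected: spread out until the limit is hit.
print(choose(["A"], limit=2, strategy="GREEDY"))    # allocate-new-host
```

Either strategy is defensible depending on the workload, which is the crux of the disagreement here.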

These aren't universally the right strategies, so the behavior you want isn't unambiguously correct. Here are two scenarios where greedy allocation is worse than the strategy Drydock currently uses:

  • The build task is very fast compared to the cost of host allocation. In this case, running builds on existing hosts can produce results faster than allocating new hosts.
  • A real-world subcase of this is where the allocation blueprint launches new VMs in EC2. Launching a VM may be much slower than the build itself, and it has a financial cost. You might reasonably want to allocate builds in an order like, say, "A, A, A, A, launch new host, B, B, B, B, launch new host, C, C, C, C" in this case, if you use 4-vCPU hosts as build hosts and your build spends significant time waiting on a single CPU.

Separately, Drydock currently load balances by choosing resources randomly. There are lots of scenarios where this isn't the best strategy:

  • Some jobs might run a lot longer than other jobs, so "choose the resource with the fewest leases" or "round robin" might be a better strategy.
  • Some resources may have different capacities (e.g., a mixture of "small" and "large" build hosts) so weighted load balancing might be appropriate.
  • Some jobs may do something like produce a disk artifact, and sequential allocation might be better (e.g., you want to completely fill up machine A before moving on to machine B so that chronologically similar artifacts don't get spread across a host pool) -- this one is a little bit of a stretch, but I think not wholly a work of fantasy.

That said, I think selecting a load-balancing strategy is generally not hugely important and that "random" is usually good enough.
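The load-balancing strategies listed above could be sketched like this. Again a hypothetical Python illustration, not Drydock code; `make_picker` and the strategy names are invented for the example:

```python
# Hypothetical sketch of the load-balancing strategies mentioned above,
# selecting a host for the next lease from a fixed pool.

import itertools
import random

def make_picker(hosts, strategy):
    """Return a function that picks a host for the next lease."""
    rr = itertools.cycle(hosts)

    def pick(leases):
        # leases: dict mapping host name -> current lease count.
        if strategy == "random":          # what Drydock currently does
            return random.choice(hosts)
        if strategy == "fewest-leases":   # better when job lengths vary
            return min(hosts, key=lambda h: leases.get(h, 0))
        if strategy == "round-robin":
            return next(rr)
        raise ValueError(strategy)

    return pick

pick = make_picker(["A", "B"], "fewest-leases")
print(pick({"A": 5, "B": 1}))  # B
```

Each variant is a one-line policy swap, which is part of why "random" being good enough most of the time makes it a reasonable default.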


Upshot:

  • We use a safe (ungreedy) allocation strategy by default, and currently have no way to configure Drydock to use a greedy allocation strategy instead.
  • A greedy strategy is desirable in a common case (long builds against a fixed pool of hosts), and should be available (and possibly the default, at least for "Almanac Hosts" blueprints).
  • However, the ungreedy strategy isn't a bug -- it's a valid allocation strategy which is desirable in some contexts -- and should also remain available. Other strategies should become available, too, as we learn about workloads they can support.
  • Beyond allocation strategies, load balancing strategies will also probably need more configuration in the long term to support varied workloads.

I believe you can work around this today by disabling the binding to host "A" in Almanac, running one job (which will be forced to allocate on host "B"), then re-enabling the binding. After that, both hosts will have resources and jobs will allocate randomly, which should be good enough. This is exceptionally cumbersome and ridiculous, of course (and it's possible that it doesn't even work).


Broadly, the context that much of Drydock exists in is that we (Phabricator) have ~15-ish second builds and they're great. I don't want to bake an assumption that everyone's builds take forever into Drydock, and Drydock development usually targets us first since we're the only install we can test. When a new feature gets written, the first version tends to have behavior which makes sense in the upstream context of fast builds and excess resources.

From an engineering viewpoint, this is the context that I think every project should exist in. In reality, of course, few do. After I publish my memoir ("From Abra to Zubat: How to Engineer Software Correctly") I am confident that my unique insights in chapters like "Slow Builds Are Not Good" and "Humans Cost A Lot More Per Hour than Computers Do" will usher in a new era in industry, so we only need to hold out until then.

So my point here is just that: I don't want to compromise on supporting fast builds; we have fast builds; we target us first; so defaults and initial versions tend to make sense for projects with fast builds. Since fast builds aren't common, these initial behaviors won't be right for everyone and may not even be the right defaults. I'm cognizant that we need first-class support for slow builds too, the hard-coded "fast build" option just tends to get built before the "fast build or slow build" toggle -- here, support for selecting and configuring different allocation strategies.

See also T7006 ("Harbormaster targets execute multiple times if they run for longer than 24 hours") for an example of builds in industry. That build didn't truly run for 24 hours, but I think there's a healthy dose of self-made problems in some of the build use cases we've seen. We'll support most of these and become more robust against other failure cases (like bad builds outputting 100GB of data which you have a regulatory obligation to retain), but they aren't what we're targeting in the first cut.

A short comment now. Thank you for the very detailed reply.

The reason I filed this as a bug is that it seems that there is no situation where the second host will ever be allocated. Is there some scenario which can trigger both bindings in a host group to be brought online? I am trying to understand the intention behind having host groups if this is never the case.

Is there some scenario which can trigger both bindings in a host group to be brought online?

Not currently, unless I'm misremembering.

trying to understand the intention behind having host groups if this is never the case

Because such scenarios will exist in the future, and we didn't need to create these scenarios yet to do useful work with Drydock.

Although it's clear that you should be able to configure allocation to work differently than it does, it's not yet clear what the UI (or even, really, what the underlying model) should look like. I think the most up-to-date discussion is in here: D16565#195217 (second block).