
Support complex resource constraints across multiple resources in Drydock
Open, Wishlist, Public

Description

Drydock does not easily handle cases like this:

  • Harbormaster wants a working copy of rX.
  • Drydock is set up to try to build this by building a "working copy cache" resource first, then building the working copy from it (perhaps because this is faster than doing a network clone in the given environment).
  • It's not currently easy to build this in a way that limits the number of active working copies per-host (say, 25 per host). It's easy to say "5 working copy caches per host, 5 working copies per cache", but this may not use resources very efficiently. It's not obvious how to say "get a working copy cache, and also 25 working copies per host", at least without putting a whole lot of coordination logic into the blueprints.

Resolving this may actually be possible and straightforward, but there's no specific recommended approach for it right now.
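
One possible shape for such an approach, sketched very roughly below in Python: nothing here is real Drydock API, and both the object model and the counting rule are assumptions about one way this could work. The idea is to enforce the per-host cap by counting working copy leases across every cache on a host, rather than capping each cache independently.

```
# Hypothetical sketch only; these names are not Phabricator/Drydock API.
MAX_WORKING_COPIES_PER_HOST = 25  # the per-host cap from the example above


def can_allocate_working_copy(host, caches, lease_counts):
    """Return True if `host` can accept one more working copy lease.

    `caches` is the set of working copy cache resources and `lease_counts`
    maps a cache ID to its active lease count; both are illustrative
    stand-ins for real Drydock objects.
    """
    caches_on_host = [c for c in caches if c.host_id == host.id]
    total = sum(lease_counts.get(c.id, 0) for c in caches_on_host)
    return total < MAX_WORKING_COPIES_PER_HOST
```

The point of the sketch is that the limit is owned by the host dimension rather than by either blueprint individually, which is roughly the coordination logic the description is trying to avoid pushing into the blueprints themselves.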

(It's also not clear that this scenario is a very strong driver for such support. In particular, the expectation is that working copies are recycled and serve implicitly as working copy caches. The newer WorkingCopy blueprint (which implements this more explicitly) and the actual resource lifecycle (which can do this properly) may moot this scenario. We'll see how things work in production after T9123.)

There are various other related scenarios (maybe a resource needs 2 other resources) that likely face the same challenges, but the ground here generally feels very hypothetical for now.


Original Description

So I've been wrestling with this problem today. In our setup, the AWS EC2 host blueprint is constrained to have a maximum of 5 host resources. It will never exceed this amount.

When the working copy blueprint is leased against, often the following happens:

  • When the working copy blueprint needs to allocate a resource, it acquires a lease from the host blueprint and then saves the resource ID backing that lease.
  • All future leases taken against that working copy are forced to lease against that saved host resource ID, because the cache provided by the working copy is host-specific (sketched below).
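
A minimal sketch of that pinning behaviour, assuming hypothetical class and method names rather than the real blueprint code:

```
# Hypothetical names throughout; this is not the real working copy blueprint.
class WorkingCopyResource:
    def __init__(self, host_lease):
        # Remember which host resource backs this working copy's cache.
        self.host_resource_id = host_lease.resource_id

    def acquire_host_lease(self, host_blueprint):
        # Every later host lease for this working copy is pinned to the
        # saved resource ID, bypassing the host blueprint's normal
        # "pick any host with capacity" behaviour.
        return host_blueprint.lease(resource_id=self.host_resource_id)
```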

This causes a few problems:

  • If there are no constraints on the working copy (i.e. it is allowed to perform as many leases per resource as it wants), then the creation of those leases will cause the host resource to get overleased, beyond what would normally be desired. For example, if the host blueprint is configured for an ideal of 5 leases per resource, with a maximum of 5 resources, the working copy will bypass these settings and you'll end up with 25 leases on a single resource (instead of 5 leases on 5 resources, with each resource also having a working copy lease); see the toy illustration after this list.
  • If there are constraints on the working copy, it has no way to enforce host uniqueness. When it acquires a host lease as part of allocating the working copy resource, there's no guarantee that a new host will be created. I could force it to always allocate a new resource via a parameter in the lease attributes (much as the resourceID and blueprintPHID lease attributes perform filtering in the latest patches), but at that point the working copy blueprint has too much control over the allocation behaviour.
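
To make the first problem concrete, here is a toy model in Python. The numbers (5 leases per resource, 5 resources) come from the example above; the placement functions themselves are hypothetical, not Drydock code.

```
HOST_LEASE_LIMIT = 5    # ideal leases per host resource
MAX_HOST_RESOURCES = 5  # maximum number of host resources


def place_unpinned(lease_counts, n):
    # Respect the host blueprint's limits: spread leases across hosts.
    for _ in range(n):
        host = min(range(MAX_HOST_RESOURCES), key=lambda h: lease_counts[h])
        if lease_counts[host] >= HOST_LEASE_LIMIT:
            raise RuntimeError("no host capacity left")
        lease_counts[host] += 1


def place_pinned(lease_counts, host, n):
    # Bypass the limits: every lease lands on the pinned host.
    lease_counts[host] += n


balanced = [0] * MAX_HOST_RESOURCES
place_unpinned(balanced, 25)
print(balanced)  # [5, 5, 5, 5, 5]

pinned = [0] * MAX_HOST_RESOURCES
place_pinned(pinned, 0, 25)
print(pinned)    # [25, 0, 0, 0, 0]
```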

Basically, I don't know how we're going to solve the issue where:

A resource on one blueprint has a 1:1 mapping to a resource on another blueprint, where either or both of the blueprints have resource constraints.

@epriestley do you have any suggestions on how we might architect Drydock to solve this issue?

Event Timeline

hach-que assigned this task to epriestley.
hach-que raised the priority of this task to Needs Triage.
hach-que updated the task description.
hach-que added a project: Drydock.
hach-que added subscribers: hach-que, epriestley.

I think we also need a field on a lease to indicate that it's not a permanent lease, or that it's a caching lease, or something of that nature. Basically, the host lease that the working copy holds for its cache shouldn't prevent the host resource from being closed if shouldCloseUnleasedResource returns true.
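
A rough sketch of that idea: shouldCloseUnleasedResource is the hook named above, but the `is_caching` flag and the close check below are assumptions about how such a field might be consulted.

```
# The `is_caching` flag is hypothetical; only shouldCloseUnleasedResource
# corresponds to an existing hook.
class Lease:
    def __init__(self, is_caching=False):
        # True for leases that only exist to keep a cache warm; they should
        # not keep the underlying resource alive on their own.
        self.is_caching = is_caching


def resource_should_close(leases, close_unleased_resources):
    # Ignore caching leases when deciding whether the resource is
    # effectively unleased.
    real_leases = [lease for lease in leases if not lease.is_caching]
    return close_unleased_resources and not real_leases
```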

From IRC:

[00:34:49] <hachque> the only other thing I can think of is maybe separating out the working copies from the working copy caches
[00:34:58] <hachque> so having two blueprints there
[00:35:08] <hachque> and the working copy caches are always a 1:1 of host to repository
[00:35:39] <hachque> and then the working copy leases try to look up or allocate a working copy cache resource for the host they picked
[00:35:58] <hachque> that would remove the "i must lease against the same host resource that my resource is" problem
[00:36:17] <hachque> because the working copy would just get a host lease on lease (and not on allocation of the working copy resource)
[00:36:25] <epriestley> Yeah. I think that's also reasonable, and a bit less complex, although still probably more complex than I'd expect v0 to be.
[00:37:03] <hachque> yeah I do think that this stuff is probably v1
[00:37:10] <hachque> like my implementation also has some crazy logic for submodule caching
[00:37:21] <hachque> because I have some repos with lots of nested submodules that need to be cached as well
[00:37:38] <hachque> but v0 can probably be like my code - submodule logic - caching
[00:37:52] <hachque> and it should work with some minor changes
[00:38:22] <hachque> thinking about it more, i think separating the working copies from working copy caches might be the best way forward
[00:38:37] <hachque> since that removes any requirement to add filtering on host leases at all
[00:39:10] <hachque> oh i guess you still need to ask it to filter when trying to obtain a lease on the working copy cache resource
[00:39:23] <hachque> but if that fails because the host can't lease any more, then the lease can just fail
[00:39:32] <hachque> and the working copy blueprint can fall back to a direct clone
[00:40:12] <hachque> if I re-architect it like that it probably also makes it much easier to split out v0 no caching and v1 w/ caching
[00:40:38] <hachque> because the caching functionality will reside in a separate blueprint + some extra logic in the working copy blueprint that can be reviewed separately

Seems like a reasonable idea, and would split out @epriestley's ideal v0 feature set (no caching) from the v1 feature set (optional caching if the host has room for it).
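
A hedged sketch of that split, assuming entirely hypothetical blueprint objects: the working copy gets a host lease first, then tries to find or allocate the single cache for that (host, repository) pair, and falls back to a direct clone when the host has no capacity left for a cache lease.

```
# Hypothetical sketch; none of these classes or methods exist in Drydock.
class NoCapacityError(Exception):
    """Raised when a host cannot accept another cache lease."""


def acquire_working_copy(host_blueprint, cache_blueprint, repository):
    # The working copy no longer has to lease against the same host
    # resource that backed its parent resource; it just picks a host.
    host_lease = host_blueprint.acquire_lease()

    try:
        # Look up or allocate the 1:1 cache for this (host, repository).
        cache = cache_blueprint.find_or_allocate(
            host_id=host_lease.resource_id,
            repository=repository,
        )
        return ("clone-from-cache", host_lease, cache)
    except NoCapacityError:
        # The host can't carry a cache lease, so degrade to a direct clone.
        return ("direct-clone", host_lease, repository)
```

This keeps the caching behaviour in its own blueprint, which lines up with the v0 (no caching) versus v1 (optional caching) split described above.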

epriestley renamed this task from "Discussion: How should the working copy blueprint behave when the host blueprint is resource constrained?" to "Support complex resource constraints across multiple resources in Drydock". Sep 23 2015, 5:34 PM
epriestley removed epriestley as the assignee of this task.
epriestley triaged this task as Wishlist priority.
epriestley updated the task description.
epriestley moved this task from Backlog to Far Future on the Drydock board.