[worker/core] Allow yielding in workers to set continuation data
AbandonedPublic
Actions

Authored by hach-que on Aug 8 2015, 11:32 PM.

Details

Reviewers

epriestley

Group Reviewers

Blessed Reviewers

Summary

For Drydock-related diffs, we need to yeild with continuation data. This is because the code needs to appropriately skip ahead work that has already been done if the process yields later. For example, if the worker yields during resource allocation, we do not want to perform any of the lease calculation code again; we want to be able to jump straight ahead to the code that performs resource allocation and re-enter there.

Test Plan

Wrote unit tests for the yield with continuation data, and ran arc unit.

Diff Detail

Repository

rP Phabricator

Branch

task-yield-data

Lint

Lint Passed

Unit

Tests Passed

Build Status

Buildable 7544
Build 8128: [Placeholder Plan] Wait for 30 Seconds
Build 8127: arc lint + arc unit

Event Timeline

hach-que updated this revision to Diff 33407.Aug 8 2015, 11:32 PM

hach-que retitled this revision from to [worker/core] Allow yielding in workers to set continuation data.

hach-que updated this object.

hach-que edited the test plan for this revision. (Show Details)

hach-que added a reviewer: epriestley.

Herald added a reviewer: Blessed Reviewers. · View Herald TranscriptAug 8 2015, 11:32 PM

Herald added a subscriber: Korvin. · View Herald Transcript

Where do we need this, specifically?

We need it:

Where the Drydock patches extend the lock timeout (we need to yield if we don't obtain the lock)
Where the Drydock allocator itself performs any sleeps or retries
Where any blueprint lease allocation is dependant on the resource moving into an open status (we need to yield instead of sleeping)
Where any blueprint resource or lease allocation needs to wait for the resource to become ready (we need to yield instead of sleeping)

In these scenarios we need to "skip ahead" to the part we want to continue at (after the yield). In order to build the code as a state machine that can skip ahead like this, we need to be able to store progress between the task being yielded and the next daemon picking it up.

Technically we could store that information as Drydock lease attributes, but given that this state information is entirely related to task progress and doesn't actually constitute attributes of a lease, I'm hesitant to do that.

joshuaspence added a subscriber: joshuaspence.Aug 9 2015, 10:49 AM

hach-que mentioned this in Z1336: General Chat.Aug 24 2015, 1:35 AM

hach-que mentioned this in D10304: [drydock/core] Implement resource constraints and correct locking mechanisms with yielding for Drydock.Aug 24 2015, 2:39 AM

For each case, what data do we need to store?

Why can't the workers just restart from the beginning?

This revision now requires changes to proceed.Aug 24 2015, 6:22 PM

Okay, so this scenario is an example:

The EC2 blueprint has allocated a resource, and has created the spot instance request in AWS. The spot instance request has been fulfilled but the instance hasn't started yet. The allocator is currently residing inside allocateResource, where it needs to wait for AWS to finalize allocation of the resource.

Currently that loop uses while (true) { /* do some check */ sleep(5); }, but this blocks the queue. Instead, we need to use yielding.

However, in order to correctly continue from the right location in the code, we need to know a few things:

The general allocation logic (directly inside the worker) needs to know that the task is in the middle of allocating a resource, and that it should jump straight into allocateResource instead of performing any lease acquisition logic.
The EC2 allocator needs to know that the spot instance request has been fulfilled, and the spot instance request ID that it can look up.

More roughly, we basically need to store the state of the allocation (all local variables) into the yield exception so that we can continue exactly where we left off when the daemon picks up the task again.

Do any of these cases exist in HEAD, or in the scope of T9252 (where all host leases are immediately allocatable from Almanac)?

I don't believe these cases exist for static infrastructure.

As per https://docs.google.com/spreadsheets/d/13IQBWcjwEtttwJQJ7QzrnPoyQBxhA80oB2q3fynDvks/edit#gid=0, I've managed to get away without this change for a large portion of the Drydock infrastructure changes (those that are common to all blueprints). This change is only important when Drydock starts leasing from external services which may have delays in operations (either during resource allocation or lease acquisition).

For the KVM implementation, I just saved state on the resource / lease as it was processed.

Yeah, I'm kind of doing some of that too. It feels okay, but a little iffy. I'm thinking about possibly providing some special "working data" area for Harbormaster and Drydock stuff that build steps / blueprints are free to write to without worrying about overwriting anything, but don't have any concrete plans for that.

I think writing to the main object should continue working indefinitely, there may just be a slightly cleaner approach eventually.