Page MenuHomePhabricator

[worker/core] Allow yielding in workers to set continuation data
AbandonedPublic

Authored by hach-que on Aug 8 2015, 11:32 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 10, 11:41 PM
Unknown Object (File)
Sat, Jan 4, 3:16 AM
Unknown Object (File)
Wed, Jan 1, 4:52 PM
Unknown Object (File)
Dec 13 2024, 9:32 PM
Unknown Object (File)
Nov 28 2024, 3:59 PM
Unknown Object (File)
Nov 24 2024, 1:44 PM
Unknown Object (File)
Nov 20 2024, 2:09 PM
Unknown Object (File)
Nov 1 2024, 11:27 PM

Details

Reviewers
epriestley
Group Reviewers
Blessed Reviewers
Summary

For Drydock-related diffs, we need to yeild with continuation data. This is because the code needs to appropriately skip ahead work that has already been done if the process yields later. For example, if the worker yields during resource allocation, we do not want to perform any of the lease calculation code again; we want to be able to jump straight ahead to the code that performs resource allocation and re-enter there.

Test Plan

Wrote unit tests for the yield with continuation data, and ran arc unit.

Diff Detail

Repository
rP Phabricator
Branch
task-yield-data
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 7544
Build 8128: [Placeholder Plan] Wait for 30 Seconds
Build 8127: arc lint + arc unit

Event Timeline

hach-que retitled this revision from to [worker/core] Allow yielding in workers to set continuation data.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.

Where do we need this, specifically?

We need it:

  • Where the Drydock patches extend the lock timeout (we need to yield if we don't obtain the lock)
  • Where the Drydock allocator itself performs any sleeps or retries
  • Where any blueprint lease allocation is dependant on the resource moving into an open status (we need to yield instead of sleeping)
  • Where any blueprint resource or lease allocation needs to wait for the resource to become ready (we need to yield instead of sleeping)

In these scenarios we need to "skip ahead" to the part we want to continue at (after the yield). In order to build the code as a state machine that can skip ahead like this, we need to be able to store progress between the task being yielded and the next daemon picking it up.

Technically we could store that information as Drydock lease attributes, but given that this state information is entirely related to task progress and doesn't actually constitute attributes of a lease, I'm hesitant to do that.

epriestley edited edge metadata.

For each case, what data do we need to store?

Why can't the workers just restart from the beginning?

This revision now requires changes to proceed.Aug 24 2015, 6:22 PM

Okay, so this scenario is an example:

The EC2 blueprint has allocated a resource, and has created the spot instance request in AWS. The spot instance request has been fulfilled but the instance hasn't started yet. The allocator is currently residing inside allocateResource, where it needs to wait for AWS to finalize allocation of the resource.

Currently that loop uses while (true) { /* do some check */ sleep(5); }, but this blocks the queue. Instead, we need to use yielding.

However, in order to correctly continue from the right location in the code, we need to know a few things:

  • The general allocation logic (directly inside the worker) needs to know that the task is in the middle of allocating a resource, and that it should jump straight into allocateResource instead of performing any lease acquisition logic.
  • The EC2 allocator needs to know that the spot instance request has been fulfilled, and the spot instance request ID that it can look up.

More roughly, we basically need to store the state of the allocation (all local variables) into the yield exception so that we can continue exactly where we left off when the daemon picks up the task again.

Do any of these cases exist in HEAD, or in the scope of T9252 (where all host leases are immediately allocatable from Almanac)?

I don't believe these cases exist for static infrastructure.

As per https://docs.google.com/spreadsheets/d/13IQBWcjwEtttwJQJ7QzrnPoyQBxhA80oB2q3fynDvks/edit#gid=0, I've managed to get away without this change for a large portion of the Drydock infrastructure changes (those that are common to all blueprints). This change is only important when Drydock starts leasing from external services which may have delays in operations (either during resource allocation or lease acquisition).

For the KVM implementation, I just saved state on the resource / lease as it was processed.

Yeah, I'm kind of doing some of that too. It feels okay, but a little iffy. I'm thinking about possibly providing some special "working data" area for Harbormaster and Drydock stuff that build steps / blueprints are free to write to without worrying about overwriting anything, but don't have any concrete plans for that.

I think writing to the main object should continue working indefinitely, there may just be a slightly cleaner approach eventually.