
Implement Amazon EC2 blueprint for Drydock
Abandoned · Public

Authored by hach-que on Aug 9 2014, 5:56 PM.
Tags
None

Details

Reviewers
epriestley
Group Reviewers
Blessed Reviewers
Summary

This implements an Amazon EC2 blueprint that allocates and leases hosts from EC2. It starts EC2 instances according to the parameters specified on the blueprint, using the public key of a Passphrase credential as the AWS key pair (which means Harbormaster can automatically connect to the machine on default Linux-based AWS images).

It handles allocation of Elastic IPs, which provide a public IP address for connecting to the allocated machine when Phabricator is not hosted in the same VPC (or on Amazon at all).

This is built to allocate into a VPC; it doesn't support allocating EC2-Classic instances (mainly because AWS is deprecating that model anyway).

It provides min-count, max-count and leases-per-instance parameters so the user can configure the ideal setup and set the minimum and maximum number of hosts for the blueprint.
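
A rough sketch of how those tunables might be surfaced on the blueprint; the field keys, types, and the getFieldSpecifications() shape below are assumptions modelled on other Drydock blueprints, not the actual diff:

public function getFieldSpecifications() {
  // Hypothetical field definitions; keys, types and captions are
  // illustrative only.
  return array(
    'min-count' => array(
      'name' => pht('Minimum Instances'),
      'type' => 'int',
      'caption' => pht('Always keep at least this many instances running.'),
    ),
    'max-count' => array(
      'name' => pht('Maximum Instances'),
      'type' => 'int',
      'caption' => pht('Never run more than this many instances.'),
    ),
    'leases-per-instance' => array(
      'name' => pht('Leases Per Instance'),
      'type' => 'int',
      'caption' => pht('Maximum number of concurrent leases per instance.'),
    ),
  );
}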

This also implements the appropriate logic for performing cleanup when releasing leases or closing resources (which in this case terminates the instance and releases the Elastic IP).
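
A minimal sketch of that cleanup path; callEC2Method() stands in for whatever EC2 API wrapper the blueprint actually uses, and the attribute names are assumptions:

private function cleanupResource(DrydockResource $resource) {
  // Hypothetical sketch of closing a resource: terminate the backing
  // instance and release its Elastic IP.
  $instance_id = $resource->getAttribute('instance-id');
  $allocation_id = $resource->getAttribute('eip-allocation-id');

  $this->callEC2Method(
    'TerminateInstances',
    array('InstanceId.1' => $instance_id));

  if ($allocation_id) {
    // Release the Elastic IP so it stops accruing charges once the
    // instance it pointed at is gone.
    $this->callEC2Method(
      'ReleaseAddress',
      array('AllocationId' => $allocation_id));
  }
}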

NOTE: This hasn't been tested with Windows AMIs yet, because that requires significantly more setup to test. Basically, a Windows machine needs to be configured with FreeSSHD and the image saved; Amazon provides no default Windows AMIs with an SSH daemon running out of the box (and with the key pair specified).
Test Plan

Tested with various configurations of min-count, max-count and leases-per-instance and saw the correct behaviour when leasing and releasing via bin/drydock.
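
For reference, leases were requested from the command line with an invocation along these lines (the same command appears verbatim later in this thread); the corresponding release was also driven through bin/drydock:

bin/drydock lease --type host --attributes platform=linux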

In addition, this has been tested with a Harbormaster build plan that allocates a Linux host and runs uname -a on it, which worked correctly, as can be seen in:

harbormasteraws.png (1×1 px, 165 KB)

Event Timeline

hach-que retitled this revision from to Implement Amazon EC2 blueprint for Drydock.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.
hach-que edited the test plan for this revision. (Show Details)
hach-que edited edge metadata.
hach-que added inline comments.
src/applications/drydock/interface/command/DrydockSSHCommandInterface.php
63

This prevents SSH from prompting for a password when using bin/drydock lease on the command line (due to the connection test); bin/drydock lease has a TTY available, so SSH would otherwise prompt.
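
For context, a sketch of the usual way to get that behaviour when building the ssh command; the method and variable names are hypothetical, and the exact option list used by DrydockSSHCommandInterface may differ:

private function newSSHFuture($user, $host, $key_file, $command) {
  // BatchMode makes ssh fail fast instead of prompting for a password when
  // key authentication doesn't work, even when a TTY is attached (as it is
  // when running `bin/drydock lease` from a terminal).
  return new ExecFuture(
    'ssh -o BatchMode=yes -o StrictHostKeyChecking=no -i %s %s@%s %s',
    $key_file,
    $user,
    $host,
    $command);
}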

Ensure $caused_by_closing_resource is true when releaseLease is called from closeResource.

Mark resources as CLOSING to avoid races and allow leasing on pending resources

Fix race conditions and exceptions

So there are still possible race conditions within Drydock when it comes to allocating resources. One particular scenario I hit was:

There are more than 8 taskmasters capable of picking up tasks, there are currently 0 host resources, and 4 builds all start at the same time (each with Lease Host as its first task).

Basically what happens here is that 4 of the taskmasters pick up the Lease Host task and all start allocating at the same time. They check whether there are open or pending resources (there aren't), so each decides to allocate a resource (which passes the resource limits, because according to each allocator there are 0 open and 0 pending resources). This race condition is present up until allocateResource first calls save(). You can see the race condition in effect here:

drydockrace.png (938×1 px, 98 KB)

To solve this race condition, I think we need to hold a global lock over the full allocator logic and release it right after save() is first called to place a pending resource. We could do this by creating the resource inside the allocator, saving it, unlocking, and then passing it to allocateResource. Then we need logic to clean up the resource if allocateResource fails (especially since other allocators may have leased against it while it was pending).

epriestley edited edge metadata.

For the allocator race, I think your solution sounds good, except that I'd try to do as little logic as possible in the lock. Can the logic go roughly like this?

  • Lock on resource type.
  • Check if we can allocate; if we can, save a new STATUS_ALLOCATING resource (this is a new "placeholder" status which has no logic and just makes other allocators fail) with enough data that it "counts" for other allocators.
  • Release the lock.
  • Then have executeAllocateResource() (or whatever) fill in the resource instead of creating it?

In particular, I definitely don't want to hold locks while making AWS service calls.
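
A minimal sketch of that flow; the lock name, the canAllocateMoreResources()/savePlaceholderResource() helpers and the STATUS_ALLOCATING placeholder are hypothetical, while executeAllocateResource() is the name floated above:

private function allocateResourceWithLock(
  DrydockBlueprint $blueprint,
  DrydockLease $lease) {

  $lock = PhabricatorGlobalLock::newLock('drydock.allocator');
  $lock->lock(15); // Block for up to 15 seconds waiting for the lock.

  try {
    // Only cheap bookkeeping happens under the lock: check the limits...
    if (!$this->canAllocateMoreResources($blueprint)) {
      throw new Exception(pht('Resource limits reached.'));
    }

    // ...and save a placeholder resource in a STATUS_ALLOCATING state so
    // that other allocators see it and count it against the limits.
    $resource = $this->savePlaceholderResource($blueprint);
  } catch (Exception $ex) {
    $lock->unlock();
    throw $ex;
  }

  $lock->unlock();

  // Slow AWS service calls happen only after the lock is released; they
  // fill in the placeholder rather than creating a new resource.
  return $this->executeAllocateResource($resource, $lease);
}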

Some inlines in general. Major issues are "select all stuff" and locking behavior.

About half of this is fine to upstream without resolving the tricky bits -- maybe separate out the field stuff and things like the SSH connection timeout? We can get those in as-is and that will make this easier to move forward with.

src/applications/drydock/blueprint/DrydockAmazonEC2HostBlueprintImplementation.php
208–215

Can this use $this->getInterface()?

287–306

Do we need this special-over-the-limit behavior for now? It kind of seems like bad news. If we just say "no", what goes wrong?

313–318

Did we already do this? It seems weird that we let you acquire leases on pending resources...

325–350

Some copy/paste junk from the preallocated host blueprint, I think.

381

(This should be SCP in the non-Windows world, probably...)

src/applications/drydock/blueprint/DrydockBlueprintImplementation.php
130–150

I don't want to pursue this, at least for now -- we should assume there may be a very large number of resources and leases (e.g., at least a million of each). We should not need to load them all to perform allocation tasks.

Behaviors like balancing loads across nodes are reasonable, but we should pursue them a few iterations out, once all the basics work, and in ways that avoid "load everything".

353–357

Same deal as above.

This revision now requires changes to proceed. Aug 11 2014, 9:35 PM
src/applications/drydock/blueprint/DrydockAmazonEC2HostBlueprintImplementation.php
287–306

Builds will randomly fail. Executing another build in parallel on the same machine and taking the performance hit is probably preferable to allocating another EC2 instance (or to specifying a higher limit and having slower builds by default).

313–318

Previously it didn't lease against pending resources, but it was then impossible to adhere to resource limits because a resource could be allocating / pending and wouldn't count as available for new leases.

What this means is that previously you could start 10 builds and it would have to start 10 EC2 instances (resources), even though it's probably faster to wait for an existing resource to finalize allocation and lease against that.

325–350

I think this is still needed (it creates a directory underneath the path specified on the blueprint, so that leases have a unique directory on the host).
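
A rough sketch of the idea (variable and attribute names are assumptions); it matches the /srv/<lease id> paths visible in the lease attributes in the database dump later in this thread:

private function initializeLeasePath(DrydockLease $lease, $blueprint_path) {
  // Hypothetical sketch: give each lease its own working directory under
  // the path configured on the blueprint, e.g. /srv/324/ for lease 324.
  $path = rtrim($blueprint_path, '/').'/'.$lease->getID().'/';

  // Record the path on the lease (it shows up in the lease attributes as
  // "path" in the dump below).
  $lease->setAttribute('path', $path);

  // The directory itself is then created on the host over the lease's SSH
  // command interface, roughly `mkdir -p <path>`.
  return $path;
}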

381

Probably. The SFTP interface has been tested pretty well in production for uploading files from a host, for about the last 6 months, so while we could add an SCP interface for non-Windows, I don't think it's highly valuable right now.

src/applications/drydock/blueprint/DrydockBlueprintImplementation.php
130–150

These are actually only used for counting the number of resources / leases on a resource. So if scalability is an issue, we should add a field onto blueprints / resources that indicates how many objects they currently have allocated / leased out.

Then we can leave it up to shouldAllocateLease to query what it needs to.

353–357

Again this is just used to count the number of resources, so this can use the same field as the above logic.
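
A sketch of the counting approach using an aggregate query instead of loading every object; the table and columns match the drydock_lease dump later in this thread, and the status constant is assumed to be the 'released' value (status 2 in that dump):

function countActiveLeases(DrydockResource $resource) {
  // Count unreleased leases on a resource without loading every lease row.
  $table = new DrydockLease();
  $conn_r = $table->establishConnection('r');

  $row = queryfx_one(
    $conn_r,
    'SELECT COUNT(*) N FROM %T WHERE resourceID = %d AND status != %d',
    $table->getTableName(),
    $resource->getID(),
    DrydockLeaseStatus::STATUS_RELEASED);

  return (int)$row['N'];
}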

hach-que edited edge metadata.

Rebased on master

Still need to fix the allocator lock.

  • Move resource and lease counting to DrydockAllocationContext

No need to re-review this yet, just pushed an update so I can access it on different computers.

  • Fix race conditions in resource allocation

So I've verified that I've fixed the major race condition which allowed more than the maximum number of resources to be allocated (as per @epriestley's instructions).

However, there's now a scenario that appears if you run the following command against the allocator:

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do bin/drydock lease --type host --attributes platform=linux & done

What happens is that (correctly) only two resources are placed into the pending state, and those leases resolve correctly. There may be 1 or 2 additional leases that lease correctly against the pending resources. However, a substantial number of leases simply won't resolve; the bin/drydock process never terminates. I don't know what's causing this, but my guess is that it's something in the new locking logic.

I'm not sure what aspect of this code could be causing a permanent wait / deadlock, but I've verified that it isn't waiting for resources to open (i.e. leases against pending resources).

src/applications/drydock/blueprint/DrydockAmazonEC2HostBlueprintImplementation.php
336–337

I've verified that it is not waiting here.

Note that when I set the lease and resource statuses manually, I get:

[2014-08-17 11:56:48] EXCEPTION: (Exception) Lease has already been released! at [<phabricator>/src/applications/drydock/storage/DrydockLease.php:152]
  #0 DrydockLease::waitForLeases(array) called at [<phabricator>/src/applications/drydock/storage/DrydockLease.php:182]
  #1 DrydockLease::waitUntilActive() called at [<phabricator>/src/applications/drydock/management/DrydockManagementLeaseWorkflow.php:50]
  #2 DrydockManagementLeaseWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:394]
  #3 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:290]
  #4 PhutilArgumentParser::parseWorkflows(array) called at [<phabricator>/scripts/drydock/drydock_control.php:21]

This implies that the leases do exist in the database, but just aren't appearing against any known resource.

This is the status of all the (unreleased) leases in the database. Curiously, most of them don't appear to have a resource ID, yet the leases still exist.

MariaDB [phabricator_drydock]> select * from drydock_lease where status <> 2;
+-----+--------------------------------+------------+--------+-------+-----------+------------------------------------------+-------------+--------------+--------+--------------+                                                                                             
| id  | phid                           | resourceID | status | until | ownerPHID | attributes                               | dateCreated | dateModified | taskID | resourceType |
+-----+--------------------------------+------------+--------+-------+-----------+------------------------------------------+-------------+--------------+--------+--------------+
| 324 | PHID-DRYL-opjtnkcoafy3rnrim3v3 |         94 |      1 |  NULL | NULL      | {"platform":"linux","path":"\/srv\/324"} |  1408276707 |   1408276785 |      0 | host         |
| 325 | PHID-DRYL-v5hvm4uw3pkslwtxmo32 |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 326 | PHID-DRYL-v5dbfvip4dbrfvpgt2ci |         95 |      1 |  NULL | NULL      | {"platform":"linux","path":"\/srv\/326"} |  1408276707 |   1408276785 |      0 | host         |
| 327 | PHID-DRYL-br6tp5ovit2wx2xv3ivq |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 328 | PHID-DRYL-tvcd55ilqzvaqm2hcrbj |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 329 | PHID-DRYL-2o5logjxouzc4nluldo4 |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 330 | PHID-DRYL-sk62cqyqibeovhdhcq5c |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 331 | PHID-DRYL-jbarszbjdjai7a2ckm6r |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 332 | PHID-DRYL-yspccndrzg637tpjie3h |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 333 | PHID-DRYL-i6dzwxqvgf5dyxwidety |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 334 | PHID-DRYL-b3n5zzjxhjtgzqxpch73 |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 335 | PHID-DRYL-htgce7ypptlbsdrm6kvy |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 336 | PHID-DRYL-hto7s4h2eojmd4otxgxd |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 337 | PHID-DRYL-qjilvvxxaksfuc3gzyds |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
| 338 | PHID-DRYL-4ryx74f56csh2265ukae |       NULL |      0 |  NULL | NULL      | {"platform":"linux"}                     |  1408276707 |   1408276707 |      0 | host         |
+-----+--------------------------------+------------+--------+-------+-----------+------------------------------------------+-------------+--------------+--------+--------------+

Sorry for all of the comments.

Evidently, acquireLease does not get called, because otherwise these leases would have a resource ID. The task queue is empty, and when running a bunch of daemons with bin/phd debug PhabricatorTaskmasterDaemon, none of them appear to crash. So I don't think this is a fatal error causing an early exit.

The only thing I can see in the allocator that would cause it to terminate or exit without throwing an exception and without running acquireLease is the return that occurs when no blueprints can allocate a resource. But there's nothing in the Drydock log to indicate it hits this code.

The log for a pending lease looks like this:

1 Blueprints Enabled
Found 1 Blueprints
0 Pending Resource(s) Remain
Found 0 Pending Resource(s)
0 Open Resource(s) Remain
Found 0 Open Resource(s)
Allocating Lease

So it's almost certainly hitting the locking code and then waiting forever. @epriestley, is there any reason why this code would deadlock or wait forever?

src/applications/drydock/worker/DrydockAllocatorWorker.php
199–200

Given that "1 Blueprints Enabled" shows in the logs for a pending lease (that is waiting forever), it is almost certainly deadlocked here.

You mentioned elsewhere that you moved away from PhutilGlobalLock to beginReadLocking() -- can you walk me through that?

That is, I would expect that PhutilGlobalLock is the correct lock type to use here and will be far simpler than using row locks.

So from memory, I started with something like:

$lock = PhabricatorGlobalLock::newLock('drydockallocation');
$lock->lock();

try {
  // ...
  $lock->unlock();
} catch (Exception $ex) {
  $lock->unlock();
  throw $ex; // re-throw so the failure isn't silently swallowed
}

But I soon found out that lock() is not blocking by default, and throws PhutilLockException if it can't acquire the lock. So I thought maybe this is a lock you have to wait for, so I then tried something like:

$lock = PhabricatorGlobalLock::newLock('drydockallocation');
while (true) {
  try {
    $lock->lock();
    break;
  } catch (PhutilLockException $ex) {
    // Wait a little while before trying to grab the lock again.
    usleep(rand(0, 500000));
  }
}

// ...

But then I started hitting "Lock '...' has already been locked by this process.", even after I restarted the daemons and web server to clear any previous locks. At that point I wasn't quite sure what I was doing wrong, so I tried beginReadLocking() and got 99% of the expected locking behaviour.

Use $lock->lock(123); to wait 123 seconds for the lock.

Oh okay, that explains why I had all those issues :)

(I'm used to languages / frameworks where the default is to block on locks, with optional non-blocking mode)
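
For reference, the blocking usage ends up looking roughly like this (the wait time is arbitrary and the lock name is the one from the earlier snippet):

$lock = PhabricatorGlobalLock::newLock('drydockallocation');

// With a wait time, lock() blocks for up to that many seconds instead of
// throwing PhutilLockException immediately when the lock is contended.
$lock->lock(15);

try {
  // ... critical allocator section ...
  $lock->unlock();
} catch (Exception $ex) {
  $lock->unlock();
  throw $ex;
}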

hach-que edited edge metadata.

Use PhabricatorGlobalLock. Will test this when I'm in the office since I don't want to be calling AWS APIs on a sketchy mobile internet connection.

Include the working directory fix in D10293 against the new EC2 allocator as well

Testing the new locking behaviour now, will report back when I've verified that it behaves correctly.

Will rebase on D10304 once that lands / the locking issue is solved.

Sending through a new diff.