
Plans: 2018 Week 41-44 Bonus Content
Closed, Resolved (Public)

Description

See PHI806. This is a sensitive request with moderate urgency.

See PHI937, which identifies an issue with packages(project).

See PHI944, where an unusually shaped change is requiring a very large amount of memory to parse.

See PHI936. We could smooth out some language on the Workboards "Point Limit" field when maniphest.points is not configured.

In PHI930, we uncovered a translation string issue:

./bin/phd log --id X
Usage Exception: [Invalid Translation!] The "en_US" language data offers variant translations for the plurality or gender of argument 1, but the value for that argument is not an integer, PhutilNumber, or PhutilPerson (it is a value of type "string"). Raw input: <No daemon(s) with id(s) "%s" exist!>.
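
For context, this class of error means a pht() callsite passed a plain string for an argument that the active translation pluralizes. A minimal sketch of the failure mode, with an illustrative callsite rather than the actual phd code:

<?php

require_once 'scripts/init/init-script.php';

$ids = array('123', '456');

// Illustrative of the failure mode above: if the "en_US" translation
// offers plural variants for this argument, passing a plain string for
// it raises "[Invalid Translation!]".
echo pht('No daemon(s) with id(s) "%s" exist!', implode(', ', $ids))."\n";

// Passing a countable value (PhutilNumber) for the plural argument is
// the usual way to keep plural variants working.
echo pht(
  '%s daemon(s) with id(s) %s exist!',
  new PhutilNumber(count($ids)),
  implode(', ', $ids))."\n";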

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

See PHI904 for a Windows escaping mess.

See PHI841, which includes a request for richer transaction.search results for audits.

See PHI886, which has some more Java syntax highlighter escaping issues.

Drydock

See Also: T13073

See PHI570, which identifies some possible fundamental logic issues in Drydock.

See T13212, which requests an additional filter on Drydock Lease searches.

See PHI570, which asks questions about claims in T12145.

See PHI917, which discusses an issue where disabling bindings in Almanac should knock out Host resources.

See PHI755, which could use verification around WorkingCopy blueprint selection.

Revisions and Commits

rPHU libphutil
D19764
D19763
D19756
rARC Arcanist
D19758
rP Phabricator
Accepted
D19771
D19765
D19762
D19761
D19760
D19759
D19755
D19754
D19753
D19752
D19750
D19748
D19747
D19742
D19744
D19743
D19739
D19738

Event Timeline

epriestley raised the priority of this task from Low to Normal. (Oct 5 2018, 11:19 PM)
epriestley renamed this task from "Plans: 2018 Week 41 Bonus Content" to "Plans: 2018 Week 41-42 Bonus Content". (Oct 16 2018, 2:32 PM)
epriestley renamed this task from "Plans: 2018 Week 41-42 Bonus Content" to "Plans: 2018 Week 41-43 Bonus Content". (Oct 19 2018, 5:54 PM)
epriestley updated the task description.

Drydock

I'm hoping to do a relatively small iteration here. The primary goal is to investigate PHI570, which reports some sketchy behavior. I generally want to unblock adopters of Drydock and fix critical weirdness but avoid improving it too much (i.e., avoid a real iteration on T13073) until we can clear other stuff out of the queue.

PHI570 specifically reports that one Host blueprint plus one Working Copy blueprint with a limit of 3 does this when building two different repositories:

  1. Allocate R1.
  2. Allocate R1.
  3. Allocate R1.
  4. Build arrives for R2.
  5. Reclaim/destroy resource from (1).
  6. Reclaim/destroy resource from (2).
  7. Reclaim/destroy resource from (3).
  8. Allocate R2.

Steps (1) through (5) and step (8) are expected. Steps (6) and (7) are unexpected and extremely bad.

Building reproduction cases for Drydock stuff is currently a pain. I'd possibly like to bring the "Hoax" blueprints from D19102 upstream, maybe in developer mode only. I'm not sure whether they'd represent a useful simplification, but most outstanding Drydock algorithmic issues seem like they should be reproducible without actually needing real hosts: they're mostly about prioritization behavior, locks, races, etc., all of which Hoaxes should be a suitable tool for developing against.

An argument against investing any real effort in Hoaxes is that I don't think I'll ever be confident enough in them to push back on a reported issue by claiming that I failed to reproduce it against Hoaxes. Another avenue would be to invest in a more robust EC2 blueprint and reduce the cost of building a real reproduction case. However, I'd like to avoid that for now.

From previous work, I already have functional Host and Working Copy blueprints locally, so I'm likely punting on the Hoax/EC2 issue.

First up, bin/drydock lease makes it very hard to build ad-hoc working copy leases because it takes --attributes x=y,a=b, but Working Copy resources are specified with a complex repositories.map. I think --attributes is underpowered, and I'm just going to replace it with --attributes <path-to-JSON-file>. This makes working copies reasonable to lease with bin/drydock lease, and generally makes testing, debugging, and repeating things a bit easier since you can collect lease attributes in one place. I'll upstream this shortly.
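
As a sketch of the new flag (the attribute keys below are illustrative guesses, not a documented schema; see the Working Copy blueprint for the real repositories.map format), a koans.json might be produced like this:

<?php

// Hypothetical example of writing a lease attribute file for:
//   ./bin/drydock lease --type working-copy --attributes koans.json
// The "repositories.map" keys are illustrative only.
$attributes = array(
  'repositories.map' => array(
    'koans' => array(
      'phid' => 'PHID-REPO-xxxxxxxxxxxxxxxxxxxx',
    ),
  ),
);

file_put_contents(
  'koans.json',
  json_encode($attributes, JSON_PRETTY_PRINT)."\n");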

Next, I hit the ssh -o LogLevel=quiet issue from T13121#240184 again. I don't really have a great way forward here, but -o LogLevel=quiet sure is garbage. Perhaps we'll give you an option to configure it, and use StrictHostKeyChecking=no with stderr filtering if you don't. This is bad, but dramatically less bad than -o LogLevel=quiet, which is deeply horrible. We don't have a good way to proxy/filter stderr right now, but I hate -o LogLevel=quiet so much that I may build one and upstream it. One argument in favor of this is that we could apply a similar filter to DiffusionMercurialCommandEngine::filterMercurialDebugOutput(...), which has a similar effect but cannot process streams and cannot be configured directly on the ExecFuture.

(We have FutureProxy, but it isn't powerful enough to proxy a full ExecFuture right now. I'm inclined to think that setStdout/StderrSink() is perhaps a better approach here than actually proxying the whole future.)
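
Purely as a design sketch of that direction (setStderrSink() does not exist today, and the filter shown is hypothetical):

// Hypothetical API: setStderrSink() is not a real ExecFuture method yet.
// The idea is to pass stderr through a filter instead of silencing it
// wholesale with "-o LogLevel=quiet".
$future = new ExecFuture('ssh -o StrictHostKeyChecking=no %s -- %s', $host, $command);
$future->setStderrSink(
  function ($bytes) {
    $lines = phutil_split_lines($bytes, true);
    foreach ($lines as $key => $line) {
      // Drop only the host key warning; let real errors through.
      if (preg_match('/^Warning: Permanently added /', $line)) {
        unset($lines[$key]);
      }
    }
    return implode('', $lines);
  });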

Next, I hit T13179 again. Yeesh. I rebased on my local hack-around for now.

With one Host blueprint and one Working Copy blueprint, I leased three spellbook working copies. Then I released the leases.

While releasing the leases, I accidentally released the wrong lease. This left a working copy resource without a host lease. We should have better safety checks around this; when the lease holder is a Drydock object, we can refuse to release it outside of resource cleanup. I released all the Working Copy resources and rebuilt them, then released the correct leases.
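
A rough sketch of the kind of guard this implies (placement and names here are assumptions, not the actual release workflow):

// Hypothetical check in the lease release path: if the lease is held by
// a Drydock resource, refuse to release it unless we're in resource
// cleanup. "$is_resource_cleanup" is an illustrative flag.
$owner_phid = $lease->getOwnerPHID();
$owner_type = phid_get_type($owner_phid);

if (($owner_type === DrydockResourcePHIDType::TYPECONST) && !$is_resource_cleanup) {
  throw new Exception(
    pht(
      'This lease is held by a Drydock resource; release the resource '.
      'instead of releasing the lease directly.'));
}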

NOTE: These leases take about 18s to activate -- we're likely missing a wakeup-from-yield somewhere?

So, starting state is like this:

Screen Shot 2018-10-23 at 5.49.15 PM.png (231×314 px, 19 KB)

Next, I leased a koans working copy. This appeared to be working:

<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Reclaimed Resources> Reclaimed resource Resource 64: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
...

The time spent in the first two cases ("Waiting for available resources...") is expected. We have a 3-minute backoff after update before a resource can be reclaimed.

However, it then failed:

...
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

<Lease Destroyed> Lease destroyed.
[2018-10-23 17:50:56] EXCEPTION: (Exception) Lease has already been destroyed! at [<phabricator>/src/applications/drydock/management/DrydockManagementLeaseWorkflow.php:198]
...

...because koans is empty and we run ... && git fetch && git reset --hard HEAD, but git reset --hard HEAD does not work in an empty repository because HEAD does not yet resolve to a commit. It looks like we can fix this by using git reset --hard instead of git reset --hard HEAD; the former has effectively the same meaning but also works when the repository is empty. I'll upstream this change.
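
The difference is easy to see against a repository with no commits (the first error is the one quoted above; output abbreviated):

$ git init --quiet empty && cd empty
$ git reset --hard HEAD
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
$ git reset --hard
$ echo $?
0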

This failure happens during lease activation; the resource still builds. So now I have two spellbook working copies and one koans working copy.

I'm going to try to lease koans again (this is with the --reset patch):

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/99/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 99

Lease activated in 3,096ms.

So far, so good. Let's try grabbing another koans lease without releasing anything. The expectation: with one leased koans resource and two unleased spellbook resources, we should immediately reclaim a spellbook and turn it into a koans resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/100/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 65: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 100

Lease activated in 34,167ms.

That took longer than it should have (missing wakeups, per above?) but worked, reclaiming one of the spellbook resources and turning it into a koans.

Let's try the same thing again; this should convert the third resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/102/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 66: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 102

Lease activated in 34,209ms.

So that looks good.

The missing wakeup is likely that DrydockWorker->reclaimResource() does not awaken the task which started the reclaim. It reasonably can, and should. This isn't a completely trivial change so I'm unsure about upstreaming it, but it wouldn't be terribly difficult to fix.
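
The shape of the fix would be roughly the following (a sketch; the names used to thread the task ID through are illustrative, not the real Drydock API):

// Hypothetical sketch: when reclaimResource() runs on behalf of a lease's
// worker task, record that task ID with the reclaim, then wake the task
// once destruction of the reclaimed resource completes, instead of
// letting it sleep through its full 15-second yield.
$awaken_ids = $reclaim->getAwakenTaskIDs();   // illustrative accessor
PhabricatorWorker::awakenTaskIDs($awaken_ids);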

My theory on the Drydock behavior is that we're getting into an "A, A, A" state. A lease comes in that needs a "B", so it releases a resource.

If the release + destroy take more than 15 seconds, the "B" lease may get to wake up again and release another resource. If they take more than 30 seconds, it may get to do it again, and so on.

When a lease triggers a resource release, I'm going to stop it from triggering another release until the first release completes.
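
A sketch of that guard (the attribute name and exact placement are hypothetical; the real change may track this differently):

// Hypothetical "one outstanding reclaim per lease" guard: before
// reclaiming another resource on behalf of a lease, check whether a
// resource it already triggered for release has finished destruction.
$reclaiming_phid = $lease->getAttribute('drydock.reclaimingPHID');
if ($reclaiming_phid) {
  // The previous release is still in flight; don't release another one.
  return;
}

$lease->setAttribute('drydock.reclaimingPHID', $resource->getPHID());
// ...release the resource, and clear the attribute once it is destroyed.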

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

I used this script:

lock.php
<?php

require_once 'scripts/init/init-script.php';

$lock1 = PhabricatorGlobalLock::newLock('test');
$lock1->lock(120);

echo "Got lock. Press ^C to exit.\n";
while (true) {
  sleep(1);
}

I ran it in two windows, then used KILL <id> to kill the connection of the process waiting for the lock (the script from the second window) three times (because we retry on a lost connection three times). That produced this explicit connection loss exception:

[2018-10-24 10:35:20] EXCEPTION: (AphrontConnectionLostQueryException) #2006: MySQL server has gone away

This error may occur if your MySQL 'wait_timeout' or 'max_allowed_packet' configuration values are set too low. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:304]
arcanist(head=experimental, ref.master=2650e8627a20, ref.experimental=9d59e9e5908e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=952881743ed3), libcore(), phabricator(head=master, ref.master=e2cf1e42887a, custom=5), phutil(head=master, ref.master=603209bb1756), services(head=master, ref.master=019a12a6556d)
  #0 AphrontBaseMySQLDatabaseConnection::throwCommonException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:348]
  #1 AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:289]
  #2 AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:185]
  #3 AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:8]
  #4 queryfx(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:13]
  #5 queryfx_all(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:19]
  #6 queryfx_one(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:143]
  #7 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
  #8 PhutilLock::lock(integer) called at [<phabricator>/lock.php:6]

So modern behavior here seems reasonably clear. I'll maybe broaden the error message a bit.

See PHI755, which could use verification around WorkingCopy blueprint selection.

PHI755 has more details, but I verified that creating your own WorkingCopyBlueprintImplementation that provides an alternate way to build working-copy resources appears to be working as intended (i.e., this is supported and expected, and providing new types of blueprints which build existing types of resources like host and working-copy is part of the vision of Drydock). Here's a "Land Revision" working with a "+ Magic" working copy blueprint instead of a default upstream blueprint (the two are functionally very similar):

Screen Shot 2018-10-25 at 10.00.02 AM.png (882×1 px, 121 KB)

See PHI916. T11908 should move forward to the "migrate 700 callsites" stage, at least.

T11908 describes the current plan for putting multiple database connections on the same actual port-to-port connection.

Since this involves defining a new %R, it might make sense to try to define %P and alternatives for %Q in T6960 at the same time.

And if we're touching everything anyway, trying to fix T12678 sounds good too, maybe.

Of course, this ends up being kind of a lot of extra scope. I'm not sure how involved these steps will actually be.

The %R change in T11908 is less involved than everything else, so maybe we don't try to do %P, %Q, or "numbers are actually numbers" for now.
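
For reference, a rough sketch of what a migrated callsite might look like under the T11908 plan (the conversion is still a plan here, so the exact semantics are an assumption):

// Sketch only: "%R" would render a qualified database.table reference
// from an object that knows both names, replacing callers that build the
// reference by hand with "%T" and a hard-coded database.
queryfx(
  $conn,
  'SELECT * FROM %R WHERE repositoryPHID = %s',
  $table_ref,   // object providing database + table names
  $repository_phid);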

epriestley renamed this task from "Plans: 2018 Week 41-43 Bonus Content" to "Plans: 2018 Week 41-44 Bonus Content". (Oct 26 2018, 1:09 PM)
epriestley renamed this task from "Plans: 2018 Week 41-44 Bonus Content" to "Plans: 2018 Week 41-45 Bonus Content". (Nov 6 2018, 12:18 PM)
epriestley renamed this task from "Plans: 2018 Week 41-45 Bonus Content" to "Plans: 2018 Week 41-44 Bonus Content". (Nov 6 2018, 12:50 PM)
epriestley closed this task as Resolved.
epriestley updated the task description.