Page MenuHomePhabricator

Plans: 2018 Week 41-44 Bonus Content
Closed, ResolvedPublic

Description

See PHI806. This is a sensitive request with moderate urgency.

See PHI937, which identifies an issue with packages(project).

See PHI944, where an unusually shaped change is requiring a very large amount of memory to parse.

See PHI936. We could smooth out some language on the Workboards "Point Limit" field when maniphest.points is not configured.

In PHI930, we uncovered a translation string issue:

./bin/phd log --id X
Usage Exception: [Invalid Translation!] The "en_US" language data offers variant translations for the plurality or gender of argument 1, but the value for that argument is not an integer, PhutilNumber, or PhutilPerson (it is a value of type "string"). Raw input: <No daemon(s) with id(s) "%s" exist!>.

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

See PHI904 for a Windows escaping mess.

See PHI841, which includes a request for richer transaction.search results for audits.

See PHI886, which has some more Java syntax highlighter escaping issues.

Drydock

See Also: T13073

See PHI570, which identifies some possible fundamental logic issues in Drydock.

See T13212, which requests an additional filter on Drydock Lease searches.

See PHI570, which asks questions about claims in T12145.

See PHI917, which discusses an issue where disabling bindings in Almanac should knock out Host resources.

See PHI755, which could use verification around WorkingCopy blueprint selection.

Details

Differential Revisions
D19746: Make the Harbormaster build plan edit engine forms configurable
Commits
D19771 / rPe26c4bddab33: Replace magical "branch" behavior in "diffusion.branchquery" with an explicit…
D19765 / rP798a391e5a90: Add test coverage for "%R" in qsprintf() and convert LiskDAO to support it
D19764 / rPHUcf96fd681e7d: Support "%R" (Database + Table Ref) in qsprintf(...)
D19762 / rPb950f877c50c: Allow Drydock Blueprints to control "supplemental allocation" behavior so all…
D19758 / rARC637c6584c6ea: (experimental) Work around a Windows escaping issue and security conecern in…
D19763 / rPHUf9a65ebb0e0c: Further refine the Java syntax lexer for escaped strings
D19761 / rP57b4b5981947: When a Drydock host based on an Almanac blueprint has its binding disabled…
D19760 / rP65e953658a32: Expose Audit actions for "transaction.search" in a basic way
D19759 / rP61ec43420808: Remove unicode marks for "Accept/Raise Concern" in Audit
D19758 / rARC83661809e532: Work around a Windows escaping issue and security conecern in "hg cat --output .
D19756 / rPHUc2a259398ff4: Provide users a hint that "KILL <process>" is a possible source of connection…
D19755 / rPa7c008708d04: Correct a mangled translation string in "bin/phd log --id X"
D19754 / rPf6122547d774: When a lease triggers a resource allocation for a resource which must activate…
D19753 / rP78ab675bd8aa: After a Drydock lease triggers a resource to be reclaimed, stop it from…
D19752 / rPe9309fdd6ada: When a Drydock lease schedules a resource to be reclaimed, awaken the lease…
D19750 / rP6deb09efcdb4: When leasing a Working Copy in Drydock, just "git reset --hard" so empty…
D19748 / rPe2cf1e42887a: Skip copied code detection for changes that are too large for it to be useful
D19747 / rPe0dea4c4862f: Fix `packages(project)` to work properly and add it to…
D19742 / rP0a51bc4f05bf: Add a space after "View Inline" in mail to prevent double-click on the filename…
D19744 / rP8bffc9ea0efe: In "bin/bulk export", require "--output <path>" by default
D19743 / rP4f54d483d5bf: Support export of revisions to Excel/CSV/JSON/etc
D19739 / rP4f557ff075ad: When using "bin/bulk export --overwrite", actually overwrite the file
D19738 / rP4928c34d00c2: Allow "bin/bulk export" to merge multiple queries and accept more flexible flags

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
epriestley raised the priority of this task from Low to Normal.Oct 5 2018, 11:19 PM
epriestley updated the task description. (Show Details)Oct 10 2018, 4:08 PM
epriestley renamed this task from Plans: 2018 Week 41 Bonus Content to Plans: 2018 Week 41-42 Bonus Content.Oct 16 2018, 2:32 PM
epriestley renamed this task from Plans: 2018 Week 41-42 Bonus Content to Plans: 2018 Week 41-43 Bonus Content.Oct 19 2018, 5:54 PM
epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)Oct 19 2018, 5:59 PM
epriestley updated the task description. (Show Details)Oct 19 2018, 6:22 PM
epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)Oct 19 2018, 6:51 PM
epriestley updated the task description. (Show Details)Oct 23 2018, 3:30 PM
epriestley updated the task description. (Show Details)Oct 23 2018, 8:24 PM
epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)Oct 23 2018, 8:56 PM

Drydock

I'm hoping to do a relatively small iteration here. The primary goal is to investigate PHI570, which reports some sketchy behavior. I generally want to unblock adopters of Drydock and fix critical weirdness but avoid improving it too much (i.e., avoid a real iteration on T13073) until we can clear other stuff out of the queue.

PHI570 specifically reports that one Host blueprint plus one Working Copy blueprint plus one a limit of 3 does this when building two different repositories:

  1. Allocate R1.
  2. Allocate R1.
  3. Allocate R1.
  4. Build arrives for R2.
  5. Reclaim/destroy resource from (1).
  6. Reclaim/destroy resource from (2).
  7. Reclaim/destroy resource from (3).
  8. Allocate R2.

Steps 1-5 and 8 are expected. Steps (6) and (7) are unexpected and extremely bad.

Building reproduction cases for Drydock stuff is currently a pain. I'd possibly like to bring the "Hoax" blueprints from D19102 upstream, maybe in developer mode only. I'm not sure they'll represent a useful simplification or not, but most outstanding Drydock algorithmic issues seem like they should be reproducible without actually needing real hosts and focus more on prioritization behaviors, locks, races, etc., all of which Hoaxes should be a suitable tool for developing against.

An argument against investing any real effort in Hoaxes is that I don't think I'll ever be confident enough with them to push back on an issue with a claim that I probably failed to reproduce it against Hoaxes. Another avenue would be to invest in a more robust EC2 blueprint and reduce the cost of building a real reproduction case. However, I'd like to avoid that for now.

From previous work, I already have functional Host and Working Copy blueprints locally so I'm likely punting on the Hoax/EC2 issue.

First up, bin/drydock lease makes it very hard to build ad-hoc working copy leases because it takes --attributes x=y,a=b but Working Copy resources are specified with a complex repositories.map. I think --attributes is underpowered and am just going to replace it with --attributes <path-to-JSON-file>. This makes working copies reasonable to bin/drydock lease and generally makes testing/debugging/repeating things a bit easier since you can collect lease attributes in one place. I'll upstream this shortly.

Next, I hit the ssh -o LogLevel=quiet issue from T13121#240184 again. I don't really have a great way forward here but -o LogLevel=quiet sure is garbage. Perhaps we'll give you an option to configure it and use StrictHostKeyChecking=no with stderr filtering if you don't. This is bad but dramatically less bad than -o LogLevel=quiet which is deeply horrible. We don't have a good way to proxy/filter stderr right now but I hate -o LogLevel=quiet so much that I may build one and upstream it. One argument in favor of this is that we could apply a similar filter to DiffusionMercurialCommandEngine::filterMercurialDebugOutput(...), which has a similar effect but can not process streams and can not be configured directly on the ExecFuture.

(We have FutureProxy, but it isn't powerful enough to proxy a full ExecFuture right now. I'm inclined to think that setStdout/StderrSink() is perhaps a better approach here than actually proxying the whole future.)

Next, I hit T13179 again. Yeesh. I rebased on my local hack-around for now.

With one Host blueprint and one Working Copy blueprint, I leased three spellbook working copies. Then I released the leases.

While releasing the leases, I accidentally released the wrong lease. This left a working copy resource without a host lease. We should have better safety checks around this; when the lease holder is a Drydock object we can refuse to release it outside of resource cleanup. I released all the Working Copy resources and rebuilt them, then released the correct leases.

NOTE: These leases take about 18s to activate -- we're likely missing a wakeup-from-yield somewhere?

So, starting state is like this:

Next, I leased a koans working copy. This appeared to be working:

<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Reclaimed Resources> Reclaimed resource Resource 64: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
...

The time spent in the first two cases ("Waiting for available resources...") is expected. We have a 3-minute backoff after update before a resource can be reclaimed.

However, it then failed:

...
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

<Lease Destroyed> Lease destroyed.
[2018-10-23 17:50:56] EXCEPTION: (Exception) Lease has already been destroyed! at [<phabricator>/src/applications/drydock/management/DrydockManagementLeaseWorkflow.php:198]
...

...because koans is empty and we run ... && git fetch && git reset --hard HEAD, but git reset --hard HEAD does not function in an empty repository because there is no HEAD. It looks like we can fix this by using git reset --hard instead of git reset --hard HEAD; the former has effectively the same meaning but works when the repository is empty. I'll upstream this change.

This failure happens during lease activation; the resource still builds. So now I have two spellbook working copies and one koans working copy.

I'm going to try to lease koans again (this is with the --reset patch):

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/99/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 99

Lease activated in 3,096ms.

So far so good. Let's try grabbing another koans lease without releasing anything. So the expectation is: one leased koans resource, two unleased spellbook resources. We should immediately reclaim a spellbook and turn it into a koans resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/100/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 65: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 100

Lease activated in 34,167ms.

That took longer than it should have (missing wakeups, per above?) but worked, reclaiming one of the spellbook resources and turning it into a koans.

Let's try the same thing again, this should convert the third resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/102/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 66: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 102

Lease activated in 34,209ms.

So that looks good.

The missing wakeup is likely that DrydockWorker->reclaimResource() does not awaken the task which started the reclaim. It reasonably can, and should. This isn't a completely trivial change so I'm unsure about upstreaming it, but it wouldn't be terribly difficult to fix.

epriestley updated the task description. (Show Details)Oct 24 2018, 1:03 PM

My theory on the Drydock behavior is that we're getting into an "A, A, A" state. A lease comes in that needs a "B", so it releases a resource.

If the release + destroy take more than 15 seconds, the "B" lease may get to wake up again and release another resource. If they take more than 30 seconds, it may get to do it again, and so on.

When a lease triggers a resource release, I'm going to stop it from triggering another release until the first release completes.

epriestley updated the task description. (Show Details)Oct 24 2018, 5:31 PM
epriestley updated the task description. (Show Details)Oct 24 2018, 5:35 PM

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

I used this script:

lock.php
<?php

require_once 'scripts/init/init-script.php';

$lock1 = PhabricatorGlobalLock::newLock('test');
$lock1->lock(120);

echo "Got lock. Press ^C to exit.\n";
while (true) {
  sleep(1);
}

I ran it in two windows, then used KILL <id> to kill the process trying to get the lock (the script from the second window) three times (because we retry on lost connection three times). That produced this explicit connection loss exception:

[2018-10-24 10:35:20] EXCEPTION: (AphrontConnectionLostQueryException) #2006: MySQL server has gone away

This error may occur if your MySQL 'wait_timeout' or 'max_allowed_packet' configuration values are set too low. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:304]
arcanist(head=experimental, ref.master=2650e8627a20, ref.experimental=9d59e9e5908e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=952881743ed3), libcore(), phabricator(head=master, ref.master=e2cf1e42887a, custom=5), phutil(head=master, ref.master=603209bb1756), services(head=master, ref.master=019a12a6556d)
  #0 AphrontBaseMySQLDatabaseConnection::throwCommonException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:348]
  #1 AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:289]
  #2 AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:185]
  #3 AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:8]
  #4 queryfx(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:13]
  #5 queryfx_all(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:19]
  #6 queryfx_one(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:143]
  #7 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
  #8 PhutilLock::lock(integer) called at [<phabricator>/lock.php:6]

So modern behavior here seems reasonably clear. I'll maybe broaden the error message a bit.

epriestley updated the task description. (Show Details)Oct 25 2018, 2:08 AM
epriestley updated the task description. (Show Details)Oct 25 2018, 11:11 AM
epriestley updated the task description. (Show Details)Oct 25 2018, 12:01 PM
epriestley updated the task description. (Show Details)Oct 25 2018, 12:02 PM
epriestley updated the task description. (Show Details)Oct 25 2018, 4:20 PM
epriestley updated the task description. (Show Details)Oct 25 2018, 4:32 PM
epriestley updated the task description. (Show Details)Oct 25 2018, 5:12 PM

See PHI755, which could use verification around WorkingCopy blueprint selection.

PHI755 has more details, but I verified that creating your own WorkingCopyBlueprintImplementation that provides an alternate way to build working-copy resources appears to be working as intended (i.e., this is supported and expected, and providing new types of blueprints which build existing types of resources like host and working-copy is part of the vision of Drydock). Here's a "Land Revision" working with a "+ Magic" working copy blueprint instead of a default upstream blueprint (the two are functionally very similar):

See PHI916. T11908 should move forward to the "migrate 700 callsites" stage, at least.

T11908 describes the current plan on putting multiple database connections on the same actual port-to-port connection.

Since this involves defining a new %R, it might make sense to try to define %P and alternatives for %Q in T6960 at the same time.

And if we're touching everything anyway, trying to fix T12678 sounds good too, maybe.

Of course, this ends up being kind of a lot of extra scope. I'm not sure how involved these steps will actually be.

The %R in T11908 is relatively less involved than everything else so maybe we don't try to do %P, %Q, or "numbers are actually numbers" for now.

epriestley renamed this task from Plans: 2018 Week 41-43 Bonus Content to Plans: 2018 Week 41-44 Bonus Content.Oct 26 2018, 1:09 PM
epriestley updated the task description. (Show Details)Nov 2 2018, 2:00 AM
epriestley renamed this task from Plans: 2018 Week 41-44 Bonus Content to Plans: 2018 Week 41-45 Bonus Content.Nov 6 2018, 12:18 PM
epriestley updated the task description. (Show Details)Nov 6 2018, 12:28 PM
epriestley updated the task description. (Show Details)Nov 6 2018, 12:33 PM
epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)Nov 6 2018, 12:36 PM
epriestley updated the task description. (Show Details)Nov 6 2018, 12:38 PM
epriestley updated the task description. (Show Details)Nov 6 2018, 12:40 PM
epriestley renamed this task from Plans: 2018 Week 41-45 Bonus Content to Plans: 2018 Week 41-44 Bonus Content.Nov 6 2018, 12:50 PM
epriestley closed this task as Resolved.
epriestley updated the task description. (Show Details)