
Plans: 2018 Week 41-44 Bonus Content
Closed, Resolved (Public)

Description

See PHI806. This is a sensitive request with moderate urgency.

See PHI937, which identifies an issue with packages(project).

See PHI944, where an unusually shaped change is requiring a very large amount of memory to parse.

See PHI936. We could smooth out some language on the Workboards "Point Limit" field when maniphest.points is not configured.

In PHI930, we uncovered a translation string issue:

./bin/phd log --id X
Usage Exception: [Invalid Translation!] The "en_US" language data offers variant translations for the plurality or gender of argument 1, but the value for that argument is not an integer, PhutilNumber, or PhutilPerson (it is a value of type "string"). Raw input: <No daemon(s) with id(s) "%s" exist!>.
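
For context, this class of error means a pht() callsite passed a plain string for an argument that the active translation pluralizes. A minimal sketch of the failure mode, with an illustrative callsite rather than the actual phd code:

<?php

require_once 'scripts/init/init-script.php';

$ids = array('123', '456');

// Illustrative of the failure mode above: if the "en_US" translation
// offers plural variants for this argument, passing a plain string for
// it raises "[Invalid Translation!]".
echo pht('No daemon(s) with id(s) "%s" exist!', implode(', ', $ids))."\n";

// Passing a countable value (PhutilNumber) for the plural argument is
// the usual way to keep plural variants working.
echo pht(
  '%s daemon(s) with id(s) %s exist!',
  new PhutilNumber(count($ids)),
  implode(', ', $ids))."\n";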

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

See PHI904 for a Windows escaping mess.

See PHI841, which includes a request for richer transaction.search results for audits.

See PHI886, which has some more Java syntax highlighter escaping issues.

Drydock

See Also: T13073

See PHI570, which identifies some possible fundamental logic issues in Drydock.

See T13212, which requests an additional filter on Drydock Lease searches.

See PHI570, which asks questions about claims in T12145.

See PHI917, which discusses an issue where disabling bindings in Almanac should knock out Host resources.

See PHI755, which could use verification around WorkingCopy blueprint selection.

Revisions and Commits

rPHU libphutil
D19764
D19763
D19756
rARC Arcanist
D19758
rP Phabricator
Accepted
D19771
D19765
D19762
D19761
D19760
D19759
D19755
D19754
D19753
D19752
D19750
D19748
D19747
D19742
D19744
D19743
D19739
D19738

Event Timeline

epriestley raised the priority of this task from Low to Normal. (Oct 5 2018, 11:19 PM)
epriestley renamed this task from "Plans: 2018 Week 41 Bonus Content" to "Plans: 2018 Week 41-42 Bonus Content". (Oct 16 2018, 2:32 PM)
epriestley renamed this task from "Plans: 2018 Week 41-42 Bonus Content" to "Plans: 2018 Week 41-43 Bonus Content". (Oct 19 2018, 5:54 PM)
epriestley updated the task description.

Drydock

I'm hoping to do a relatively small iteration here. The primary goal is to investigate PHI570, which reports some sketchy behavior. I generally want to unblock adopters of Drydock and fix critical weirdness but avoid improving it too much (i.e., avoid a real iteration on T13073) until we can clear other stuff out of the queue.

PHI570 specifically reports that one Host blueprint plus one Working Copy blueprint with a limit of 3 does this when building two different repositories:

  1. Allocate R1.
  2. Allocate R1.
  3. Allocate R1.
  4. Build arrives for R2.
  5. Reclaim/destroy resource from (1).
  6. Reclaim/destroy resource from (2).
  7. Reclaim/destroy resource from (3).
  8. Allocate R2.

Steps (1) through (5) and step (8) are expected. Steps (6) and (7) are unexpected and extremely bad.

Building reproduction cases for Drydock stuff is currently a pain. I'd possibly like to bring the "Hoax" blueprints from D19102 upstream, maybe in developer mode only. I'm not sure whether they'd represent a useful simplification, but most outstanding Drydock algorithmic issues seem like they should be reproducible without actually needing real hosts: they're mostly about prioritization behavior, locks, races, etc., all of which Hoaxes should be a suitable tool for developing against.

An argument against investing any real effort in Hoaxes is that I don't think I'll ever be confident enough in them to push back on a reported issue by claiming that I failed to reproduce it against Hoaxes. Another avenue would be to invest in a more robust EC2 blueprint and reduce the cost of building a real reproduction case. However, I'd like to avoid that for now.

From previous work, I already have functional Host and Working Copy blueprints locally, so I'm likely punting on the Hoax/EC2 issue.

First up, bin/drydock lease makes it very hard to build ad-hoc working copy leases because it takes --attributes x=y,a=b, but Working Copy resources are specified with a complex repositories.map. I think --attributes is underpowered, and I'm just going to replace it with --attributes <path-to-JSON-file>. This makes working copies reasonable to lease with bin/drydock lease, and generally makes testing, debugging, and repeating things a bit easier since you can collect lease attributes in one place. I'll upstream this shortly.
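
As a sketch of the new flag (the attribute keys below are illustrative guesses, not a documented schema; see the Working Copy blueprint for the real repositories.map format), a koans.json might be produced like this:

<?php

// Hypothetical example of writing a lease attribute file for:
//   ./bin/drydock lease --type working-copy --attributes koans.json
// The "repositories.map" keys are illustrative only.
$attributes = array(
  'repositories.map' => array(
    'koans' => array(
      'phid' => 'PHID-REPO-xxxxxxxxxxxxxxxxxxxx',
    ),
  ),
);

file_put_contents(
  'koans.json',
  json_encode($attributes, JSON_PRETTY_PRINT)."\n");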

Next, I hit the ssh -o LogLevel=quiet issue from T13121#240184 again. I don't really have a great way forward here, but -o LogLevel=quiet sure is garbage. Perhaps we'll give you an option to configure it, and use StrictHostKeyChecking=no with stderr filtering if you don't. This is bad, but dramatically less bad than -o LogLevel=quiet, which is deeply horrible. We don't have a good way to proxy/filter stderr right now, but I hate -o LogLevel=quiet so much that I may build one and upstream it. One argument in favor of this is that we could apply a similar filter to DiffusionMercurialCommandEngine::filterMercurialDebugOutput(...), which has a similar effect but cannot process streams and cannot be configured directly on the ExecFuture.

(We have FutureProxy, but it isn't powerful enough to proxy a full ExecFuture right now. I'm inclined to think that setStdout/StderrSink() is perhaps a better approach here than actually proxying the whole future.)
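
Purely as a design sketch of that direction (setStderrSink() does not exist today, and the filter shown is hypothetical):

// Hypothetical API: setStderrSink() is not a real ExecFuture method yet.
// The idea is to pass stderr through a filter instead of silencing it
// wholesale with "-o LogLevel=quiet".
$future = new ExecFuture('ssh -o StrictHostKeyChecking=no %s -- %s', $host, $command);
$future->setStderrSink(
  function ($bytes) {
    $lines = phutil_split_lines($bytes, true);
    foreach ($lines as $key => $line) {
      // Drop only the host key warning; let real errors through.
      if (preg_match('/^Warning: Permanently added /', $line)) {
        unset($lines[$key]);
      }
    }
    return implode('', $lines);
  });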

Next, I hit T13179 again. Yeesh. I rebased on my local hack-around for now.

With one Host blueprint and one Working Copy blueprint, I leased three spellbook working copies. Then I released the leases.

While releasing the leases, I accidentally released the wrong lease. This left a working copy resource without a host lease. We should have better safety checks around this; when the lease holder is a Drydock object, we can refuse to release it outside of resource cleanup. I released all the Working Copy resources and rebuilt them, then released the correct leases.
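
A rough sketch of the kind of guard this implies (placement and names here are assumptions, not the actual release workflow):

// Hypothetical check in the lease release path: if the lease is held by
// a Drydock resource, refuse to release it unless we're in resource
// cleanup. "$is_resource_cleanup" is an illustrative flag.
$owner_phid = $lease->getOwnerPHID();
$owner_type = phid_get_type($owner_phid);

if (($owner_type === DrydockResourcePHIDType::TYPECONST) && !$is_resource_cleanup) {
  throw new Exception(
    pht(
      'This lease is held by a Drydock resource; release the resource '.
      'instead of releasing the lease directly.'));
}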

NOTE: These leases take about 18s to activate -- we're likely missing a wakeup-from-yield somewhere?

So, starting state is like this:

Screen Shot 2018-10-23 at 5.49.15 PM.png (231×314 px, 19 KB)

Next, I leased a koans working copy. This appeared to be working:

<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting For Resource> Waiting for available resources from: Blueprint 2: Working Copies.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Reclaimed Resources> Reclaimed resource Resource 64: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
...

The time spent in the first two cases ("Waiting for available resources...") is expected. We have a 3-minute backoff after update before a resource can be reclaimed.

However, it then failed:

...
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

<Lease Destroyed> Lease destroyed.
[2018-10-23 17:50:56] EXCEPTION: (Exception) Lease has already been destroyed! at [<phabricator>/src/applications/drydock/management/DrydockManagementLeaseWorkflow.php:198]
...

...because koans is empty and we run ... && git fetch && git reset --hard HEAD, but git reset --hard HEAD does not work in an empty repository because HEAD does not yet resolve to a commit. It looks like we can fix this by using git reset --hard instead of git reset --hard HEAD; the former has effectively the same meaning but also works when the repository is empty. I'll upstream this change.
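
The difference is easy to see against a repository with no commits (the first error is the one quoted above; output abbreviated):

$ git init --quiet empty && cd empty
$ git reset --hard HEAD
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
$ git reset --hard
$ echo $?
0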

This failure happens during lease activation; the resource still builds. So now I have two spellbook working copies and one koans working copy.

I'm going to try to lease koans again (this is with the --reset patch):

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/99/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 99

Lease activated in 3,096ms.

So far, so good. Let's try grabbing another koans lease without releasing anything. The expectation: with one leased koans resource and two unleased spellbook resources, we should immediately reclaim a spellbook and turn it into a koans resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/100/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 65: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 100

Lease activated in 34,167ms.

That took longer than it should have (missing wakeups, per above?) but worked, reclaiming one of the spellbook resources and turning it into a koans.

Let's try the same thing again; this should convert the third resource:

$ ./bin/drydock lease --type working-copy --attributes koans.json 
Queued lease for activation:

        http://local.phacility.com/drydock/lease/102/

Waiting for daemons to activate lease...
<Lease Queued> Lease queued for acquisition.
<Reclaimed Resources> Reclaimed resource Resource 66: Working Copy.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Waiting for Activation> Waiting 15 seconds for lease to activate.
<Lease Acquired> Lease acquired.
<Lease Activated> Lease activated.
Activation complete. This lease is permanent until manually released with:

        $ ./bin/drydock release-lease --id 102

Lease activated in 34,209ms.

So that looks good.

The missing wakeup is likely that DrydockWorker->reclaimResource() does not awaken the task which started the reclaim. It reasonably can, and should. This isn't a completely trivial change so I'm unsure about upstreaming it, but it wouldn't be terribly difficult to fix.
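
The shape of the fix would be roughly the following (a sketch; the names used to thread the task ID through are illustrative, not the real Drydock API):

// Hypothetical sketch: when reclaimResource() runs on behalf of a lease's
// worker task, record that task ID with the reclaim, then wake the task
// once destruction of the reclaimed resource completes, instead of
// letting it sleep through its full 15-second yield.
$awaken_ids = $reclaim->getAwakenTaskIDs();   // illustrative accessor
PhabricatorWorker::awakenTaskIDs($awaken_ids);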

My theory on the Drydock behavior is that we're getting into an "A, A, A" state. A lease comes in that needs a "B", so it releases a resource.

If the release + destroy take more than 15 seconds, the "B" lease may get to wake up again and release another resource. If they take more than 30 seconds, it may get to do it again, and so on.

When a lease triggers a resource release, I'm going to stop it from triggering another release until the first release completes.
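
A sketch of that guard (the attribute name and exact placement are hypothetical; the real change may track this differently):

// Hypothetical "one outstanding reclaim per lease" guard: before
// reclaiming another resource on behalf of a lease, check whether a
// resource it already triggered for release has finished destruction.
$reclaiming_phid = $lease->getAttribute('drydock.reclaimingPHID');
if ($reclaiming_phid) {
  // The previous release is still in flight; don't release another one.
  return;
}

$lease->setAttribute('drydock.reclaimingPHID', $resource->getPHID());
// ...release the resource, and clear the attribute once it is destroyed.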

See PHI943. We should verify the modern behavior of PhabricatorGlobalLock in the presence of KILL <id> on the GET_LOCK() connection. I believe it is clear and explicit after D19702, which made the catch (...) more narrowly scoped.

I used this script:

lock.php
<?php

require_once 'scripts/init/init-script.php';

$lock1 = PhabricatorGlobalLock::newLock('test');
$lock1->lock(120);

echo "Got lock. Press ^C to exit.\n";
while (true) {
  sleep(1);
}

I ran it in two windows, then used KILL <id> to kill the connection of the process waiting for the lock (the script from the second window) three times (because we retry on a lost connection three times). That produced this explicit connection loss exception:

[2018-10-24 10:35:20] EXCEPTION: (AphrontConnectionLostQueryException) #2006: MySQL server has gone away

This error may occur if your MySQL 'wait_timeout' or 'max_allowed_packet' configuration values are set too low. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:304]
arcanist(head=experimental, ref.master=2650e8627a20, ref.experimental=9d59e9e5908e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=952881743ed3), libcore(), phabricator(head=master, ref.master=e2cf1e42887a, custom=5), phutil(head=master, ref.master=603209bb1756), services(head=master, ref.master=019a12a6556d)
  #0 AphrontBaseMySQLDatabaseConnection::throwCommonException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:348]
  #1 AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:289]
  #2 AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:185]
  #3 AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:8]
  #4 queryfx(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:13]
  #5 queryfx_all(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phutil>/src/xsprintf/queryfx.php:19]
  #6 queryfx_one(AphrontMySQLiDatabaseConnection, string, string, double) called at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:143]
  #7 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
  #8 PhutilLock::lock(integer) called at [<phabricator>/lock.php:6]

So modern behavior here seems reasonably clear. I'll maybe broaden the error message a bit.

See PHI755, which could use verification around WorkingCopy blueprint selection.

PHI755 has more details, but I verified that creating your own WorkingCopyBlueprintImplementation that provides an alternate way to build working-copy resources appears to be working as intended (i.e., this is supported and expected, and providing new types of blueprints which build existing types of resources like host and working-copy is part of the vision of Drydock). Here's a "Land Revision" working with a "+ Magic" working copy blueprint instead of a default upstream blueprint (the two are functionally very similar):

Screen Shot 2018-10-25 at 10.00.02 AM.png (882×1 px, 121 KB)

See PHI916. T11908 should move forward to the "migrate 700 callsites" stage, at least.

T11908 describes the current plan for putting multiple database connections on the same actual port-to-port connection.

Since this involves defining a new %R, it might make sense to try to define %P and alternatives for %Q in T6960 at the same time.

And if we're touching everything anyway, trying to fix T12678 sounds good too, maybe.

Of course, this ends up being kind of a lot of extra scope. I'm not sure how involved these steps will actually be.

The %R change in T11908 is less involved than everything else, so maybe we don't try to do %P, %Q, or "numbers are actually numbers" for now.
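
For reference, a rough sketch of what a migrated callsite might look like under the T11908 plan (the conversion is still a plan here, so the exact semantics are an assumption):

// Sketch only: "%R" would render a qualified database.table reference
// from an object that knows both names, replacing callers that build the
// reference by hand with "%T" and a hard-coded database.
queryfx(
  $conn,
  'SELECT * FROM %R WHERE repositoryPHID = %s',
  $table_ref,   // object providing database + table names
  $repository_phid);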

epriestley renamed this task from "Plans: 2018 Week 41-43 Bonus Content" to "Plans: 2018 Week 41-44 Bonus Content". (Oct 26 2018, 1:09 PM)
epriestley renamed this task from "Plans: 2018 Week 41-44 Bonus Content" to "Plans: 2018 Week 41-45 Bonus Content". (Nov 6 2018, 12:18 PM)
epriestley renamed this task from "Plans: 2018 Week 41-45 Bonus Content" to "Plans: 2018 Week 41-44 Bonus Content". (Nov 6 2018, 12:50 PM)
epriestley closed this task as Resolved.
epriestley updated the task description.