Page MenuHomePhabricator

Plans: Harbormaster Interactions with Differential / Chaos Reduction
Closed, ResolvedPublic

Description

Not actually very chaotic:

  • See PHI281. See T10746. See T11154. Aborted builds are not propagating up to influence buildable status properly.
  • See T8475. This has been in the wild for a year.
  • See PHI346. This is a poor interaction between drafts and --plan-changes.

Race-related chaos:

  • See PHI309. Possible race between Harbormaster and Differential, or maybe a weird interaction between autoplans and drafts.
  • See PHI302. Probably a race between Differential and Harbormaster?
  • See PHI268. This looks like a race. We could also reasonably publish more build information into the timeline.
  • See PHI332. See T12714. This is almost certainly an easily reproducible race which can be avoided without much work.

Chaos I can't reproduce:

  • See PHI335. This reports an Owners package being added for the wrong repository.

Errata:

  • Buildables stuck in "building" despite an aborted build should get a best-effort migration if nothing else moots that (T10746).
  • BuildCommand is now just a bad version of BuildMessage.
  • arc diff --only sticks in "preparing" forever.

Event Timeline

epriestley triaged this task as Normal priority.Jan 30 2018, 8:53 PM
epriestley created this task.
epriestley updated the task description. (Show Details)Jan 31 2018, 4:48 PM
epriestley renamed this task from Plans: Harbormaster Interactions with Differential to Plans: Harbormaster Interactions with Differential / Chaos Reduction.Feb 10 2018, 7:09 PM
epriestley updated the task description. (Show Details)Feb 10 2018, 7:13 PM
epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)Feb 10 2018, 7:35 PM
epriestley updated the task description. (Show Details)Feb 10 2018, 7:50 PM

On the races:

PHI332 / T12714 describe cases where two transactions from different sources appear in the transaction log at the same time, but the edit for one fails to apply. These are almost certainly simple races where two editors of the same type edit the same object at the same time (e.g., API edit conflicts with web edit). We don't currently read-lock most objects, and over time it's more likely that races will happen since edits have become a little more complex, there are more ways to trigger edits (Herald, API, bulk editor, Harbormaster), and more total edits occur. This is likely very easy to reproduce by, e.g., adding a sleep(5) between the edit start and commit and firing two conflicting edits off with the API at the same time. Read locking everything (proposed in T12714) is one likely fix; formally locking objects with GlobalLock is another.

The next three likely arise from issues at the Harbormaster/Differential boundary. They may be entangled. It is possible these are special cases of PHI332.

  • PHI268 describes a case where a build failed to get a containerPHID, and encountered some other irregularities. I think this is likely also a race, but may be more difficult to pin down. I plan to begin by reviewing the containerPHID code, which I recall being suspicious.
  • PHI302 describes a case where a build "passed" before all the targets generated. This may or may not be a race.
  • PHI309 describes a case where a build with failing auto-targets (lint/unit with excuses) does not automatically leave draft mode and doesn't get a build status message even though the build transitions to a completed state.

Finally, PHI312 has a detailed description of a Drydock race which is probably not related to the other situations.

For PHI332, my test case was:

  • Add sleep(10) to PhabricatorApplicationTransactionEditor->applyTransactions() before applyInitialEffects().
  • In two windows, submit maniphest.edit API calls via the web UI, one applying a title edit and one applying a description edit.
  • Send them at about the same time.
  • Observe: they both hang for about 10 seconds, they both appear in the transaction log, but only one actually applies.

In making read locks unconditional, I'm removing this method:

/**
 * Determine if the editor should hold a read lock on the object while
 * applying a transaction.
 *
 * If the editor does not hold a lock, two editors may read an object at the
 * same time, then apply their changes without any synchronization. For most
 * transactions, this does not matter much. However, it is important for some
 * transactions. For example, if an object has a transaction count on it, both
 * editors may read the object with `count = 23`, then independently update it
 * and save the object with `count = 24` twice. This will produce the wrong
 * state: the object really has 25 transactions, but the count is only 24.
 *
 * Generally, transactions fall into one of four buckets:
 *
 *   - Append operations: Actions like adding a comment to an object purely
 *     add information to its state, and do not depend on the current object
 *     state in any way. These transactions never need to hold locks.
 *   - Overwrite operations: Actions like changing the title or description
 *     of an object replace the current value with a new value, so the end
 *     state is consistent without a lock. We currently do not lock these
 *     transactions, although we may in the future.
 *   - Edge operations: Edge and subscription operations have internal
 *     synchronization which limits the damage race conditions can cause.
 *     We do not currently lock these transactions, although we may in the
 *     future.
 *   - Update operations: Actions like incrementing a count on an object.
 *     These operations generally should use locks, unless it is not
 *     important that the state remain consistent in the presence of races.
 *
 * @param   PhabricatorLiskDAO  Object being updated.
 * @param   PhabricatorApplicationTransaction Transaction being applied.
 * @return  bool                True to synchronize the edit with a lock.
 */
protected function shouldReadLock(
  PhabricatorLiskDAO $object,
  PhabricatorApplicationTransaction $xaction) {
  return false;
}

Edits in this class (like title and description edits) are "overwrite operations". Although many edits aren't overwrites (e.g., adding comments is an "append"), there are a whole lot of different types of overwrites, so I'm just making read locks unconditional. If they cause some kind of issue (very unusual edits that can somehow end up in a deadlock?) we could let transactions opt out of them.

To verify the behavior of this change, I did this:

  • With the sleep(10) in place:
  • In two windows, submit maniphest edit API calls in the web UI, one with a title edit and one with a description edit.
  • Send them at about the same time.
  • Observe:
    • They now apply serially: the first one completes after 10 seconds; the second after 20 seconds.
    • While they are applying, the task can be reloaded freely; the first edit appears after 10 seconds.
    • Both edits appear in the transaction log and the task properties.

I also fiddled around with FOR UPDATE (this is what the "read lock" mechanism is, internally) queries to make sure I was remembering their behavior properly.

The consequences I expect as a result of this change are:

  • Edits which hit the lock will execute more slowly because they now execute sequentially: if two users submitted simultaneous changes, both edits would previously apply simultaneously. They will now apply sequentially, so the second user will wait (normally, this should only be a few hundred milliseconds at worst) for the first edit to complete. I expect these to be relatively rare and the additional latency to escape notice by users, but we could experiment with opting the common known-append-only edits (comment, subscribe/edge) out of the locks if this introduces perceptible slowness. The scope of writes these transactions perform could also be reduced.
  • These locks only protect cooperating actors. Although these are a dying breed, some endpoints perform writes directly and just call save(). If an edit doesn't use read locks itself (i.e., does not SELECT ... FOR UPDATE), it is allowed to select immediately, without waiting. These edits can still cause fields to be overwritten by reading a stale version of the object, then attempting to update it after the locked update applies and restoring the object to the old state. These edits are rare, and probably should not exist in most cases, so the likely remedy is to upgrade them to modern transactions.
  • It is unlikely, but possible, that this will introduce deadlocks. A deadlock would probably look like this:
    • A user mentions Txxx on Dxxx.
    • The Dxxx editor locks the revision row FOR UPDATE.
    • The sub-editor making the inverse update for "Txxx was mentioned on Dxxx" locks the task for FOR UPDATE. Today, this lock is always on a separate connection in a separate transaction.
    • The task sub-editor applies an uncoordinated edit to the revision row for some reason. Offhand, I can't think of any reason why this would happen.
    • This deadlocks: the first connection has the revision row locked FOR UPDATE and the second connection is waiting on that lock to clear before its update fires. This is an application-side deadlock and MySQL will not be able to detect it, so the request will appear to hang until the query timeout (or execution timeout) is reached.

Unit tests caught one issue with D19059 in the password upgrade engine. The code looked like this:

$object->setSomeProperty('abc');

// ....

$editor->applyTransactions($object, $xactions);

That is, we mutated the object outside of the Editor, then relied on the Editor to save the mutation.

This no longer works with a read lock, because the object is reloaded inside the editor and unsaved changes from earlier in the process are discarded.

Although it's possible we have more of these, I believe this flow is particularly unusual. In this case, adding $object->save(); after the mutation was a reasonable approach, although any other cases which turn up may be better tackled in some other way (ideally, by making the mutation happen inside the transaction engine).

For the Harbormaster races, I'm looking at the containerPHID case first since this is a relatively concrete clue (PHI268).

The containerPHID for diff builds is currently set in DifferentialTransactionEditor as a side effect of performing the DifferentialTransaction::TYPE_UPDATE transaction. Here's the suspicious block:

// TODO: This can race with diff updates, particularly those from
// Harbormaster. See discussion in T8650.
$diff->setRevisionID($object->getID());
$diff->save();

// Update Harbormaster to set the containerPHID correctly for any
// existing buildables. We may otherwise have buildables stuck with
// the old (`null`) container.

// TODO: This is a bit iffy, maybe we can find a cleaner approach?
// In particular, this could (rarely) be overwritten by Harbormaster
// workers.
$table = new HarbormasterBuildable();
$conn_w = $table->establishConnection('w');
queryfx(
  $conn_w,
  'UPDATE %T SET containerPHID = %s WHERE buildablePHID = %s',
  $table->getTableName(),
  $object->getPHID(),
  $diff->getPHID());

Since containerPHID wasn't populated, my guess is that this code raced with something else modifying the Buildable:

THREAD A              THREAD B (DifferentialTransactionEditor)
--------              --------

Load Buildable
                      Update container PHID
Save Buildable

The only possible identity for "THREAD A" should be HarbormasterBuildEngine.

At HEAD, DifferentialTransactionEditor holds a read lock for the revision. HarbormasterBuildEngine holds a global lock for the Buildable. Neither respect the other's lock.

The behavior of DifferentialTransactionEditor is in the wrong here from a lock perspective.

But, I'm confused about why we need the containerPHID update at all in the modern workflow. When the buildable is created by Herald, the containerPHID should always be available at creation time. Although it's possible to create a buildable directly against a diff via the CLI or via "Run Build", that comes out as a manual buildable and the buildable in question isn't a manual buildable.

I'm going to try to hunt this down. My expectation is that:

  • All normal buildables should be created with containerPHID set properly already in the modern code. The data referenced by PHI268 implies this isn't true.
  • We can just get rid of this sketchy containerPHID update: if you run a manual build against a diff without a revision attached, it's OK to not give it a container retroactively if the diff is later attached.

All normal buildables should be created with containerPHID set properly already in the modern code.

Ah, no, this isn't true: the buildable is actually created by harbormaster.queryautotargets reporting lint and unit results. No revision exists yet so containerPHID can't be set.

It's still unclear to me how we can process TYPE_UPDATE while HarbormasterBuildEngine is running and before the Buildable has a containerPHID. harbormaster.queryautotargets can end up running HarbormasterBuildEngine, but arc runs it synchronously before running differential.createrevision synchronously.

The first TYPE_UPDATE should set the containerPHID. No builds should be able to queue yet.

For PHI268, here are the actual diffs for the affected revision:

+--------------------------------+-------------+----------------------------+---------------+
| phid                           | dateCreated | FROM_UNIXTIME(dateCreated) | ModifiedDelta |
+--------------------------------+-------------+----------------------------+---------------+
| PHID-DIFF-6d5qhvwsuj4gna6nkjcc |  1513760690 | 2017-12-20 09:04:50        |             7 |
| PHID-DIFF-zbim6tobl5qu5zwkuxbv |  1514331282 | 2017-12-26 23:34:42        |             7 |
| PHID-DIFF-hyj2enzzeg526bsrkubm |  1514851224 | 2018-01-02 00:00:24        |            17 |
+--------------------------------+-------------+----------------------------+---------------+

Here's the transaction log:

+--------------------------------+-------------+----------------------------+---------------------------------+----------------------------------+
| phid                           | dateCreated | FROM_UNIXTIME(dateCreated) | transactionType                 | diffPHID                         |
+--------------------------------+-------------+----------------------------+---------------------------------+----------------------------------+
| PHID-XACT-DREV-gqjmuryttkv5f37 |  1513760697 | 2017-12-20 09:04:57        | core:create                     | NULL                             |
| PHID-XACT-DREV-ahx57rbbp47reo2 |  1513760697 | 2017-12-20 09:04:57        | differential:update             | "PHID-DIFF-6d5qhvwsuj4gna6nkjcc" |
| PHID-XACT-DREV-zcz4qv76us6bhw4 |  1513760697 | 2017-12-20 09:04:57        | differential.revision.title     | NULL                             |
| PHID-XACT-DREV-wztl3koqegfohbb |  1513760697 | 2017-12-20 09:04:57        | differential.revision.summary   | NULL                             |
| PHID-XACT-DREV-nskyist3zb2gpfq |  1513760697 | 2017-12-20 09:04:57        | differential.revision.testplan  | NULL                             |
| PHID-XACT-DREV-wypzzqbxydiawco |  1513760697 | 2017-12-20 09:04:57        | differential.revision.reviewers | NULL                             |
| PHID-XACT-DREV-5e7alj5wb66o3ll |  1513760697 | 2017-12-20 09:04:57        | differential.revision.reviewers | NULL                             |
| PHID-XACT-DREV-vswc3ndufs6d7bf |  1513766449 | 2017-12-20 10:40:49        | differential.revision.accept    | NULL                             |
| PHID-XACT-DREV-cjy46vmozi6rebr |  1513766449 | 2017-12-20 10:40:49        | core:comment                    | NULL                             |
| PHID-XACT-DREV-ilijivaptyyjxxh |  1513766449 | 2017-12-20 10:40:49        | differential.revision.status    | NULL                             |
| PHID-XACT-DREV-o7jumkx24567lrw |  1513772995 | 2017-12-20 12:29:55        | differential.revision.summary   | NULL                             |
| PHID-XACT-DREV-wrqfosroahov7ds |  1513773009 | 2017-12-20 12:30:09        | differential.revision.summary   | NULL                             |
| PHID-XACT-DREV-bdqlkhtwlh2gnyy |  1513801899 | 2017-12-20 20:31:39        | differential.revision.reviewers | NULL                             |
| PHID-XACT-DREV-t4nl47hjow537oy |  1513868478 | 2017-12-21 15:01:18        | differential.revision.reject    | NULL                             |
| PHID-XACT-DREV-zbqk77rfmhrcrvr |  1513868478 | 2017-12-21 15:01:18        | core:comment                    | NULL                             |
| PHID-XACT-DREV-th6xlarzyev56cx |  1513868478 | 2017-12-21 15:01:18        | differential.revision.status    | NULL                             |
| PHID-XACT-DREV-tr2hlfeuabaf7lr |  1514331278 | 2017-12-26 23:34:38        | core:comment                    | NULL                             |
| PHID-XACT-DREV-tvnztdueh7b6s5b |  1514331289 | 2017-12-26 23:34:49        | differential:update             | "PHID-DIFF-zbim6tobl5qu5zwkuxbv" |
| PHID-XACT-DREV-q5uzt23zey7gz6n |  1514331289 | 2017-12-26 23:34:49        | differential.revision.void      | NULL                             |
| PHID-XACT-DREV-3e5j27op5qiorfj |  1514331289 | 2017-12-26 23:34:49        | core:comment                    | NULL                             |
| PHID-XACT-DREV-tdxucmjf72edc7e |  1514331289 | 2017-12-26 23:34:49        | harbormaster:buildable          | NULL                             |
| PHID-XACT-DREV-gd5rtzacggx2omi |  1514331563 | 2017-12-26 23:39:23        | differential.revision.resign    | NULL                             |
| PHID-XACT-DREV-f5yxutvk3ocun5b |  1514331563 | 2017-12-26 23:39:23        | differential.revision.status    | NULL                             |
| PHID-XACT-DREV-a3gzmm56kllk7wm |  1514589297 | 2017-12-29 23:14:57        | differential.revision.accept    | NULL                             |
| PHID-XACT-DREV-avjvisqo3u7wmhp |  1514589297 | 2017-12-29 23:14:57        | core:comment                    | NULL                             |
| PHID-XACT-DREV-vcbfopptew6hznk |  1514851241 | 2018-01-02 00:00:41        | differential:update             | "PHID-DIFF-hyj2enzzeg526bsrkubm" |
| PHID-XACT-DREV-4kiw2xpgxygrxtq |  1514851241 | 2018-01-02 00:00:41        | differential.revision.close     | NULL                             |
+--------------------------------+-------------+----------------------------+---------------------------------+----------------------------------+

Here's the additional data on the harbormaster.buildable update:

oldValue: "building"
newValue: "passed"
metadata: {"harbormaster:buildablePHID":"PHID-HMBB-dfscxfciclben4inwcsa"}

Here are the buildables:

+--------------------------------+--------------------------------+--------------------------------+-----------------+-------------+----------------------------+--------------+-----------------------------+-------------------+
| phid                           | buildablePHID                  | containerPHID                  | buildableStatus | dateCreated | FROM_UNIXTIME(dateCreated) | dateModified | FROM_UNIXTIME(dateModified) | isManualBuildable |
+--------------------------------+--------------------------------+--------------------------------+-----------------+-------------+----------------------------+--------------+-----------------------------+-------------------+
| PHID-HMBB-clmqvcmwy5yww6b3vdxt | PHID-DIFF-6d5qhvwsuj4gna6nkjcc | PHID-DREV-uhp5ct4fhkidcva7sq3t | building        |  1513760696 | 2017-12-20 09:04:56        |   1513760702 | 2017-12-20 09:05:02         |                 0 |
| PHID-HMBB-dfscxfciclben4inwcsa | PHID-DIFF-zbim6tobl5qu5zwkuxbv | NULL                           | building        |  1514331288 | 2017-12-26 23:34:48        |   1514331291 | 2017-12-26 23:34:51         |                 0 |
+--------------------------------+--------------------------------+--------------------------------+-----------------+-------------+----------------------------+--------------+-----------------------------+-------------------+

Here are the builds:

+--------------------------------+--------------------------------+-------------+-------------+----------------------------+--------------+-----------------------------+-------------+--------------------------------+
| phid                           | buildablePHID                  | buildStatus | dateCreated | FROM_UNIXTIME(dateCreated) | dateModified | from_unixtime(dateModified) | planAutoKey | initiatorPHID                  |
+--------------------------------+--------------------------------+-------------+-------------+----------------------------+--------------+-----------------------------+-------------+--------------------------------+
| PHID-HMBD-bcfkmcfntmflqmj6simq | PHID-HMBB-clmqvcmwy5yww6b3vdxt | building    |  1513760697 | 2017-12-20 09:04:57        |   1513760702 | 2017-12-20 09:05:02         | NULL        | PHID-HRUL-max3hepcntcxmwc3pv2g |
| PHID-HMBD-njcujb6as3bgqwkdvcn5 | PHID-HMBB-clmqvcmwy5yww6b3vdxt | passed      |  1513760696 | 2017-12-20 09:04:56        |   1513760697 | 2017-12-20 09:04:57         | arcanist    | PHID-USER-ihd52hbqxget7abquort |
| PHID-HMBD-dthtlz4kjnfy4borlnyg | PHID-HMBB-dfscxfciclben4inwcsa | building    |  1514331289 | 2017-12-26 23:34:49        |   1514331291 | 2017-12-26 23:34:51         | NULL        | PHID-HRUL-max3hepcntcxmwc3pv2g |
| PHID-HMBD-4mphhemrbv72c2fj34hj | PHID-HMBB-dfscxfciclben4inwcsa | passed      |  1514331288 | 2017-12-26 23:34:48        |   1514331289 | 2017-12-26 23:34:49         | arcanist    | PHID-USER-ihd52hbqxget7abquort |
+--------------------------------+--------------------------------+-------------+-------------+----------------------------+--------------+-----------------------------+-------------+--------------------------------+

Here are the targets:

+--------------------------------+--------------------------------+------------------------------------------------+----------------+-------------+---------------+-------------+--------------+
| phid                           | buildPHID                      | className                                      | targetStatus   | dateStarted | dateCompleted | dateCreated | dateModified |
+--------------------------------+--------------------------------+------------------------------------------------+----------------+-------------+---------------+-------------+--------------+
| PHID-HMBT-6ore6jvmcvd2a7fswq2i | PHID-HMBD-4mphhemrbv72c2fj34hj | HarbormasterArcUnitBuildStepImplementation     | target/passed  |  1514331288 |    1514331289 |  1514331288 |   1514331289 |
| PHID-HMBT-ct332j546ngz3yut2ue2 | PHID-HMBD-4mphhemrbv72c2fj34hj | HarbormasterArcLintBuildStepImplementation     | target/passed  |  1514331288 |    1514331289 |  1514331288 |   1514331289 |
| PHID-HMBT-y4iwg5rskps5f5w6acwl | PHID-HMBD-bcfkmcfntmflqmj6simq | HarbormasterHTTPRequestBuildStepImplementation | target/waiting |  1513760703 |          NULL |  1513760702 |   1513760703 |
| PHID-HMBT-7spw5xvdp3kych6rd7h2 | PHID-HMBD-dthtlz4kjnfy4borlnyg | HarbormasterHTTPRequestBuildStepImplementation | target/waiting |  1514331291 |          NULL |  1514331291 |   1514331291 |
| PHID-HMBT-dc3qwluq6ssqxwlizflj | PHID-HMBD-njcujb6as3bgqwkdvcn5 | HarbormasterArcUnitBuildStepImplementation     | target/passed  |  1513760696 |    1513760697 |  1513760696 |   1513760697 |
| PHID-HMBT-arbphzdpisif44ykiclh | PHID-HMBD-njcujb6as3bgqwkdvcn5 | HarbormasterArcLintBuildStepImplementation     | target/passed  |  1513760696 |    1513760697 |  1513760696 |   1513760697 |
+--------------------------------+--------------------------------+------------------------------------------------+----------------+-------------+---------------+-------------+--------------+

Here are the messages:

+-------+--------------------------------+--------------------------------+------+------------+-------------+--------------+
| id    | authorPHID                     | buildTargetPHID                | type | isConsumed | dateCreated | dateModified |
+-------+--------------------------------+--------------------------------+------+------------+-------------+--------------+
| 53857 | PHID-USER-ihd52hbqxget7abquort | PHID-HMBT-6ore6jvmcvd2a7fswq2i | pass |          1 |  1514331289 |   1514331289 |
| 53295 | PHID-USER-ihd52hbqxget7abquort | PHID-HMBT-arbphzdpisif44ykiclh | pass |          1 |  1513760696 |   1513760697 |
| 53856 | PHID-USER-ihd52hbqxget7abquort | PHID-HMBT-ct332j546ngz3yut2ue2 | pass |          1 |  1514331289 |   1514331289 |
| 53296 | PHID-USER-ihd52hbqxget7abquort | PHID-HMBT-dc3qwluq6ssqxwlizflj | pass |          1 |  1513760696 |   1513760697 |
+-------+--------------------------------+--------------------------------+------+------------+-------------+--------------+

Here are the irregularities that jump out to me for the PHI268 data:

The first build never completed. This is consistent with the remote build system just never responding. It's not immediately clear from the original report whether the remote build system claims it responded or not. If we assume it did not, this is explicable.

The second build reported a pass even though the target remained in "waiting". The pass was reported 1 second after the buildable was created. This was two seconds before the build target was created:

EventPHIDEpochDelta
Buildable CreatedPHID-HMBB-dfscxfciclben4inwcsa1514331288+0
Buildable Pass1514331289+1
Build CreatedPHID-HMBD-dthtlz4kjnfy4borlnyg1514331289+1
Target CreatedPHID-HMBT-7spw5xvdp3kych6rd7h21514331291+3

The clock doesn't have enough resolution to determine if "Build Pass" or "Build Created" happened first.

This is consistent with the report in PHI302. There, the build reported a pass immediately and the remote build system actually responded 151 seconds later. For PHI302, the sequence was:

EventPHIDEpochDelta
Buildable CreatedPHID-HMBB-a5aif4jaqffe7nfcwhxe1516327162+0
Build CreatedPHID-HMBD-rik5mjpye4smug277yp51516327163+1
Buildable Pass1516327164+2
Target CreatedPHID-HMBT-jn3wm7wx3g22lfc7tegx1516327165+3
Remote Message Received1516327315+153
Build Fail1516327316+154

The sequence there suggests that the build creation happened first, although this isn't necessarily reliable.

In both cases, the builds are being processed before all build/target creation finishes. We likely lack a state flag to indicate "builds are still queueing for this buildable". This is probably easy to reproduce by putting a sleep() before build creation, creating a revision, and then manually running bin/harbormaster update <buildable> in a second window.

The second irregularity is that containerPHID for the second buildable is NULL. This remains perplexing to me.

The NULL is explicable if we can somehow queue a HarbormasterBuildWorker between the time it is created by harbormaster.queryautotargets and the time the TYPE_UPDATE applies.

And, as it turns out, this can happen in harbormaster.sendmessage.

So I think this is what's happening:

  1. arc calls harbormaster.queryautotargets which creates the buildable with no containerPHID yet (since the revision doesn't exist). This runs a BuildWorker inline, but it's synchronous and the action by arc is also synchronous, so nothing can race yet. arc is our first thread.
  2. arc calls harbormaster.sendmessage to update the build with lint and unit results. After posting the messages, the API method queues a HarbormasterBuildWorker to actually apply the updates. This is in the daemon queue asynchronously and can execute at any time, so this is our second thread.
  3. arc now calls differential.createrevision. This is synchronous with arc so it's conceptually still in the first thread. This applies the TYPE_UPDATE, sets the containerPHID, may trigger more builds, etc.

If the BuildWorker from step (2) executes before step (3), nothing bad happens (except: maybe it interacts weirdly with drafts).

If the BuildWorker from step (2) executes after step (3), nothing bad happens.

If the BuildWorker starts executing, then step (3) fires while it's executing, we can get a NULL containerPHID. We can also get a phantom "passed" build where lint and unit have passed but real builds haven't triggered yet. This can happen even if containerPHID is not set because we identify where to publish the build by going through the Diff, which may have the correct revisionID even when the Buildable has no containerPHID.

So I think this explains the behavior in PHI268 and PHI302, if combined with the remote build system legitimately failing to make a callback for the build in PHI268, which is plausible since that issue involved two builds and all evidence points to the first one legitimately not reporting back.

epriestley added a comment.EditedFeb 12 2018, 4:51 PM

This deadlocks: the first connection has the revision row locked FOR UPDATE and the second connection is waiting on that lock to clear before its update fires. This is an application-side deadlock and MySQL will not be able to detect it, so the request will appear to hang until the query timeout (or execution timeout) is reached.

This actually can't deadlock without further mischief: the non-coordinating update still goes over the first connection, which is the connection with the FOR UPDATE lock, so this will complete.

This change still risks deadlocks in theory, but I'm hard-pressed to come up with a sequence which would actually deadlock given how the application behaves.

In any case, I think we're left with two issues:

First, the containerPHID update should be made by HarbormasterBuildWorker, not by some other non-coordinating process.

Second, we don't want pass/fail from before a diff is attached to a revision to carry over after it is attached. There are several ways we could accomplish this, I think. In one approach, we prevent the buildable from completing into pass/fail before we have a revision:

  • We could not queue any updates when messages are sent to autotargets in harbormaster.sendmessage, and assume the first time a diff buildable will get an update is after it is attached to a revision.
  • A little more robustly, we could consider buildables with only autobuilds to be in a limbo state until real builds arrive.
  • Or we could consider diff buildables to be in limbo until they're attached to revisions.
  • Or we could start buildables in a "preparing" state and require some kind of explicit trigger to let them complete.

I think these are all roughly equivalent, although they differ a bit in semantics and robustness.

In another approach, we could reset things:

  • We let the buildable complete normally.
  • When BuildWorker sets the containerPHID, we kick the build back into building, then run the state transition logic again.

This all feels very muddy to me. I think there are a couple of larger design questions here:

  • What is the future of autotargets?
  • Does building a diff (vs a revision) ever make sense?

What is the future of autotargets?

Autotargets are this sort of weird hack/generalization to let arc push lint/unit results into Harbormaster without needing to configure anything, and without everything else in the pipeline needing special logic. The upstream use cases don't really care which of these solutions is selected since they're all largely equivalent.

I think there are probably very few other reasonable use cases for autotargets. I think one possible use case was to accommodate results from build-on-push systems which could not be triggered, but most of the popular ones now support triggering (e.g., Travis can be triggered via API now). Even if we imagine that some "Travship2" system becomes extremely popular tomorrow and can not trigger builds, newer upstream features like drafts mean that we really need to know what builds we're expecting by the time revision creation finishes. You could conceivably accommodate "Travship2" by making a dummy request to an intermediary server with "wait for response" so that we know we're expecting a result, then submitting the result back when it arrives. I think I describe this in more detail somewhere, although I can't immediately find it.

This is complex, but "draft" means that it's pretty much necessary now.

The other vaguely plausible use case is triggering services from arc instead of from Phabricator, so arc starts a build and then the build reports to Phabricator. I think this is probably sketchy and unlikely to be the best design given staging areas.

You can also always approximate an autotarget with a real target, although we'd need to expose a little more API surface area for this.

I think the upshot is that it's probably fine for "autotargets" to be a very thin/leaky layer above lint/unit and not a first-class mechanism, and maybe to eventually just be absorbed into more general API methods.

Does building a diff (vs a revision) ever make sense?

I think the answer is "probably not", but it seems a little silly to prevent it if we don't have to.


So I think this leaves us with a "PREPARING" buildable state ("waiting for all the builds to be scheduled") that buildables start in by default, and can not transition out of until they receive an "all done scheduling builds" message. This message is sent to each "PREPARING" buildable when Herald runs.

Triggering a buildable via the web console or the CLI tool will immediately send "all done scheduling builds".

Triggering a buildable via harbormaster.queryautotargets won't send this message, and diffs won't send this message (because they don't run Herald in the right way, but also because that's what we want).

This means that building just a diff stops working. Probably, nothing really does this now, but it would be nice for arc to be able to say "okay, and I'm not going to attach this to a revision, so you can finish the build" when arc diff --only finishes.

We should also probably (?) combine this with a state reset so that this works completely consistently:

  • Create a diff with arc diff --only.
  • Attach it to a revision via the web UI.

This is further complicated because HarbormasterBuildMessage already exists (but is sent to BuildTarget, not Build), and we need a new message bus for Buildable, which doesn't currently have one. We also already have harbormaster.sendmessage but we need another very similar method for sending messages to buildables. We need this message bus for the containerPHID anyway, though.

So we end up with something like this, I think:

  • New message bus for Buildable: HarbormasterBuildableMessage -- although it would be nice to just reuse the existing message bus and let messages on the bus be sent to either receiver.
  • The containerPHID update happens over this bus as an "attach" message.
  • Buildable gets a new PREPARING state ("figuring out what to build...") which is the default initial state.
  • Buildable transitions to BUILDING via a new message ("all done scheduling builds").
  • BuildWorker or some similar daemon needs to be able to update just a buildable via BuildEngine.
  • CLI and web builds send "all done scheduling builds" immediately.
  • TransactionEditor sends "all done scheduling builds" after it schedules Herald builds.
  • For now, Diffs never complete scheduling on their own. This will be the default behavior without any additional changes. This will leave arc diff --only buildables in a "PREPARING" state indefinitely. We can refine this later.
    • We can deal with letting harbormaster.sendmessage send buildable messages later, too: the only one it needs to send is "done scheduling things" after arc diff --only.

Before I start making these changes, I'm going to verify that I can actually reproduce this race. I expect to:

  • Add a 5-second sleep to arc before it calls differential.createrevision. This will let the BuildWorker start after lint/unit results are sent.
  • Add a 10-second sleep to BuildWorker after it assesses target statuses but before it updates the build. This will let the revision creation happen.
  • When the dust settles, I expect the build to be "passed" (based on the autotargets only, despite having open builds triggered by Herald) with a null containerPHID, and for it to publish this improper "pass" to the revision, which is consistent with the issues observed in PHI268 and PHI302.
ftdysa added a subscriber: ftdysa.Feb 12 2018, 6:10 PM

When the dust settles, I expect

There are a couple more tricks here (when a build triggers, the Buildable goes back into "building") but I was pretty much able to reproduce everything with the sleeps I described, plus a sleep before HarbormasterBuildable::applyBuildPlans().

epriestley updated the task description. (Show Details)Feb 12 2018, 8:37 PM
epriestley updated the task description. (Show Details)Feb 12 2018, 10:02 PM
epriestley closed this task as Resolved.Feb 12 2018, 11:01 PM
epriestley updated the task description. (Show Details)

See T13072 for Harbormaster followups.

See T13073 for PHI312 and related Drydock followups.

That is, we mutated the object outside of the Editor, then relied on the Editor to save the mutation.

Removing dashboard panels also has one of these, see PHI385.