
Add objectPHID to Harbormaster workers to make root cause analysis easier
Closed, Resolved (Public)

Description

I've run into this issue 3 times over the course of ~3,000 builds, and only ever in production. I can't reproduce it on development systems, but the way these builds are scheduled and run is wholly homogeneous, and it works the vast majority of the time.

I'll air the dirty laundry up front: these builds are not scheduled by Herald or repository operations, which I suspect are the only two officially supported methods. Instead, they are scheduled like this:

$buildable = HarbormasterBuildable::initializeNewBuildable($queue->getAgent())
  ->setBuildablePHID($integration_plan->getHarbormasterBuildablePHID())
  ->setContainerPHID($integration_plan->getHarbormasterContainerPHID())
  ->save();

$build = $buildable->applyPlan($queue->getBuildPlan(), [], $submission->getSubmitterPHID());

An "integration plan" is a DAO object conforming to HarbormasterBuildableInterface. If this sounds like we've wandered too far into the deep end and this isn't something the upstream will support, that's understandable. However, the way this issue manifests suggests to me that it's strictly a Harbormaster / Drydock problem, and that it's probably going to flare up during the course of normal operations on regular installs too.

Symptoms:

  • Buildable created properly
  • No errors in daemon logs
  • "Restarts" says zero for the build
  • Build status is "pending", no steps in the build plan have started
  • First step in the build plan is "lease working copy", which has been configured correctly via Almanac etc., no trickery
  • The first step doesn't actually show up under the build as having started
  • Simply restarting the build fixes everything

Our builds sit in a queue for verification before they are applied to master, and in some cases the line for that queue can get long, so when this happens it holds up the line and doesn't inform anyone that there might be a problem. The build usually sits there for a while until we notice it and hit restart.

Given the relatively nebulous nature of this ticket, I'm totally willing to try to track this down myself; I guess I'm just wondering if you can point me in the direction of likely causes?

Event Timeline

yelirekim claimed this task.
yelirekim updated the task description.
yelirekim added projects: Drydock, Harbormaster.
yelirekim added subscribers: yelirekim, epriestley.

It's vaguely possible that openTransaction() is the culprit. applyPlan() queues a daemon task, and I could imagine the race might be:

  • Daemons pick task up immediately.
  • They query for the buildable or targets, etc.
  • They can't see them since they aren't inside the transaction.
  • Tasks fail permanently.

This should leave some traces in the logs, though.
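
To make that ordering concrete, here's a minimal sketch of the suspected interleaving (the variables are hypothetical, and it assumes the queued task row lives in the separate phabricator_worker database, so it commits independently of the open transaction):

$buildable->openTransaction();  // BEGIN on the application database connection.
$buildable->save();             // INSERT: visible only inside this transaction.

// applyPlan() queues a HarbormasterBuildWorker task. The task row is written
// to the worker database on a different connection, so it is committed and
// claimable immediately.
$build = $buildable->applyPlan($plan, array(), $initiator_phid);

// <-- A taskmaster can claim the task here, query for the buildable/build,
//     find nothing, and fail.

$buildable->saveTransaction();  // COMMIT: the rows finally become visible.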

If you can't find any such traces and/or removing openTransaction() doesn't fix things, we can probably give you more insight into workers, etc., in the UI. I think daemon queue task metadata is in a better place today than it was when Harbormaster was first written.

Oh, you removed openTransaction(), so that's probably not the culprit.

Let me post the whole block; I had trimmed the transaction code because it didn't seem relevant:

public function createPlan(
  CISubmitQueue $queue,
  PhabricatorRepositoryCommit $commit,
  CISubmitQueueSubmission $submission) {

  $integration_plan = (new CISubmitQueueIntegrationPlan)
    ->setSubmissionID($submission->getID())
    ->attachSubmission($submission)
    ->setCommitPHID($commit->getPHID())
    ->attachCommit($commit);

  if ($submission->getDiffPHID()) {
    $diff = $submission->getDiff();
    $integration_plan
      ->setDiffPHID($diff->getPHID())
      ->attachDiff($diff);
  } elseif ($submission->getMergeIdentifier()) {
    $integration_plan->setMergeIdentifier($submission->getMergeIdentifier());
  }

  // First transaction opens here...
  $integration_plan
    ->openTransaction()
    ->save();

  // ...and a second one here. Neither commits until the end of the method.
  $buildable = HarbormasterBuildable::initializeNewBuildable($queue->getAgent())
    ->setBuildablePHID($integration_plan->getHarbormasterBuildablePHID())
    ->setContainerPHID($integration_plan->getHarbormasterContainerPHID())
    ->openTransaction()
    ->save();

  // This queues the daemon task while both transactions are still open.
  $build = $buildable->applyPlan($queue->getBuildPlan(), [], $submission->getSubmitterPHID());

  $submission->attachActivePlan($integration_plan);

  $plan_xaction = (new CISubmitQueueSubmissionTransaction)
    ->setOldValue(null)
    ->setTransactionType(CISubmitQueueSubmissionTransaction::TYPE_PLAN);

  $this->getSubmissionEditor()
    ->applyTransactions($submission, [
      $plan_xaction,
    ]);

  // Both transactions commit only here, after the task has been queued.
  $buildable->saveTransaction();
  return $integration_plan->saveTransaction();
}

DAOs prefixed with CISubmitQueue live in their own database. There have never been any problems with referential integrity, and I've never seen anything in the database indicating we'd written records incorrectly. But you may be right about the tasks being written before the plan is written. I suspect that would cause errors, though; I can try to reproduce that first in order to rule out our specifics being the problem.

Yeah, I think that can race. Here's an example -- open two connections to MySQL, mysql1> and mysql2>. In the first window do this:

mysql1> begin;
Query OK, 0 rows affected (0.00 sec)

mysql1> INSERT INTO harbormaster_scratchtable (data, dateCreated, dateModified) VALUES ("quack", 0, 0);
Query OK, 1 row affected (0.00 sec)

mysql1> SELECT LAST_INSERT_ID();
+------------------+
| LAST_INSERT_ID() |
+------------------+
|                2 |
+------------------+
1 row in set (0.00 sec)

Now try to find that row in the second window:

mysql2> select * from harbormaster_scratchtable where id = 2;
Empty set (0.00 sec)

The second connection can't see the row yet, because it hasn't been committed.

Commit in the first window:

mysql1> commit;
Query OK, 0 rows affected (0.00 sec)

Now the second window can see the row:

mysql2> select * from harbormaster_scratchtable where id = 2;
+----+-------+-------------+--------------+---------+----------------+
| id | data  | dateCreated | dateModified | bigData | nonmutableData |
+----+-------+-------------+--------------+---------+----------------+
|  2 | quack |           0 |            0 | NULL    | NULL           |
+----+-------+-------------+--------------+---------+----------------+
1 row in set (0.00 sec)

In this case, the potential race is that the worker runs SELECT * FROM harbormaster_buildable WHERE phid = ... before createPlan() commits the transaction. This window should be very narrow, but since you're invoking an Editor (which will have to do at least a bit of work) it seems not-totally-implausible that you hit it 1/1000 times.

Instead, try either removing all the transaction code (at the risk of leaving unattached IntegrationPlan objects on error), or committing both transactions before calling applyPlan(). I'm not sure this is really the issue, but I think there's a sound theoretical basis for the race existing, even if another issue is the actual culprit.
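
For instance, committing up front might look something like this (a sketch against the createPlan() above; error handling and the submission Editor call are omitted):

// Commit the plan before anything can queue a worker task.
$integration_plan->openTransaction();
$integration_plan->save();
$integration_plan->saveTransaction();

// Save the buildable with no transaction open (autocommit)...
$buildable = HarbormasterBuildable::initializeNewBuildable($queue->getAgent())
  ->setBuildablePHID($integration_plan->getHarbormasterBuildablePHID())
  ->setContainerPHID($integration_plan->getHarbormasterContainerPHID())
  ->save();

// ...so every row is already visible when a daemon claims the queued task.
$build = $buildable->applyPlan(
  $queue->getBuildPlan(),
  [],
  $submission->getSubmitterPHID());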

Specifically, if you put a sleep(10) directly after applyPlan(), I would expect that to trigger the race ~100% of the time as long as the queue isn't too full. You could see if the behavior is similar to what you've observed.
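
Concretely, the test just widens the race window (createPlan() otherwise unchanged, its transactions still open):

$build = $buildable->applyPlan(
  $queue->getBuildPlan(),
  [],
  $submission->getSubmitterPHID());

// Hold both transactions open; any taskmaster that claims the task during
// this window should see only uncommitted rows and fail.
sleep(10);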

If there's a different failure mode, that rules this out, but we'll still have fixed a bug that might otherwise come back to haunt you 1 in 100,000 times or something.

This perfectly recreates the symptoms, except it generates an error in the daemon log:

[2015-11-18 21:45:54] EXCEPTION: (PhutilProxyException) Permanent failure while executing Task ID 1142. {>} (PhabricatorWorkerPermanentFailureException) Invalid build ID "11". at [<phabricator>/src/applications/harbormaster/worker/HarbormasterBuildWorker.php:38]
arcanist(custom=1), integrator(head=master, ref.master=0c76adde6b1e), integrator-client(), phabricator(), phutil()
  #0 <#2> HarbormasterBuildWorker::loadBuild() called at [<phabricator>/src/applications/harbormaster/worker/HarbormasterBuildWorker.php:20]
  #1 <#2> HarbormasterBuildWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
  #2 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
  #3 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
  #4 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
  #5 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

I'm positive that instances of this occurring don't create errors in our daemon logs. I'm significantly more confused now than when I posted this.

Oh, I lied: I checked the old logs, and this error was present the last two times it occurred, but not when it happened this evening.

We can leave this at "reduced expected incidence rate to < 1/3k", and I'll come back here if it happens again, unless you have any more obvious ideas.

My next step would be to put more crumbs in the UI, but we're missing a little data right now. Basically, I'd do this:

In HarbormasterBuildable->applyPlan(), and the two other places where HarbormasterBuildWorkers get queued, pass the third $options parameter with objectPHID:

    PhabricatorWorker::scheduleTask(
      'HarbormasterBuildWorker',
      array(
        'buildID' => $build->getID(),
      ),
+    array(
+      'objectPHID' => $build->getPHID(),
+    ));

That will end up in the objectPHID column of phabricator_worker.worker_activetask, and in the corresponding column of phabricator_worker.worker_archivetask if the task completes. You can query for it the next time this occurs to try to figure out what happened with the workers, and we could potentially put it in the UI somewhere to make the process easier.
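
For example, once the objectPHID is being recorded, queries along these lines (with a made-up placeholder PHID) should turn the task up in one of the two tables:

-- Check the active queue first, then the archive.
SELECT * FROM phabricator_worker.worker_activetask
WHERE objectPHID = 'PHID-HMBD-exampleexampleexam';

SELECT * FROM phabricator_worker.worker_archivetask
WHERE objectPHID = 'PHID-HMBD-exampleexampleexam';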

You can also theoretically find the task today by table scanning worker_taskdata for JSON substrings in the data column -- you're looking for "buildID":12345 or similar (note that 12345 might be either a string or an int in the JSON representation). If you can find it, the id column will point at a row in either the active or archive table.
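
A sketch of that scan, using the example ID from above and matching both JSON spellings:

SELECT id, data FROM phabricator_worker.worker_taskdata
WHERE data LIKE '%"buildID":12345%'
   OR data LIKE '%"buildID":"12345"%';

The id that comes back can then be looked up in the dataID column of the active and archive tables.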

This stuff probably won't lead directly to an answer, but might give us some more clues (does the task exist, did it error or succeed, how long did it execute for, etc).

Based on my memory, I think it's possible that exception is the cause of the issue; when it occurs, the exception ends the current worker task with a permanent failure, and the task won't be revisited. Thus the build remains pending because the worker that would have performed the next update failed to do so.
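
For reference, the failure path in the trace above has roughly this shape (paraphrased; see HarbormasterBuildWorker for the actual implementation):

// Paraphrased sketch of HarbormasterBuildWorker::loadBuild().
private function loadBuild() {
  $id = idx($this->getTaskData(), 'buildID');

  $build = id(new HarbormasterBuildQuery())
    ->setViewer(PhabricatorUser::getOmnipotentUser())
    ->withIDs(array($id))
    ->executeOne();

  if (!$build) {
    // A permanent failure archives the task immediately and never retries
    // it, which is why the build can sit in "pending" forever.
    throw new PhabricatorWorkerPermanentFailureException(
      pht('Invalid build ID "%s".', $id));
  }

  return $build;
}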

joshuaspence renamed this task from "Begin searching for drydock / harbormaster gremlins which prevent builds from starting properly" to "Begin searching for Drydock / Harbormaster gremlins which prevent builds from starting properly". (Nov 19 2015, 7:31 AM)
joshuaspence added a subscriber: joshuaspence.

I'm going to repurpose this as "get that data in the database so we're better prepared if this happens again", but feel free to follow up if you go on a hunting expedition or hit the issue in the meantime.

epriestley renamed this task from "Begin searching for Drydock / Harbormaster gremlins which prevent builds from starting properly" to "Add objectPHID to Harbormaster workers to make root cause analysis easier". (Nov 21 2015, 4:24 PM)
epriestley triaged this task as Normal priority.