Daemons erroneously saying "Diff does not exist"
Closed, DuplicatePublic

Description

We're seeing the following error in our daemon logs:

Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400] [2016-09-28 13:01:49] EXCEPTION: (PhutilProxyException) Error while executing Task ID 9802706. {>} (Exception) Diff "PHID-DIFF-pea6qvgvhs34cojvyi3s" does not exist! at [<phabricator>/src/applications/differential/editor/DifferentialTransactionEditor.php:1460]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400] arcanist(), integrator(head=master, ref.master=00fff5664e0d), integrator-client(), phabricator(), phutil(), sprint()
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #0 <#2> DifferentialTransactionEditor::requireDiff(string, boolean) called at [<phabricator>/src/applications/differential/editor/DifferentialTransactionEditor.php:1254]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #1 <#2> DifferentialTransactionEditor::buildMailBody(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2511]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #2 <#2> PhabricatorApplicationTransactionEditor::buildMailForTarget(DifferentialRevision, array, PhabricatorMailTarget) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2468]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #3 <#2> PhabricatorApplicationTransactionEditor::buildMail(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1150]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #4 <#2> PhabricatorApplicationTransactionEditor::publishTransactions(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #5 <#2> PhabricatorApplicationTransactionPublishWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #6 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #7 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #8 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 489647 STDE [Wed, 28 Sep 2016 13:01:49 -0400]   #9 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]

The error gets thrown here (line 1460).

The weird thing is that the diffs do exist when that error gets thrown, and often they are hours or days old. I still don't have any solid repro steps, but it looks like this has happened to about 50 distinct diffs in the past week (resulting in ~20k of these errors due to retries). I'm not quite sure how that block of code can get hit if the diff exists in the DB. I'll update as soon as I have repro steps.

Version Info

phabricator 6725f3719da3498670e07a1922762bbb4165764d (Tue, Sep 6)
arcanist 10e5194752901959507223c01e0878e6b8312cc5 (Aug 26 2016)
phutil c14343ee620e43f0ba55eaf80aabcba78c732d45 (Mon, Sep 5)

jcox created this task.Sep 28 2016, 6:55 PM

I believe I've seen these exceptions for the past year or so on our setup as well, though I hadn't actually confirmed the diffs do exist. I did confirm that nobody ever manually deleted diffs though.

If an object exists but is failing to load, it's usually because a required related object does not exist (in most cases, "required" means "required in order to evaluate policy checks").

For example, if a commit exists but the repository it belongs to does not, the commit will fail to load. Broadly, we must fail to load it because it's impossible for us to apply policy checks without the corresponding repository.

So this is expected if:

  • the diff belongs to a repository which you destroyed with bin/remove destroy (but we should probably fail permanently); or
  • the diff belongs to a revision which you destroyed with bin/remove destroy (but we should delete the diff already?); or
  • the diff belongs to a revision which belongs to a repository which you destroyed with bin/remove destroy (this should be a permanent failure too).

To diagnose this using the most advanced space-age diagnostic techniques, write a script like this:

test.php
#!/usr/bin/env php
<?php

require_once 'scripts/init/init-script.php';

// Do enough argument parsing to make "--trace" work.
$args = new PhutilArgumentParser($argv);
$args->parseStandardArguments();
$args->parse(array());

$target_phid = 'PHID-DIFF-????';

$viewer = PhabricatorUser::getOmnipotentUser();
$diff = id(new DifferentialDiffQuery())
  ->setViewer($viewer)
  ->withPHIDs(array($target_phid))
  ->executeOne();

if ($diff) {
  echo "Diff loaded.\n";
} else {
  echo "Diff did not load.\n";
}

Dump that in phabricator/, then run it:

phabricator/ $ nano test.php
phabricator/ $ # Write the script.
phabricator/ $ chmod +x test.php
phabricator/ $ ./test.php --trace

Now pop open DifferentialDiffQuery and add print statements above all the unset() calls, which are filtering results as not loadable. That usually pins down the cause pretty quickly.

(Conceivably, we could make Query classes emit some sort of information about why they were filtering out existing results on a side-channel, but it hasn't felt like a huge timesuck in my experience so far.)

Also possible is stuff like "the diff has a custom policy which you destroyed", etc.

jcox claimed this task.Oct 5 2016, 7:17 PM

Just spent a bit of time trying to repro this in a script but could not. The diffs always load successfully from the script. I even tried waiting until right after one of the errors popped up in the logs and then queried for that diff via the script right away.

I'd say it could be a race condition somewhere, but sometimes the diff's creation date is days before when the error occurs.

I don't think permissions are at play because the daemons should run as OmnipotentUser.

None of the 3 cases epriestley mentioned seem to be true for us; we rarely ./bin/remove destroy anything and we're seeing these errors on diffs across all of our repos.

I think my next step with this issue will be to put together a diff that adds some more logging information around these errors. Hopefully that results in either an "A-ha!" or a :facepalm: moment.

I think there is an extremely rare policy error somewhere in the policy code that we've occasionally seen in the form of sporadic unit test failures. I haven't been able to reproduce them in the same environment and they're false negatives (e.g., prevent access when it would be OK to allow access) so I haven't spent a ton of effort hunting them down, but it's possible that they have the same cause.

If so, my best guess is parent object caching interacting with test execution order (we run tests in random order to catch execution-order bugs).

This is probably easy to fix if we can get a repro but that's been elusive so far.

Also maybe related is that mail for "auser attached a revision to Txxxx: ..." sometimes has a false negative (shows "Unknown Revision"). I assumed that was some weird mail race and it's super rare so I haven't hunted that one down either, but it's possible that it's same-cause as these other apparent false-negative policy failures.

jcox added a comment.Oct 5 2016, 7:53 PM

Ah yeah it could be related to that (policy errors/unit test failures), although I wouldn't really describe this as "extremely rare" (except of course when I'm trying to make it happen). Some stats from grepping through logs: since last week it looks like this error has happened 73k times involving 193 distinct diffs. It's like the daemons are laughing at my inability to repro it.

Something cache-related seems promising, since it seems like most of the diffs that do fail tend to fail multiple times.

We're also seeing about 100k of these failures in our logs over the past few weeks. I clear out all the tasks and they quickly return. Have you found the root cause?

EXCEPTION: (PhutilProxyException) Error while executing Task ID xxxxxxxxx. {>} (Exception) Diff "xxxxxxxxx" does not exist! at [<phabricator>/src/applications/differential/editor/DifferentialTransactionEditor.php:1454]

Is there a way of just killing the tasks for these permanently? I'm also concerned this is putting unneeded data in our database.

The upstream still doesn't have a reproduction case; we won't act until we have one. See Providing Reproduction Steps.

Also, like @jcox, the diffs always load in that advanced space-age script like there's no problem. Perhaps these exceptions are aware of our attempts to end them. @epriestley Do you have an idea of something else we could try to see why so many exceptions are occurring?

jcox added a comment.Oct 12 2016, 1:42 AM

@jessjohnson I still haven't had a chance to dig further into this, but I'll report back when I do (or if you figure anything out, please let me know). I don't think killing off the tasks permanently is really an option since this seems to happen for specific degenerate instances of otherwise-needed tasks. Perhaps not allowing these tasks to retry would be an okay bandaid, but I'd still like to figure out the underlying problem if we can.

Just checked this morning and many tasks have returned and started to fail again. I'm sure there's a way to reproduce this, but our setup is a bit complex and I'm not sure where to start. @epriestley Could you tell me how to determine what these tasks are even doing? Is there a way to see what action triggered them and why they will fail indefinitely?

See Consulting if you'd like me to help you personally. Thanks!

@jcox, I dove into this a bit, but like you, I didn't come up with much. We have not used bin/remove destroy for any of our repos and there doesn't appear to be any meaningful relationship between the failing tasks that I have found yet. The script that @epriestley kindly provided was able to load any diff I threw at it. The script was helpful in determining the right queries to make to investigate further.

Here's how my investigation went:

I started by grepping the log for all PHIDs (it appeared that PHIDs only showed up when there were failures)

cat /var/tmp/phd/log/daemons.log | grep -o 'PHID\-DIFF\-.\{20\}' > output.txt

Sorted the output and recorded how many occurances:

sort output.txt | uniq -c > output_sorted.txt

Then I attempted to find out if there was anything related among all those. For example, I massaged the data and searched the database for all repositories to see if there was any relation.

Find the repositories

SELECT COUNT('repositoryPHID'), repositoryPHID 
FROM `differential_diff`   
WHERE (phid IN ('PHID-DIFF-xxxxxxxxxxxxxxxx','PHID-DIFF-xxxxxxxxxxxxxxxx','PHID-DIFF-xxxxxxxxxxxxxxxx',...))  
GROUP BY repositoryPHID

Get the repository information:

SELECT `r`.*, r.* 
FROM `repository` r  
WHERE (r.phid IN ('PHID-REPO-xxxxxxxxxxxxxxxxxxxx','PHID-REPO-xxxxxxxxxxxxxxxxxxxx','PHID-REPO-xxxxxxxxxxxxxxxxxxxx',...))   ORDER BY `r`.`id` DESC

I didn't see any apparent relationship, so I just ended up inserting some code to kill the task after 5 tries because the log was growing too large.

phabricator/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php

// kill task after 5 attempts
public function getMaximumRetryCount(){
    return 5;
}

@jcox I just noticed this line from you:

I don't think permissions are at play because the daemons should run as OmnipotentUser.

I agree, it should be, but I'm not sure that's the case. The only real difference between the code that's running with phd and the test.php script is the viewer being set.

From test.php (referenced above)

$viewer = PhabricatorUser::getOmnipotentUser();
$diff = id(new DifferentialDiffQuery())
  ->setViewer($viewer)
  ->withPHIDs(array($target_phid))
  ->needChangesets(true)
  ->executeOne();

From phabricator/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php

$query = id(new DifferentialDiffQuery())
      ->withPHIDs(array($phid))
      ->setViewer($this->getActor());

Do you know how we can know what the viewer is set to when running phd?

jcox added a comment.Oct 13 2016, 11:02 AM

Hmmm I'm not seeing those lines in that file anywhere. On the current head of master, I'm pretty sure the viewer is always set to PhabricatorUser::getOmnipotentUser() in TransactionPublishWorker. Although maybe that isn't making it all the way to the policy aware query in every case.

Although it's still a little early to be sure, we appear to have fixed the policy issue with unit test failures (D17029). Not sure if that actually impacted this or not.

When this is in stable and we upgrade I will have a reminder to keep an eye on logs and come back here

I upgraded last Friday to 2604c5af55f654d36f8db2f080b96486c4572216, so far this exception has not popped up again. I will check again later in the week.

I'm just going to merge this into T12115, which isn't really related, but will rewrite this code and probably "fix" this, since a reproduction case seems elusive.

I don't believe we've seen this pop up again. I can't be certain because right now all the log files are empty, but that means the errors went away?