PHD daemons regularly stopping/starting
Closed, Resolved (Public)

Description

Over the past few days I've noticed the daemons regularly stopping and then starting back up. I get the red banner about an unresolved issue, but by the time I ssh into the server and run ./bin/phd status they are usually back up and running.

According to the daemons.log file, this exception has been appearing since the 14th and has occurred >1500 times. So far I've only seen evidence of this one exception repeating, but I'm not sure whether other exceptions are happening too -- I'm still digging, which is a little difficult because the daemons.log file is ~2GB.

/var/tmp/phd/log/daemons.log
[20-Oct-2015 10:46:53 America/New_York] [2015-10-20 10:46:53] EXCEPTION: (PhutilProxyException) Error while executing Task ID 3671295. {>} (Exception) Diff "PHID-DIFF-fyfkn25bo6gtkutuepax" does not exist! at [<phabricator>/src/applications/differential/editor/DifferentialTransactionEditor.php:1535]
[20-Oct-2015 10:46:53 America/New_York] arcanist(head=stable, ref.master=d54cb072facd, ref.stable=1773aad85599), phabricator(head=stable, ref.master=7cc8a73d1efd, ref.stable=be4752f05a70, custom=1), phutil(head=stable, ref.master=83f09f6c5a03, ref.stable=c5a7d67db294)
[20-Oct-2015 10:46:53 America/New_York]   #0 <#2> DifferentialTransactionEditor::requireDiff(string, boolean) called at [<phabricator>/src/applications/differential/editor/DifferentialTransactionEditor.php:1243]
[20-Oct-2015 10:46:53 America/New_York]   #1 <#2> DifferentialTransactionEditor::buildMailBody(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2331]
[20-Oct-2015 10:46:53 America/New_York]   #2 <#2> PhabricatorApplicationTransactionEditor::buildMailForTarget(DifferentialRevision, array, PhabricatorMailTarget) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2288]
[20-Oct-2015 10:46:53 America/New_York]   #3 <#2> PhabricatorApplicationTransactionEditor::buildMail(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1048]
[20-Oct-2015 10:46:53 America/New_York]   #4 <#2> PhabricatorApplicationTransactionEditor::publishTransactions(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21]
[20-Oct-2015 10:46:53 America/New_York]   #5 <#2> PhabricatorApplicationTransactionPublishWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
[20-Oct-2015 10:46:53 America/New_York]   #6 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
[20-Oct-2015 10:46:53 America/New_York]   #7 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
[20-Oct-2015 10:46:53 America/New_York]   #8 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
[20-Oct-2015 10:46:53 America/New_York]   #9 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

There don't seem to be any related logs in nginx/php-fpm. I'm continuing investigation on my end.
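
For a log this large, one rough way to check whether other exceptions are hiding in it is to group the EXCEPTION lines by message. This is only a sketch: the function name summarize_exceptions is made up for illustration, and it assumes every exception line contains the literal text "EXCEPTION:" in the same layout as the excerpt above.

```shell
#!/usr/bin/env bash
# Summarize EXCEPTION lines in a phd log, most frequent message first.
# Task IDs and PHIDs are replaced with placeholders so that repeats of the
# same failure group together. (Hypothetical helper, not part of Phabricator.)
summarize_exceptions() {
  grep -F 'EXCEPTION:' "$1" \
    | sed -E 's/^.*EXCEPTION: //; s/Task ID [0-9]+/Task ID <N>/; s/PHID-[A-Z]+-[a-z0-9]{20}/<PHID>/g' \
    | sort | uniq -c | sort -rn
}
```

Something like `summarize_exceptions /var/tmp/phd/log/daemons.log | head` would then show the most common messages. If only one line comes back, this exception is likely the only one being logged.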

Event Timeline

cspeckmim updated the task description.
cspeckmim added a project: Daemons.
cspeckmim added a subscriber: cspeckmim.

Have you used bin/remove destroy to delete any diffs or revisions?

I don't recall having ever used that command on this install. I've only used that once back in Feb/March and I'm pretty sure it was on a test install.

Alright. In the short term, you can bin/worker cancel --id <id> any tasks with a bunch of failures to stop that specific task from retrying. Presumably a handful of problem tasks are causing the majority/entirety of the issue.
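
To find which tasks have "a bunch of failures", the task IDs can be pulled out of the daemon log and counted. The sketch below is an assumption-laden illustration, not an official tool: failing_tasks and print_cancel_commands are hypothetical names, the log format is taken from the excerpt in the description, and the cancel commands are only printed, not run.

```shell
#!/usr/bin/env bash
# List task IDs found in "Error while executing Task ID N" log lines,
# with the number of failures for each, most frequent first.
failing_tasks() {
  grep -o -E 'Error while executing Task ID [0-9]+' "$1" \
    | awk '{print $NF}' \
    | sort | uniq -c | sort -rn
}

# Print (do not run) a cancel command for every task that failed at least
# $2 times. Review the output before running any of it.
print_cancel_commands() {
  failing_tasks "$1" \
    | awk -v min="$2" '$1 + 0 >= min + 0 {print "./bin/worker cancel --id " $2}'
}
```

For example, `print_cancel_commands /var/tmp/phd/log/daemons.log 50` would print one cancel command per task that failed 50 or more times. The threshold of 50 is arbitrary, and the printed commands assume they will be run from the Phabricator root.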

Ok thanks -- here are the leased tasks which I'll cancel:

Screen Shot 2015-10-20 at 11.51.40 AM.png (656×918 px, 146 KB)

Noting here that this started happening again within the past day or two. I'm going to cancel the leased tasks again, but I'd like to find some way to determine what's going on. I suspect some diffs are no longer around (not sure how), and that's what is causing this. I'll grep through the daemon logs at some point.

Here's what I dug out of the logs to determine how many diffs are problematic (there are 27):

cspeckrun@specktop ~/D/daemon_logs> grep -o -E "PHID-DIFF-[a-z0-9]{20}" daemons.log | sort | uniq
PHID-DIFF-2tkk6i73v4o32wqfqcsb
PHID-DIFF-4tnlystz7it5w44j7imd
PHID-DIFF-5i5uoael4eeuhdj5hsgw
PHID-DIFF-75mxqtmqo6f7xot6ciar
PHID-DIFF-7kwzsmaymdbdbxxo7lru
PHID-DIFF-c5qhg3kcivjarfwd2wek
PHID-DIFF-cptimb5cxdariuh4xufh
PHID-DIFF-eqo6o5le2phkdlflqyxu
PHID-DIFF-f5fbtt6ap6odesigjbmj
PHID-DIFF-fyfkn25bo6gtkutuepax
PHID-DIFF-fzyhpnm7eb6gkwlo42pf
PHID-DIFF-ghrwxjwtf7x7ngxyctfn
PHID-DIFF-hv7oxcttgnlksvixhya3
PHID-DIFF-ial33nwcdim4it3ne2pd
PHID-DIFF-irahpyhwfl4mkrvwpcpf
PHID-DIFF-j3ucrnpm6ytqhnb6yv4y
PHID-DIFF-lu3vilalrz6bl4e3aexj
PHID-DIFF-lvwde63gotffedkrmqlm
PHID-DIFF-nl2q7sybe77ydc3rnh2d
PHID-DIFF-o4ugvxgy3djdptvi5dn7
PHID-DIFF-pzrxnjtrm4hmbik4d7af
PHID-DIFF-rn3eaygvie4ylbxqxmdd
PHID-DIFF-rxuv3g5g3pibqq7o4gxj
PHID-DIFF-szgjp2bc6znknxaddesg
PHID-DIFF-vohgcadppoygmn3judfy
PHID-DIFF-yyziqbxidx575ey5b3w2
PHID-DIFF-zplnxsntawlxmhu7sag4
cspeckrun@specktop ~/D/daemon_logs> ls -l
total 294424
-rw-r--r--+ 1 cspeckrun  staff  120935264 Dec 17 23:12 daemons.log
cspeckrun@specktop ~/D/daemon_logs> grep -o -E "PHID-DIFF-[a-z0-9]{20}" daemons.log | wc -l
   45645
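
The commands above give the distinct PHIDs and the total number of matches. A per-diff breakdown, plus a list of which task IDs reference which diff, might narrow things down further. This is a rough sketch with made-up function names (count_by_diff, tasks_by_diff), and it assumes the task ID and the diff PHID appear on the same log line, as they do in the exception quoted in the description.

```shell
#!/usr/bin/env bash
# Count how many times each diff PHID is mentioned, most frequent first.
count_by_diff() {
  grep -o -E 'PHID-DIFF-[a-z0-9]{20}' "$1" | sort | uniq -c | sort -rn
}

# Pair each diff PHID with the task IDs that failed on it
# (one "PHID task-id" pair per line, duplicates removed).
tasks_by_diff() {
  sed -n -E 's/.*Task ID ([0-9]+)\..*(PHID-DIFF-[a-z0-9]{20}).*/\2 \1/p' "$1" | sort -u
}
```

Running `count_by_diff daemons.log | head` shows which diffs account for most of the exceptions, and `tasks_by_diff daemons.log` shows which tasks keep retrying against each one.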

The log file (since October) is ~120 MB, and most of it is these exceptions. I'm not sure which task is failing, but it seems to be rescheduled frequently. If the diff is missing, would it be safe to not retry the task after a failure?

I still have no idea what these diffs originally were or how they went missing. I'm not really sure how to go about investigating this.

epriestley claimed this task.

Probably a dupe of T11708? I'm just going to kill this one since it's old as dirt.