daemons constantly in need of restarting, after failed SVN import atttempt
Closed, InvalidPublic
Actions

Assigned To

None

Authored By

	anda
	Jun 29 2015, 1:38 PM

Description

I tried to import a SVN-repository and got stuck on some SSL-issues. After solving those, the import seemed to get stuck at 0%. When I finally noticed that something was wrong, quite some time had passed.

This is the first error after the SSL-certificate was fixed, was this:

[29-Jun-2015 14:04:22 Europe/Stockholm] [2015-06-29 14:04:22] EXCEPTION: (PhutilProxyException) Permanent failure while executing Task ID 309683. {>} (PhabricatorWorkerPermanentFailureException) Commit "43374" does not exist. at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:23]
[29-Jun-2015 14:04:22 Europe/Stockholm] arcanist(head=stable, ref.master=e97cdc6c9a6c, ref.stable=29839e8c72c5), phabricator(head=stable, ref.master=2fbc65e39649, ref.stable=7f22a0e0fd43), phutil(head=stable, ref.master=88f6ecedbd45, ref.stable=ea6830cc8019)
[29-Jun-2015 14:04:22 Europe/Stockholm]   #0 <#2> PhabricatorRepositoryCommitParserWorker::loadCommit() called at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:31]
[29-Jun-2015 14:04:22 Europe/Stockholm]   #1 <#2> PhabricatorRepositoryCommitParserWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91]
[29-Jun-2015 14:04:22 Europe/Stockholm]   #2 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:162]
[29-Jun-2015 14:04:22 Europe/Stockholm]   #3 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
[29-Jun-2015 14:04:22 Europe/Stockholm]   #4 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
[29-Jun-2015 14:04:22 Europe/Stockholm]   #5 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

After noticing this, I've tried to remove the SVN repository in question, as instructed by running bin/remove destroy rCS. Restart both apache, phd daemons, and finally the whole server... Hower, those errors still continue to show up constantly:

~/phabricator$ bin/phd debug task
Launching daemons (in debug mode):
(Logs will appear in "/var/tmp/phd/log/daemons.log".)

    PhabricatorTaskmasterDaemon (Static)


    phabricator/scripts/daemon/ $ ./phd-daemon '--trace' '--verbose'

>>> [2] <event> daemon.didLaunch <listeners = 2>
>>> [3] <connect> phabricator_daemon
<<< [3] <connect> 918 us
>>> [4] <query> INSERT INTO `daemon_log` (`daemon`, `host`, `pid`, `daemonID`, `runningAsUser`, `argv`, `explicitArgv`, `envHash`, `envInfo`, `status`, `dateCreated`, `dateModified`) VALUES ('PhabricatorTaskmasterDaemon', 'cs1', '3598', '3598:y4ewcnv', 'phabricator', '[\"--trace\",\"--verbose\"]', '[]', 'f73fefb876e1fbc33e71d3879f6999afc72c229f', '{\"version\":1,\"keys\":\"22i1Mz7MGwus\",\"values\":{\"auth.email-domains\":\"av9oIGyxbzc1\",\"config.ignore-issues\":\"tZn8TIpKCvfy\",\"differential.require-test-plan-field\":\"
<<< [4] <query> 51,869 us
<<< [2] <event> 78,881 us
2015-06-29 3:32:52 PM [INIT] Starting process.
>>> [5] <event> daemon.didLogMessage <listeners = 2>
>>> [6] <query> INSERT INTO `daemon_logevent` (`logID`, `logType`, `message`, `epoch`) VALUES ('334', 'INIT', 'Starting process.', '1435584772')
<<< [6] <query> 2,959 us
<<< [5] <event> 4,498 us
>>> [7] <exec> $ exec ./exec_daemon.php 'PhabricatorTaskmasterDaemon' '--trace' '--verbose'
<<< [7] <exec> 500,895 us
2015-06-29 3:32:53 PM [STDE] >>> [0] <connect> phabricator_config
<<< [0] <connect> 4,660 us
>>> [1] <query> SELECT * FROM `config_entry` WHERE namespace = 'default' AND isDeleted = 0 
<<< [1] <query> 302 us
>>> [2] <connect> phabricator_worker
<<< [2] <connect> 881 us
>>> [3] <query> SELECT id, leaseOwner FROM `worker_activetask` WHERE (leaseOwner IS NULL) ORDER BY priority ASC, id ASC LIMIT 1
<<< [3] <query> 576 us
>>> [4] <query> UPDATE `worker_activetask` task
              SET leaseOwner = '3625:1435584772:cs1:1', leaseExpires = UNIX_TIMESTAMP() + 7200
              WHERE (leaseOwner IS NULL) AND (id IN (313325))
<<< [4] <query> 39,441 us
>>> [5] <query> SELECT task.*, taskdata.data _taskData, UNIX_TIMESTAMP() _serverTime
        FROM `worker_activetask` task LEFT JOIN `worker_taskdata` taskdata
          ON taskdata.id = task.dataID
        WHERE task.leaseOwner = '3625:1435584772:cs1:1' AND leaseExpires > UNIX_TIMESTAMP() ORDER BY priority ASC, id ASC LIMIT 1
<<< [5] <query> 8,200 us
<<VERB>> PhabricatorTaskmasterDaemon Working on task 313325 (PhabricatorRepositorySvnCommitMessageParserWorker)...
>>> [6] <connect> phabricator_repository
<<< [6] <connect> 800 us
>>> [7] <query> SELECT * FROM `repository_commit` WHERE `id` = 45722 
<<< [7] <query> 398 us
>>> [8] <query> START TRANSACTION
<<< [8] <query> 221 us
>>> [9] <query> DELETE FROM `worker_activetask` WHERE id = 313325
<<< [9] <query> 412 us
>>> [10] <query> INSERT INTO `worker_archivetask` (`duration`, `result`, `taskClass`, `leaseOwner`, `leaseExpires`, `failureCount`, `dataID`, `priority`, `objectPHID`, `id`, `dateCreated`, `dateModified`) VALUES ('0', '1', 'PhabricatorRepositorySvnCommitMessageParserWorker', '3625:1435584772:cs1:1', '1435591972', '0', '313323', '2000', NULL, '313325', '1435584773', '1435584773')
<<< [10] <query> 13,115 us
>>> [11] <query> COMMIT
<<< [11] <query> 3,844 us
[2015-06-29 15:32:53] EXCEPTION: (PhutilProxyException) Permanent failure while executing Task ID 313325. {>} (PhabricatorWorkerPermanentFailureException) Commit "45722" does not exist. at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:23]
arcanist(head=stable, ref.master=e97cdc6c9a6c, ref.stable=29839e8c72c5), phabricator(head=stable, ref.master=2fbc65e39649, ref.stable=7f22a0e0fd43), phutil(head=stable, ref.master=88f6ecedbd45, ref.stable=ea6830cc8019)
  #0 <#2> PhabricatorRepositoryCommitParserWorker::loadCommit() called at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:31]
  #1 <#2> PhabricatorRepositoryCommitParserWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91]
  #2 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:162]
  #3 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
  #4 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
  #5 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]
>>> [8] <event> daemon.didLogMessage <listeners = 2>
>>> [9] <query> INSERT INTO `daemon_logevent` (`logID`, `logType`, `message`, `epoch`) VALUES ('334', 'STDE', '>>> [0] <connect> phabricator_config\n<<< [0] <connect> 4,660 us\n>>> [1] <query> SELECT * FROM `config_entry` WHERE namespace = \'default\' AND isDeleted = 0 \n<<< [1] <query> 302 us\n>>> [2] <connect> phabricator_worker\n<<< [2] <connect> 881 us\n>>> [3] <query> SELECT id, leaseOwner FROM `worker_activetask` WHERE (leaseOwner IS NULL) ORDER BY priority ASC, id ASC LIMIT 1\n<<< [3] <query> 576 us\n>>> [4] <query>
<<< [9] <query> 2,621 us
<<< [8] <event> 3,911 us
2015-06-29 3:32:53 PM [FAIL] Process exited with error 255

The commit id and task id is constantaly increasing by 1.

Any pointer on how to get rid of these? Or what kind of info I need to find?
I'm perfectly fine with not importing the SVN-repository again (it's mostly HW design in that for the moment). However, I'd like to not have the phd daemons constantly being in need of a restart.

Related Objects

Mentioned In: T8707: Task descriptions with leading newlines are not preserved correctly across edits
Mentioned Here: T3554: Implement a `bin/worker kill <criteria>` for batch-flushing tasks from the queue
T7079: Implement `bin/worker purge` or similar to destroy tasks with unknown task classes

Event Timeline

anda created this task.Jun 29 2015, 1:38 PM

anda raised the priority of this task from to Needs Triage.

anda updated the task description. (Show Details)

anda added a subscriber: anda.

When you say "need of a restarting", what do you mean, specifically?

These errors all look normal, and its expected for the daemons to exit after encountering errors like this. They should restart themselves automatically after a few seconds. Are you not seeing that?

Hm, looking closer at the trace above, it looks like phabricator is still trying to parse some SVN commit messages, PhabricatorRepositorySvnCommitMessageParserWorker. How can I completely wipe out all traces of this repository, as it seems that the removal above wasn't enough.
`

@epriestley well, yes they do restart again. Just to fail once again on a new commit. Which basically means that I'm constantly having the banner of an unresolved setup issue (as no daemons are currently running).

As I've tried to remove imported repository completely, I'd expect those errors to stop coming. To me it looks like as there still something left that phabricator tries to parse.

Did you get this warning when you ran bin/remove destroy?

phabricator/ $ ./bin/remove destroy rXYZ


                                  uuuuuuu
                               uu###########uu
                            uu#################uu
                           u#####################u
                          u#######################u
                         u#########################u
                         u#########################u
                         u######"   "###"   "######u
                         "####"      u#u       ####"
                          ###u       u#u       u###
                          ###u      u###u      u###
                           "####uu###   ###uu####"
                            "#######"   "#######"
                              u#######u#######u
                               u#"#"#"#"#"#"#u
                    uuu        ##u# # # # #u##       uuu
                   u####        #####u#u#u###       u####
                    #####uu      "#########"     uu######
                  u###########uu    """""    uuuu##########
                  ####"""##########uuu   uu#########"""###"
                   """      ""###########uu ""#"""
                             uuuu ""##########uuu
                    u###uuu#########uu ""###########uuu###
                    ##########""""           ""###########"
                     "#####"                      ""####""
                       ###"                         ####"


...

 IMPORTANT  DELETING OBJECTS OFTEN BREAKS THINGS

Destroying objects may cause related objects to stop working, and may leave
scattered references to objects which no longer exist. In most cases, it is
much better to disable or archive objects instead of destroying them. This
risk is greatest when deleting complex or highly connected objects like
repositories, projects and users.

These tattered edges are an expected consquence of destroying objects, and
the Phabricator upstream will not help you fix them. We strongly recomend
disabling or archiving objects instead.

...

Hm, quite likely, yes...

Ok, so lets close this one then... I guess.

This still sounds like a potential issue, I just wanted to make sure the messaging around it isn't unclear.

The daemon queue will fail all these tasks eventually and go back to normal. You can see how many tasks are left in the queue at /daemon/.

There isn't an easy way to bulk-clear all these tasks right now. You can cancel them individually with bin/worker cancel (see T3554) but there's no bin/worker cancel-everything-matching <some query> (see also T7079).

You can delete them from the database manually, but this is a little dangerous.

Deleting the repository should ideally clean these tasks up, but we don't currently set the objectPHID for them correctly so the DestructionEngine doesn't know to destroy the workers.

We currently treat PhabricatorWorkerPermanentFailureException the same way we treat unexpected failures and bail out completely, but there's no technical reason for this. This makes discarding failing tasks relatively slow. Possibly, we should resume processing tasks immediately after encountering a permanent failure.

If you hit the first error before destroying the repository, that's also a real bug, although one that we may have a hard time reproducing.

(But it's also expected that when you "y" through the giant red ASCII art skull prompt, you're accepting consequences like this.)

Yes, I'm certainly going to take my responsibility for that... (Though, maybe the command to delete repositories shouldn't be given completely in clear in the web-interface. As that likely made it sligthly easier to disregard the scary skull prompt...)

Yep, the first error occured a couple of minutes after I had fixed the SSL issues, and it was at least 45min to an hour later I that I first tried to disable and delete the repository. That was when the importer still was at 0.0%...

Good, I likely have a look at doing it the manual way tomorrow (when I get back into the office). I just have to find a good way to find all those tasks. Do you think it would be possible to find them easier as this was the only SVN repo? (Otherwise we're just using git).

chad updated the task description. (Show Details)Jun 29 2015, 4:40 PM

chad added a project: Daemons.

chad mentioned this in T8707: Task descriptions with leading newlines are not preserved correctly across edits.Jun 29 2015, 4:56 PM

You were right, a few hour later, the it seems that all outstanding tasks had been failed, and the restarts stopped.

(I never noticed in the beginning that the task queue under /daemon/ were shrinking, so I jumped to the conclusion that it was the same tasks being re-enumerated constantly.)

Thus it seems that the only issue left, that's worth to be in this task, is the question why the original exceptions started, as that was well in advance of me trying to remove the repo in question. OTOH, I've seen a few other tasks having issues with SVN imports, especially if there was any large commits. I need to double the history to see if any of the early commits are extremely large; I know that further down the history of the SVN repo, there are some fairly large commits.

And unfortunately, as usual, the SVN repo in question is behind a firewall etc...

This is almost a year old and we have no idea how to reproduce it. If anyone hits this, feel free to file a new issue with reproduction steps (see Providing Reproduction Steps).

daemons constantly in need of restarting, after failed SVN import atttemptClosed, InvalidPublicActions

Description

Related Objects

Event Timeline

daemons constantly in need of restarting, after failed SVN import atttempt
Closed, InvalidPublic
Actions