Recently we see issues with huge commits (branch cuts for www) where people received hundreds of emails for the same commit. By checking all the active and archived tasks related to such commits, I saw the following pattern:
- The commit itself is marked as importStatus = 15 which means all the processing was actually done;
- In archived tasks, I see one PhabricatorRepositorySvnCommitMessageParserWorker, one PhabricatorRepositorySvnCommitChangeParserWorker, followed by many PhabricatorRepositoryCommitHeraldWorker, which means that the PhabricatorRepositoryCommitOwnersWorker (who schedule those herald tasks) was never done;
- PhabricatorRepositoryCommitOwnersWorker is always active (for days) with failureCount = 0;
- In daemon log I see a lot of lease expire exception for PhabricatorRepositoryCommitOwnersWorker.
So to me it looks like the following happened:
- Everything is fine until we schedule the PhabricatorRepositoryCommitOwnersWorker
- PhabricatorRepositoryCommitOwnersWorker actually successfully finished but its running time exceed 60s. Before it finishes, it scheduled the PhabricatorRepositoryCommitHeraldWorker task
- When we try to archive it, the lease expiration exception happened. As a result, it stayed active and will be picked up immediately since it is in the head of the queue
- The two steps above repeat forever until we kill it
I am not sure why we want to check lease expiration when we are archiving the task. For now I am giving the worker a little more time since parsing half million affected path needs some time..