Our users are receiving duplicate emails. For example, 1 user has received an email every 5 minutes for the past 3 hours, for each of 3 issues. So 3 * (60/5) * 3 emails. All 3 email groups are 'somebody accepted a commit'.
Description
Revisions and Commits
Restricted Diffusion Commit | |||
rP Phabricator | |||
D13320 | rP30c4783c426f Dramatically limit the range of failures which can cause duplicate mail | ||
D13319 | rPcace777ba0db Fail gracefully on bad API call when building commit mail | ||
D13319 | rPbb7f2ea9055c Fail gracefully on bad API call when building commit mail |
Related Objects
- Mentioned In
- rPc5960c71f929: Splice in a patch to remove Phriction content rows with no document
- Mentioned Here
- rPbb7f2ea9055c: Fail gracefully on bad API call when building commit mail
rPcace777ba0db: Fail gracefully on bad API call when building commit mail
D13320: Dramatically limit the range of failures which can cause duplicate mail
T5791: Write Herald rules for outbound mail
D13319: Fail gracefully on bad API call when building commit mail
T6367: Send email with recipient's language and access levels, not sender's
D13123: Add "Mailing List" users
D13000: Make "Tags" and "Branches" custom fields work properly in Diffusion
Event Timeline
If you visit one of the commit pages, do you see the 'somebody accept a commit' message duplicated / other duplicated activity?
Here's the immediate trace:
Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] [2015-06-16 15:35:02] EXCEPTION: (PhutilProxyException) Error while executing Task ID 224593. {>} (ConduitClientException) ERR-INVALID-AUTH: User account is not permitted to use the API. at [<phutil>/src/conduit/ConduitFuture.php:58] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] arcanist(head=master, ref.master=7d15b85a1bc0), libcore(), phabricator(head=master, ref.master=ee8de4a9ec55), phutil(head=master, ref.master=92882eb9404d), services(head=master, ref.master=2d76591c4f87) Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #0 <#2> ConduitFuture::didReceiveResult(array) called at [<phutil>/src/future/FutureProxy.php:60] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #1 <#2> FutureProxy::getResult() called at [<phutil>/src/future/FutureProxy.php:37] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #2 <#2> FutureProxy::resolve() called at [<phutil>/src/conduit/ConduitClient.php:58] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #3 <#2> ConduitClient::callMethodSynchronous(string, array) called at [<phabricator>/src/applications/diffusion/conduit/DiffusionQueryConduitAPIMethod.php:123] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #4 <#2> DiffusionQueryConduitAPIMethod::execute(ConduitAPIRequest) called at [<phabricator>/src/applications/conduit/method/ConduitAPIMethod.php:90] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #5 <#2> ConduitAPIMethod::executeMethod(ConduitAPIRequest) called at [<phabricator>/src/applications/conduit/call/ConduitCall.php:134] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #6 <#2> ConduitCall::executeMethod() called at [<phabricator>/src/applications/conduit/call/ConduitCall.php:84] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #7 <#2> ConduitCall::execute() called at [<phabricator>/src/applications/repository/customfield/PhabricatorCommitBranchesField.php:34] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #8 <#2> PhabricatorCommitBranchesField::updateTransactionMailBody(PhabricatorMetaMTAMailBody, PhabricatorAuditEditor, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2607] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #9 <#2> PhabricatorApplicationTransactionEditor::addCustomFieldsToMailBody(PhabricatorMetaMTAMailBody, PhabricatorRepositoryCommit, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2537] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #10 <#2> PhabricatorApplicationTransactionEditor::buildMailBody(PhabricatorRepositoryCommit, array) called at [<phabricator>/src/applications/audit/editor/PhabricatorAuditEditor.php:646] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #11 <#2> PhabricatorAuditEditor::buildMailBody(PhabricatorRepositoryCommit, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2303] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #12 <#2> PhabricatorApplicationTransactionEditor::sendMailToTarget(PhabricatorRepositoryCommit, array, PhabricatorMailTarget) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2260] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #13 <#2> PhabricatorApplicationTransactionEditor::sendMail(PhabricatorRepositoryCommit, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1046] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #14 <#2> PhabricatorApplicationTransactionEditor::publishTransactions(PhabricatorRepositoryCommit, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #15 <#2> PhabricatorApplicationTransactionPublishWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #16 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:162] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #17 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #18 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:185] Daemon 237 STDE [Tue, 16 Jun 2015 15:35:02 -0700] #19 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]
No, we do not see repeated activity on the commit page. Just a single 'accepted commit'.
FYI there are at least 3 users in my organization experiencing this flood of email.
The issue is roughly:
- We try to send mail to several users, including a disabled user.
- The first few users work fine, then we try to send to the disabled user.
- Generating the mail body makes an internal API call.
- The internal API call fails, either because of stricter rules around which users are permitted to establish API sessions, or possibly because of the introduction of the RepositoryBranch field (I believe we make no other API calls while building mail bodies).
- The task fails, then retries later.
This is a consequence primarily of T6367, and partially of either the stricter checks in D13123 or the API call in D13000.
I've manually cancelled the tasks which are retrying, so that should stop the email in the short term (at least, until another commit is made). I'll work on fixing the root issue now. At least in the short term, I think I'm just going to catch exceptions on the API call, since it could fail for other reasons.
D13319 should fix this specific issue.
I'd like to additionally:
- Catch and fail mail body construction issues as a whole, although this probably makes sense to hold until after the UI part of T5791, since it may produce hard-to-understand issues otherwise.
- Either mark mail we've sent as we go (so we never re-send mail) or hold all the mail to the end and then send it all at once. The latter of those approaches seems pretty clean, so I'm leaning in that direction. It isn't 100% bulletproof but the section of code that could fail and result in send duplicate mail is very small.
Repository tier is restarting now with D13319, so this should be fixed in about a minute.
I believe this is now resolved. Thanks for the report, and sorry for the duplicate mail. I've issued your instance a service credit for the last 48 hours, which will be reflected on your next bill.
Please let us know if you run into anything else.
Some more information, for future reference:
I assessed this issue by using bin/remote mysql admin001 and querying the admin_instances tables to identify which instance was affected. (It would be nice to have a cleaner way to do this, but when we eventually move the support queue to Nuance that should more or less happen automatically.)
I suspected this was a daemon stuck in a loop, so I used bin/remote ssh repo001 to connect to the repository tier and then PHABRICATOR_INSTANCE=... bin/phd log to examine the log after identifying the affected instance. This yielded the trace above, which confirmed that this was a daemon issue.
I used bin/remote ssh db001 to connect to the instance database and pulled the task out of the worker_activetask table using WHERE id = 224593 (from the log message). I noted the dataID and pulled that out of the worker_taskdata table; this contained a JSON blob of pertinent information, including mail recipients.
I selected the recipient user PHIDs from the ..._users.user table and verified that one had a nonstandard state (isDisabled flag was set), which suggested that the issue was straightforward and nothing tricky/subtle was going on.
I found all the tasks with repeated failures in the worker_activetask table and used PHABRICATOR_INSTANCE=... bin/worker cancel --id <id> to individually stop them (there were 4, with 43 failures each, which roughly squares with the numbers in the original report). This stopped the issue temporarily.
I implemented D13319 and deployed it after review, then restarted the tier. I believe this stopped the issue permanently (although not the class of issue). I examined the worker_activetask table again, and then again after a few minutes, to make sure no more tasks were getting stuck.
To follow up, I wrote D13320 (which should provide a more robust fix). I don't plan to deploy this immediately, since I think the branch fix will hold until the normal deployment cycle (it is very unusual to issue API calls from within custom fields).
We haven't had to touch the cluster midweek very much so far, but it is likely to get more common in the future. To make these kinds of deployments easier, I changed the deployment scripts to deploy a named branch instead of always deploying master, then realigned rP, rARC and rPHU to use stable and created stable branches in each repository based on the most recent (Week 24) changelog hashes, which I know to be stable other than this issue. I cherry-picked D13319 (rPbb7f2ea9) onto the stable branch (as rPcace777b) to create an actually-stable branch. I'll likely go clean up the less-robust patch on the repo tier later tonight to vet this a little more, ahead of the main weekly deployment.