Page MenuHomePhabricator

Disabled account should disable all outbound mail
Closed, WontfixPublic

Description

Right now when you disable an account the UI informs you that it also disables outbound e-mail notifications. And it does, for things like maniphest notifications and the like.

However, "confirm your e-mail" e-mails don't seem to be caught by this and the MTA jobs get stuck in their fail/retry loop rather than giving up permanently on the next attempt.

Example: I disabled a bogus user when the "confirm your e-mail" MTA job was on its ~200th attempt. It's now at its ~240th attempt :) Here's some stacktraces of the e-mail failure for extra context:

./bin/mail show-outbound --id 5487144
PROPERTIES
ID: 5487144
Status: queued
Related PHID: PHID-USER-hjtpcywvccalxztnwqul
Message: SMTP Error: The following recipients failed: {bogus e-mail snipped}
#0 /srv/phab/phabricator/externals/phpmailer/class.phpmailer.php(576): PHPMailer->SmtpSend('Date: Fri, 8 Ja...', 'Hi {bogus user}??Pl...')
#1 /srv/phab/phabricator/src/applications/metamta/adapter/PhabricatorMailImplementationPHPMailerAdapter.php(120): PHPMailer->Send()
#2 /srv/phab/phabricator/src/applications/metamta/storage/PhabricatorMetaMTAMail.php(733): PhabricatorMailImplementationPHPMailerAdapter->send()
#3 /srv/phab/phabricator/src/applications/metamta/PhabricatorMetaMTAWorker.php(26): PhabricatorMetaMTAMail->sendNow()
#4 /srv/phab/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): PhabricatorMetaMTAWorker->doWork()
#5 /srv/phab/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php(171): PhabricatorWorker->executeTask()
#6 /srv/phab/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php(22): PhabricatorWorkerActiveTask->executeTask()
#7 /srv/phab/libphutil/src/daemon/PhutilDaemon.php(183): PhabricatorTaskmasterDaemon->run()
#8 /srv/phab/libphutil/scripts/daemon/exec/exec_daemon.php(125): PhutilDaemon->execute()
#9 {main}

PARAMETERS
sensitive: 1
raw-to: ["{bogus e-mail snipped}"]
force: 1
subject: [Phabricator] Email Verification
actors.sent: []
routing.sent: null
routingmap.sent: []

HEADERS
X-Phabricator-Sent-This-Message: Yes
X-Mail-Transport-Agent: MetaMTA
X-Auto-Response-Suppress: All

TEXT BODY
Hi {bogus user}

Please verify that you own this email address ({bogus e-mail snipped}) by clicking this link:

  https://phabricator.wikimedia.org/emailverify/{snip key}/



HTML BODY
(This message has no HTML body.)

Event Timeline

eadler added a project: Restricted Project.Jan 11 2016, 9:38 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 11 2016, 9:42 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:15 PM
eadler moved this task from Backlog to Nits on the llvm board.
epriestley claimed this task.
epriestley added a subscriber: epriestley.

Example: I disabled a bogus user when the "confirm your e-mail" MTA job was on its ~200th attempt. It's now at its ~240th attempt :)

This job retries 250 times, then gives up. Retries are after 15s, 30s, 45s, etc. The final tries are about an hour apart, roughly a week later. Presumably, this job gave up a little less than 10 hours after these observations. (You can review the retry schedule, count, and remaining retries in the Daemon console in the web UI.)

These parameters were selected so you don't lose any outbound mail if you have a flaky / unavailable MTA: as long as it comes back up within a few days, everything will go back to normal. (If not, you have a week to stop the queue and figure out a recovery plan).

This week-long, 250-retry timeline was selected empirically, by observing a world-class, professional operations team's ability to run an MTA reliably at Facebook. Facebook's internal MTA regularly dropped as much as 97% of mail and had multi-day outages every couple weeks for a period of more than three months, so I can only deduce that running an MTA is a task that lies at the very extreme edge of human skill. Outside Facebook, two different providers (T10655, T12237) have both abruptly terminated our mail service without notice in the past year. In both cases we were able to work around the issue and not lose any mail once it was resolved, partly because of the conservative tuning of mail transmission parameters.

So I think we're prioritizing the right things here ("try very very very hard to deliver mail") given that the operational cost of this is pretty small (a slightly noisy logfile) while the operational cost of dropping mail when there's an MTA issue is awful, and my experience is that MTAs really do fall over somewhat consistently in the real world. From an operations perspective, I'm greatly comforted by knowing that we're tuned this way whenever mail flakes out since I know I just have to fix the problem and the system will self-heal without losing any mail.

This task was filed some time ago, and a few things are different now. At HEAD of master:

  • you can cancel the particular job with bin/worker cancel --id <ID>;
  • we will no longer send mail (except verification / password reset mail) to unverified addresses, so this sort of thing should only put one bad message in the queue;
  • you can unverify addresses which were once good but are now bad (for example, if they have begun bouncing) with bin/mail unverify <address>. This will stop future mail to this address (except verification / reset mail).

This particular case (the email is verify mail, not normal mail) is unusual. Most mail is sent to users, and later evaluated to mean "the user's primary address" and apply a bunch of settings which may stop delivery, including stopping delivery if the user is disabled (or, now, if their address is unverified).

Verify mail, password reset mail, invite mail, and a few other types of special mail are sent to raw email addresses. We don't currently check if the address happens to be associated with a user account to try to apply delivery modifications. Although we reasonably could do this, I think the complexity isn't worthwhile given that these tasks do eventually solve themselves (after about a week), are presumably very rare, should be even rarer now than they were in the past, and that other mechanisms now exist to remedy this issue.

We could try harder to stop this mail and I'm not totally opposed to doing so, but I'd like to see evidence that this is still a concern after the other changes before introducing more complexity.

(I'd also maybe like to try putting retry information in more places, since I think no one would care about this if these errors had "Delivery will retry 73 more times before failing permanently." -- they do say that now, but only in the web UI.)