
Prepare mail transmission to support failover across multiple mailers

Authored by epriestley on Feb 5 2018, 10:44 PM.



Ref T13053. Ref T12677. This restructures the calls and error-handling logic so that we can pass in a list of multiple mailers and get retry logic.

This doesn't actually ever use multiple mailers yet, and shouldn't change any behavior. I'll add multiple-mailer coverage a little further in, since there's currently no way to effectively test which of several mailers ended up transmitting a message.
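The failover semantics described here can be sketched roughly as follows. This is an illustrative Python sketch of the retry contract discussed in this review (try each mailer in order; a raised exception is a transient, retriable failure of that mailer; an explicit permanent failure means "never invoke this mailer again for this message") — it is not Phabricator's actual PHP implementation, and the class and function names are hypothetical.

```python
class PermanentFailure(Exception):
    """Raised by a mailer to signal it can never deliver this message."""

def send_with_failover(mailers, message):
    """Try each mailer in order; return the name of the one that sent it.

    Raises RuntimeError with the per-mailer errors if every mailer fails.
    """
    errors = []
    for mailer in mailers:
        try:
            mailer.send(message)
            return mailer.name
        except PermanentFailure as e:
            # Permanent: never invoke this mailer again for this message.
            errors.append((mailer.name, 'permanent', str(e)))
        except Exception as e:
            # Transient: the task queue above this layer will retry later.
            errors.append((mailer.name, 'transient', str(e)))
    raise RuntimeError('All mailers failed: %r' % (errors,))
```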

Test Plan
  • This has test coverage; tests still pass.
  • Poked around locally doing things that send mail, saw mail appear to send. I'm not attached to a real mailer, though, so my confidence in local testing is only so-so.

Diff Detail

rP Phabricator
Lint Not Applicable
Tests Not Applicable

Event Timeline

  • Rebase on top of the "Mail Stamps" changes to pick up the "multiplex" change, etc.
amckinley added inline comments.

What are the implications of this? Are we double-super-sure that some random mailer won't return "permanent" failure status codes for random transient issues?


I guess this is kinda like counting retries. It would be nice to guarantee that all emails will eventually leave the queue (especially since $messages seems like it can grow without bound under the current implementation).


OK, so these statuses get persisted up at line 485 after we invoke buildMailer().

So return null; means "never invoke this mailer again for this message", but raising an exception in buildMailer() is considered a transient failure of the particular mailer, and should be retried?

This revision is now accepted and ready to land. Feb 6 2018, 6:56 PM

We control when mailers return permanent failure, so we should be able to make sure they're only real permanent failures, for some definition of "permanent".

Today, I believe no actual production mailers ever return permanent failures, although the test mailer can and tests cover the behavior (and someone might have written a third-party mailer that does? I may be misremembering or making this up). So this is more "here's something that works if we actually identify a true permanent failure", and "I'm trying not to change too much stuff while rewriting everything" than a real mechanism. Possibly we should just delete this stuff since it's not clear there are really any use cases for it today or in the future.

I think I'd sort of imagined, say, internal corporate mailers delivering directly to mailboxes, so the server says "no, this mailbox definitely does not exist and I will never accept the mail, go away and stop bothering me". I'm not sure if that ever happens.

It might also be reasonable to make permanent bounces trigger permanent failures. For example, in the Mailgun adapter, we could test if the outbound address is on Mailgun's permanent bounce list before trying to deliver the mail. Bounce list management would also probably be a separate thing in an ideal world, though.
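The bounce-list idea above could look something like this. This is a purely hypothetical sketch: it assumes the adapter has some locally available set of permanently bounced addresses (how that set is fetched or cached from the mail provider is not modeled), and the names here do not correspond to any real Phabricator or Mailgun API.

```python
def classify_outbound(address, permanent_bounce_list):
    """Decide whether to attempt delivery to an address.

    Returns 'permanent' if the address is on the provider's permanent
    bounce list (so the mailer should report a permanent failure without
    attempting delivery), or 'deliver' otherwise.
    """
    if address.strip().lower() in permanent_bounce_list:
        return 'permanent'
    return 'deliver'
```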


There's a retry/permanent failure layer above this at the WorkerTask level that counts retries and does backoff. If the task fails, we sleep for 1, 2, 3, 4, ... seconds before trying again (I think; I might have the numbers slightly off since I'm just going from memory) and then fail forever after 250 retries. We're sleeping for 250 seconds between retries by the end, and the 250th retry happens ~4 days later, I think, if I remember the math right.
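As a back-of-the-envelope check on the schedule described above (which the author flags as a from-memory recollection, not a reading of the actual WorkerTask code): a strictly linear 1, 2, 3, ... 250-second schedule sums to well under a day, so the real backoff presumably grows faster than linearly if the final retry really lands days later.

```python
# Cumulative wait under a strictly linear backoff: 1 + 2 + ... + 250 seconds.
MAX_RETRIES = 250
total_wait = sum(range(1, MAX_RETRIES + 1))

print(total_wait)           # 31375 seconds
print(total_wait / 3600.0)  # ~8.7 hours of cumulative sleep under this schedule
```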

$messages gets reset/cleared each time through, so it can only have a maximum of one message from each mailer at any given time. It's "what went wrong the most recent time we tried to send this".

When we throw the exception below that'll end up in the daemon log, so you can see "what are all the reasons this has ever failed" in bin/phd log until that gets GC'd.



This revision was automatically updated to reflect the committed changes.