Page MenuHomePhabricator

Prepare mail transmission to support failover across multiple mailers
ClosedPublic

Authored by epriestley on Feb 5 2018, 10:44 PM.

Details

Summary

Ref T13053. Ref T12677. This restructures the calls and error handling logic so that we can pass in a list of multiple mailers and get retry logic.

This doesn't actually ever use multiple mailers yet, and shouldn't change any behavior. I'll add multiple-mailer coverage a little further in, since there's currently no way to effectively test which of several mailers ended up transmitting a message.

Test Plan
  • This has test coverage; tests still pass.
  • Poked around locally doing things that send mail, saw mail appear to send. I'm not attached to a real mailer though so my confidence in local testing is only so-so.

Diff Detail

Repository
rP Phabricator
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley created this revision.Feb 5 2018, 10:44 PM
epriestley requested review of this revision.Feb 5 2018, 10:45 PM
epriestley updated this revision to Diff 45564.Feb 6 2018, 12:24 PM
  • Rebase on top of the "Mail Stamps" changes to pick up the "multiplex" change, etc.
amckinley accepted this revision.Feb 6 2018, 6:56 PM
amckinley added inline comments.
src/applications/metamta/storage/PhabricatorMetaMTAMail.php
511–516

What are the implications of this? Are we double-super-sure that some random mailer won't return "permanent" failure status codes for random transient issues?

537–541

I guess this is kinda like counting retries. It would be nice to guarantee that all emails will eventually leave the queue (especially since $messages seems like it can grow without bound under the current implementation).

801–802

OK, so these statuses get persisted up at line 485 after we invoke buildMailer().

So return null; means "never invoke this mailer again for this message", but raising an exception in buildMailer() is considered a transient failure of the particular mailer, and should be retried?

This revision is now accepted and ready to land.Feb 6 2018, 6:56 PM
epriestley added inline comments.Feb 6 2018, 7:14 PM
src/applications/metamta/storage/PhabricatorMetaMTAMail.php
511–516

We control when mailers return permanent failure, so we should be able to make sure they're only real permanent failures, for some definition of "permanent".

Today, I believe no actual production mailers ever return permanent failures, although the test mailer can and tests cover the behavior (and someone might have written a third-party mailer that does? I may be misremembering or making this up). So this is more "here's something that works if we actually identify a true permanent failure", and "I'm trying not to change too much stuff while rewriting everything" than a real mechanism. Possibly we should just delete this stuff since it's not clear there are really any use cases for it today or in the future.

I think I'd sort of imagined, say, internal corporate mailers delivering directly to mailboxes, so the server says "no, this mailbox definitely does not exist and I will never accept the mail, go away and stop bothering me". I'm not sure if that ever happens.

It might also be reasonable to make permanent bounces trigger permanent failures. For example, in the Mailgun adapter, we could test if the outbound address is on Mailgun's permanent bounce list before trying to deliver the mail. Bounce list management would also probably be a separate thing in an ideal world, though.

537–541

There's a retry/permanent failure layer above this at the WorkerTask level that counts retires and does backoff. If the task fails, we sleep for 1, 2, 3, 4, ... seconds before trying again (I think, might have the numbers slightly off since I'm just going from memory) and then fail forever after 250 retries. We're sleeping for 250 seconds between retries by the end, and the 250th retry happens ~4 days later, I think, if I remember the math right.

$messages gets reset/cleared each time through, so it can only have a maximum of one message from each mailer at any given time. It's "what went wrong the most recent time we tried to send this".

When we throw the exception below that'll end up in the daemon log, so you can see "what are all the reasons this has ever failed" in bin/phd log until that gets GC'd.

801–802

Yep.

This revision was automatically updated to reflect the committed changes.