
Prepare mail transmission to support failover across multiple mailers

Authored by epriestley on Feb 5 2018, 10:44 PM.



Ref T13053. Ref T12677. This restructures the calls and error-handling logic so that we can pass in a list of multiple mailers and get retry logic.

This doesn't actually ever use multiple mailers yet, and shouldn't change any behavior. I'll add multiple-mailer coverage a little further in, since there's currently no way to effectively test which of several mailers ended up transmitting a message.
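The failover semantics described here can be sketched roughly as follows. This is an illustrative Python sketch of the retry contract discussed in this review (try each mailer in order; a raised exception is a transient, retriable failure of that mailer; an explicit permanent failure means "never invoke this mailer again for this message") — it is not Phabricator's actual PHP implementation, and the class and function names are hypothetical.

```python
class PermanentFailure(Exception):
    """Raised by a mailer to signal it can never deliver this message."""

def send_with_failover(mailers, message):
    """Try each mailer in order; return the name of the one that sent it.

    Raises RuntimeError with the per-mailer errors if every mailer fails.
    """
    errors = []
    for mailer in mailers:
        try:
            mailer.send(message)
            return mailer.name
        except PermanentFailure as e:
            # Permanent: never invoke this mailer again for this message.
            errors.append((mailer.name, 'permanent', str(e)))
        except Exception as e:
            # Transient: the task queue above this layer will retry later.
            errors.append((mailer.name, 'transient', str(e)))
    raise RuntimeError('All mailers failed: %r' % (errors,))
```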

Test Plan
  • This has test coverage; tests still pass.
  • Poked around locally doing things that send mail, saw mail appear to send. I'm not attached to a real mailer, though, so my confidence in local testing is only so-so.

Diff Detail

rP Phabricator
Lint Not Applicable
Tests Not Applicable

Event Timeline

  • Rebase on top of the "Mail Stamps" changes to pick up the "multiplex" change, etc.
amckinley added inline comments.

What are the implications of this? Are we double-super-sure that some random mailer won't return "permanent" failure status codes for random transient issues?


I guess this is kinda like counting retries. It would be nice to guarantee that all emails will eventually leave the queue (especially since $messages seems like it can grow without bound under the current implementation).


OK, so these statuses get persisted up at line 485 after we invoke buildMailer().

So return null; means "never invoke this mailer again for this message", but raising an exception in buildMailer() is considered a transient failure of the particular mailer, and should be retried?

This revision is now accepted and ready to land. Feb 6 2018, 6:56 PM

We control when mailers return permanent failure, so we should be able to make sure they're only real permanent failures, for some definition of "permanent".

Today, I believe no actual production mailers ever return permanent failures, although the test mailer can and tests cover the behavior (and someone might have written a third-party mailer that does? I may be misremembering or making this up). So this is more "here's something that works if we actually identify a true permanent failure", and "I'm trying not to change too much stuff while rewriting everything" than a real mechanism. Possibly we should just delete this stuff since it's not clear there are really any use cases for it today or in the future.

I think I'd sort of imagined, say, internal corporate mailers delivering directly to mailboxes, so the server says "no, this mailbox definitely does not exist and I will never accept the mail, go away and stop bothering me". I'm not sure if that ever happens.

It might also be reasonable to make permanent bounces trigger permanent failures. For example, in the Mailgun adapter, we could test if the outbound address is on Mailgun's permanent bounce list before trying to deliver the mail. Bounce list management would also probably be a separate thing in an ideal world, though.
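The bounce-list idea above could look something like this. This is a purely hypothetical sketch: it assumes the adapter has some locally available set of permanently bounced addresses (how that set is fetched or cached from the mail provider is not modeled), and the names here do not correspond to any real Phabricator or Mailgun API.

```python
def classify_outbound(address, permanent_bounce_list):
    """Decide whether to attempt delivery to an address.

    Returns 'permanent' if the address is on the provider's permanent
    bounce list (so the mailer should report a permanent failure without
    attempting delivery), or 'deliver' otherwise.
    """
    if address.strip().lower() in permanent_bounce_list:
        return 'permanent'
    return 'deliver'
```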


There's a retry/permanent failure layer above this at the WorkerTask level that counts retries and does backoff. If the task fails, we sleep for 1, 2, 3, 4, ... seconds before trying again (I think; I might have the numbers slightly off since I'm just going from memory) and then fail forever after 250 retries. We're sleeping for 250 seconds between retries by the end, and the 250th retry happens ~4 days later, I think, if I remember the math right.
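As a back-of-the-envelope check on the schedule described above (which the author flags as a from-memory recollection, not a reading of the actual WorkerTask code): a strictly linear 1, 2, 3, ... 250-second schedule sums to well under a day, so the real backoff presumably grows faster than linearly if the final retry really lands days later.

```python
# Cumulative wait under a strictly linear backoff: 1 + 2 + ... + 250 seconds.
MAX_RETRIES = 250
total_wait = sum(range(1, MAX_RETRIES + 1))

print(total_wait)           # 31375 seconds
print(total_wait / 3600.0)  # ~8.7 hours of cumulative sleep under this schedule
```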

$messages gets reset/cleared each time through, so it can only have a maximum of one message from each mailer at any given time. It's "what went wrong the most recent time we tried to send this".

When we throw the exception below that'll end up in the daemon log, so you can see "what are all the reasons this has ever failed" in bin/phd log until that gets GC'd.



This revision was automatically updated to reflect the committed changes.