Support multiple mail delivery services for automatic failover
Closed, ResolvedPublic

Assigned To

Authored By

	epriestley
	May 5 2017, 3:45 PM

Description

Roughly once or twice a year, our upstream mailer rejects mail for a few hours. Previously:

In April 2015, T7746. Mailgun assigned us to a shared IP address which was blacklisted in "SORBS-SPAM". I manually identified this and asked Mailgun for advice, including whether a dedicated IP was worth pursuing. At the time, our mail volume was so low that we were advised a dedicated IP didn't make sense, but they switched us to a different shared IP. This appeared to resolve the issue.
In July 2015, T8781. A small number of instances had mail queued because of operator error.
In March 2016, T10655. Mailgun terminated our service without notice or explanation, apparently due to a bug. It was restored after a few hours.
In July 2016, some Phacility users reported mail going to "Spam" folders. I manually identified that the shared IP address was now on the "SPAMCANNIBAL" blacklist and engaged Mailgun support for a dedicated IP. The dedicated IP they issued was itself on the Barracuda blacklist, but they resolved that fairly quickly. I believe this only affected a handful of users and didn't make it into a task.
In Feb 2017, T12237. SES abruptly terminated our service without notice in apparent violation of their written policy. This only affected secure.phabricator.com, which was still sending through my personal account. We switched to Mailgun.
In May 2017 (yesterday), Mailgun abruptly changed their API and began rejecting mail it had previously accepted, without notice or guidance. Our behavior was legitimately wrong (D17831 approximately corrects it), but their response did not instill great confidence in me.

In other areas of the application, we've moved to improve reliability by allowing installs to list multiple services. Most recently, in T12450, we allowed installs to configure multiple search backends simultaneously so that, for example, queries could fall back to MySQL if ElasticSearch was not available.

We could support that for mail as well: for example, allow mail to be configured to egress through either SES or Mailgun, so that if one failed, the other could handle the entire load until we could email support to get them to turn things back on.
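The failover idea above can be sketched in a few lines. This is only an illustration in Python, not the Phabricator implementation: the `Mailer` class, `MailerError`, `send_with_failover()`, and the two stub senders are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


class MailerError(Exception):
    """Raised by a (hypothetical) mailer when it cannot accept a message."""


@dataclass
class Mailer:
    key: str                      # illustrative label, e.g. "ses" or "mailgun"
    send: Callable[[str], None]   # raises MailerError if the service rejects mail


def send_with_failover(mailers: Sequence[Mailer], message: str) -> str:
    """Try each mailer in order and return the key of the first that accepts."""
    errors: List[str] = []
    for mailer in mailers:
        try:
            mailer.send(message)
            return mailer.key
        except MailerError as exc:
            # Remember why this pathway failed, then fall through to the next.
            errors.append(f"{mailer.key}: {exc}")
    # Every pathway failed; the caller should leave the message queued.
    raise MailerError("all mailers failed: " + "; ".join(errors))


def rejecting_sender(message: str) -> None:
    """Stub that simulates a service which has suspended our account."""
    raise MailerError("service suspended")


def accepting_sender(message: str) -> None:
    """Stub that simulates a healthy service."""
    return None
```

With `[Mailer("ses", rejecting_sender), Mailer("mailgun", accepting_sender)]`, the message goes out through the second mailer. If every mailer fails, the exception lets the caller keep the message in the queue and retry later, which matches the "delayed, not lost" behavior described in this task.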

This could also potentially let us run a mailserver ourselves, accepting a small fraction of the traffic (say, 1% of outbound mail). This is likely a very large amount of work, but the stakes would be lower if only a tiny amount of mail egressed through it and it was mostly intended as a backstop in case both SES and Mailgun failed simultaneously. We could also assess the challenges in a fairly safe way and get a better idea of what we're looking at. If it turns out that it really is incredibly hard, we could just give up, but we currently have to do a lot of monitoring anyway and are subject to arbitrary service terminations and API changes.
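One way to send roughly 1% of mail through a self-hosted server is weighted random selection of the primary mailer, with the remaining mailers kept as fallbacks. The sketch below is an assumption about how this could work, not a description of existing behavior; `order_mailers()` and the weight values are hypothetical.

```python
import random
from typing import Dict, List, Optional


def order_mailers(weights: Dict[str, float],
                  rng: Optional[random.Random] = None) -> List[str]:
    """Pick a primary mailer in proportion to its weight, then list the
    remaining mailers as fallbacks, heaviest first."""
    rng = rng or random.Random()
    keys = list(weights)
    primary = rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]
    fallbacks = sorted(
        (k for k in keys if k != primary),
        key=lambda k: weights[k],
        reverse=True,
    )
    return [primary] + fallbacks


# Illustrative weights: about 1% of messages start on the self-hosted server.
EXAMPLE_WEIGHTS = {"ses": 49.5, "mailgun": 49.5, "selfhosted": 1.0}
```

Because the self-hosted server always remains in the returned list, it still acts as a last resort when both commercial services fail, even for messages where it was not chosen as the primary.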

The general shape of this change would be roughly similar to the shape of D17384, although it is likely far simpler.

We have yet to actually lose mail in any of these cases. Mail has only ever been delayed by a few hours.

Related work:

Before we pursue this, we should fix remaining bounces to unverified addresses, in T12491.
The patch in D17831 is not correct. In particular, user realnames may contain quotes, and the new syntax won't escape them correctly. In fact, your real name can be Hector" <xbox@example.gov>(, which will generate the outbound address "hector (Hector" <xbox@example.gov>" <hector@aol.com>. Who knows how this will be interpreted.
The MailgunAdapter still has the wrong behavior when it composes a "Reply-To" address in addReplyTo() (no quoting).
T12404 (write our own tools for SMTP because PHPMailer managed to write SMTP tools with an RCE vulnerability) is also closely related here.
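To illustrate the quoting problem with realnames described above, here is a small Python sketch. It is not the Phabricator (PHP) code; it only contrasts naive quoting with the escaping done by Python's standard `email.utils.formataddr`. The helper names `naive_address()` and `safe_address()` are made up for the example.

```python
from email.utils import formataddr


def naive_address(display_name: str, email: str) -> str:
    """Wrap the name in double quotes without escaping (the fragile approach)."""
    return f'"{display_name}" <{email}>'


def safe_address(display_name: str, email: str) -> str:
    """Let the standard library quote the name and backslash-escape any
    embedded double quotes or backslashes."""
    return formataddr((display_name, email))


# The hostile real name from the description above.
hostile_name = 'Hector" <xbox@example.gov>('

print(naive_address(hostile_name, "hector@aol.com"))
print(safe_address(hostile_name, "hector@aol.com"))
```

In the naive output, the embedded quote closes the display name early, so `xbox@example.gov` looks like a real address. In the escaped output, the quote is preceded by a backslash and stays inside the quoted display name.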

This would also make monitoring more difficult. The naive thing to monitor is queue size, but if mail has two outbound pathways and one fails, queue size will remain small. We would likely need to separately log and monitor failures per outbound pathway. See also T7338.
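A minimal sketch of per-pathway failure tracking, assuming each send attempt reports which mailer it used and whether it succeeded. The `PathwayStats` class and the 50% threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List


class PathwayStats:
    """Count attempts and failures per outbound mailer, so that one failing
    pathway stays visible even while the queue itself remains small."""

    def __init__(self) -> None:
        self.attempts: Counter = Counter()
        self.failures: Counter = Counter()

    def record(self, mailer_key: str, ok: bool) -> None:
        self.attempts[mailer_key] += 1
        if not ok:
            self.failures[mailer_key] += 1

    def failure_rates(self) -> Dict[str, float]:
        # Only mailers with at least one attempt appear, so no division by zero.
        return {key: self.failures[key] / count
                for key, count in self.attempts.items()}

    def unhealthy(self, threshold: float = 0.5) -> List[str]:
        """Return mailers whose failure rate is at or above the threshold."""
        return sorted(key for key, rate in self.failure_rates().items()
                      if rate >= threshold)
```

If SES rejects every message while Mailgun accepts all of the retries, queue size stays flat, but `unhealthy()` reports `["ses"]`.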

Revisions and Commits

rP Phabricator
	D19052	rP9c8484de3208 Document the STMP port option
	D19043	rP09b446b269f5 Don't run older mail setup checks if "cluster.mailers" is configured
	D19042	rP7d4362690f1c Fix transposed name/email in Mailgun adapter
	D19009	rP19b3fb8863d6 Add a Postmark mail adapter so it can be configured as an outbound mailer
	D19007	rP1f53aa27e459 Add unit tests for mail failover behaviors when multiple mailers are configured
	D19006	rP9947eee182aa Add some test coverage for mailers configuration
	D19005	rP994d2e8e1563 Use "cluster.mailers" if it is configured
	D19004	rP4236952cdbc0 Add a `bin/config set <key> --stdin < value.json` flag to make CLI…
	D19003	rPc868ee9c07d0 Introduce and document a new `cluster.mailers` option for configuring multiple…
	D19002	rP7f2c90fbd12b Prepare for multiple mailers of the same type
	D18998	rP1485debcbda2 Prepare mail transmission to support failover across multiple mailers

Related Objects

Mentioned In
	2018 Week 6 (Early February)
	T10655: Phacility cluster mail deliverability issue
	T13053: Plans: Mail Tags and Failover
	T13037: An attacker gained staff access to Mailgun and was able to read customer API keys
	T12847: A Pathway Towards Private Clusters
Mentioned Here
	D17384: Support multiple fulltext search clusters with 'cluster.search' config
	D17831: Explicitly quote "From" name part when submitting mail to the Mailgun API
	T7338: Build more status tools for monitoring Phacility cluster health
	T7746: Phacility mail is going to "Spam" for some Gmail users
	T8781: Not receiving emails
	T10655: Phacility cluster mail deliverability issue
	T12237: Amazon SES has suspended outbound mail from secure.phabricator.com
	T12404: Implement a first-party SMTP client
	T12450: New Search Configuration Errata
	T12491: Error reply emails which are generated before identifying the sender should no longer be sent, now that the "always require verification" rule is in place

Event Timeline

epriestley created this task. May 5 2017, 3:45 PM

Herald added subscribers: chad, eadler. May 5 2017, 3:45 PM

revi added a subscriber: revi. May 6 2017, 4:35 PM

joshuaspence added a subscriber: joshuaspence. May 11 2017, 12:27 PM

epriestley mentioned this in T12847: A Pathway Towards Private Clusters. Jun 15 2017, 9:38 PM

epriestley mentioned this in T13037: An attacker gained staff access to Mailgun and was able to read customer API keys. Jan 5 2018, 8:20 PM

epriestley mentioned this in T13053: Plans: Mail Tags and Failover. Jan 27 2018, 9:52 PM

epriestley moved this task from Backlog to Stamps/Failover on the Mail board.

epriestley mentioned this in T10655: Phacility cluster mail deliverability issue. Jan 30 2018, 7:14 PM

epriestley added a revision: D18998: Prepare mail transmission to support failover across multiple mailers. Feb 5 2018, 10:44 PM

epriestley added a revision: D19002: Prepare for multiple mailers of the same type. Feb 6 2018, 1:09 PM

epriestley added a revision: D19003: Introduce and document a new `cluster.mailers` option for configuring multiple mailers. Feb 6 2018, 2:06 PM

epriestley added a revision: D19004: Add a `bin/config set <key> --stdin < value.json` flag to make CLI configuration of complex values easier. Feb 6 2018, 2:23 PM

epriestley added a revision: D19005: Use "cluster.mailers" if it is configured. Feb 6 2018, 2:52 PM

epriestley added a revision: D19006: Add some test coverage for mailers configuration. Feb 6 2018, 4:16 PM

epriestley added a revision: D19007: Add unit tests for mail failover behaviors when multiple mailers are configured. Feb 6 2018, 4:43 PM

epriestley updated the task description. Feb 6 2018, 5:21 PM

epriestley added a revision: D19009: Add a Postmark mail adapter so it can be configured as an outbound mailer. Feb 6 2018, 5:44 PM

epriestley added a commit: rP1485debcbda2: Prepare mail transmission to support failover across multiple mailers. Feb 8 2018, 1:49 PM

epriestley added a commit: rP7f2c90fbd12b: Prepare for multiple mailers of the same type. Feb 8 2018, 2:01 PM

epriestley added a commit: rPc868ee9c07d0: Introduce and document a new `cluster.mailers` option for configuring multiple…. Feb 8 2018, 2:08 PM

epriestley added a commit: rP4236952cdbc0: Add a `bin/config set <key> --stdin < value.json` flag to make CLI….

epriestley added a commit: rP994d2e8e1563: Use "cluster.mailers" if it is configured. Feb 8 2018, 2:13 PM

epriestley added a commit: rP9947eee182aa: Add some test coverage for mailers configuration. Feb 8 2018, 2:17 PM

epriestley added a commit: rP1f53aa27e459: Add unit tests for mail failover behaviors when multiple mailers are configured.

epriestley added a commit: rP19b3fb8863d6: Add a Postmark mail adapter so it can be configured as an outbound mailer.

epriestley added a revision: D19042: Fix transposed name/email in Mailgun adapter. Feb 9 2018, 1:27 AM

PhabricatorMailSetupCheck should be removed, but some of the SES stuff should stick around.

epriestley added a commit: rP7d4362690f1c: Fix transposed name/email in Mailgun adapter. Feb 9 2018, 1:49 AM

epriestley added a revision: D19043: Don't run older mail setup checks if "cluster.mailers" is configured. Feb 9 2018, 1:50 AM

epriestley added a commit: rP09b446b269f5: Don't run older mail setup checks if "cluster.mailers" is configured. Feb 9 2018, 1:51 AM

"Moo", said the cow.

We now have Postmark configured as a primary mailer for this domain with inbound (via MX record) and outbound (via cluster.mailers) failover to Mailgun. 📫
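For readers who want to try a similar setup, the sketch below builds a two-mailer value for `cluster.mailers` and prints it as JSON. Treat the field names (`key`, `type`, `options`, `access-token`, `domain`, `api-key`) as assumptions to check against the `cluster.mailers` documentation introduced in D19003. The mailer keys, the domain, and the placeholder secrets are invented for illustration, and whether list order decides which mailer is tried first is also an assumption.

```python
import json
from typing import Any, Dict, List

# Hypothetical configuration: Postmark listed first, Mailgun as the fallback.
EXAMPLE_MAILERS: List[Dict[str, Any]] = [
    {
        "key": "postmark-primary",      # illustrative label
        "type": "postmark",
        "options": {"access-token": "<postmark-token>"},
    },
    {
        "key": "mailgun-fallback",      # illustrative label
        "type": "mailgun",
        "options": {
            "domain": "mail.example.com",
            "api-key": "<mailgun-api-key>",
        },
    },
]


def render_mailers(mailers: List[Dict[str, Any]]) -> str:
    """Serialize the mailer list as pretty-printed JSON."""
    return json.dumps(mailers, indent=2)


print(render_mailers(EXAMPLE_MAILERS))
```

Saving the printed output as `value.json` would allow it to be loaded with the flag added in D19004, in the form `bin/config set cluster.mailers --stdin < value.json`.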

"Quack, quack" said the duck.

(I've muted this task so I shouldn't get mail about it.)

Seems OK:

Screen Shot 2018-02-08 at 6.16.49 PM.png (802×1 px, 113 KB)

epriestley added a revision: D19052: Document the STMP port option. Feb 9 2018, 10:42 PM

epriestley added a commit: rP9c8484de3208: Document the STMP port option. Feb 9 2018, 10:49 PM

This hasn't blown up in 24 hours and is about to promote, so anything else can be handled in followups.

epriestley mentioned this in Unknown Object (Phriction Wiki Document). Feb 10 2018, 1:02 AM

epriestley mentioned this in 2018 Week 6 (Early February). Feb 10 2018, 1:16 AM

