
Support multiple mail delivery services for automatic failover
Closed, Resolved, Public

Description

Roughly once or twice a year, our upstream mailer rejects mail for a few hours. Previously:

  • In April 2015, T7746. Mailgun assigned us a shared IP address that was blacklisted in "SORBS-SPAM". I identified this manually and asked for advice, including whether a dedicated IP was worth pursuing. At the time, our mail volume was so low that we were advised a dedicated IP didn't make sense, but they switched us to a different shared IP. This appeared to resolve the issue.
  • In July 2015, T8781. A small number of instances had mail queued because of operator error.
  • In March 2016, T10655. Mailgun terminated our service without notice or explanation, apparently due to a bug. It was restored after a few hours.
  • In July 2016, some Phacility users reported mail going to "Spam" folders. I manually identified that the shared IP address was now on the "SPAMCANNIBAL" blacklist and engaged Mailgun support for a dedicated IP. The dedicated IP they issued was itself on the Barracuda blacklist, but they resolved that fairly quickly. I believe this only affected a handful of users and didn't make it into a task.
  • In Feb 2017, T12237. SES abruptly terminated our service without notice, in apparent violation of their written policy. This only affected secure.phabricator.com, which was still sending through my personal account. We switched to Mailgun.
  • In May 2017 (yesterday), Mailgun abruptly changed their API and began rejecting mail it had previously accepted, without notice or guidance. Our behavior was legitimately wrong (D17831 approximately corrects it), but their response did not instill great confidence in me.

In other areas of the application, we've moved to improve reliability by allowing installs to list multiple services. Most recently, in T12450, we allowed installs to configure multiple search backends simultaneously so that, for example, queries could fall back to MySQL if ElasticSearch was not available.

We could support that for mail as well: for example, allow mail to be configured to egress through either SES or Mailgun, so that if one failed the other could handle the entire load until we could mail support to get them to turn things back on.
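As a rough illustration of the failover idea, here is a minimal Python sketch (not Phabricator's actual implementation, which is PHP). The names `DeliveryError`, `send_with_failover`, `broken_ses`, and `working_mailgun` are hypothetical and exist only for this example. Each mailer is tried in order, and the first one that accepts the message wins.

```python
from typing import Callable, List, Sequence, Tuple


class DeliveryError(Exception):
    """Hypothetical error raised when a mailer refuses a message."""


# A mailer is modeled as a callable that raises DeliveryError on failure.
Mailer = Callable[[str], None]


def send_with_failover(message: str, mailers: Sequence[Tuple[str, Mailer]]) -> str:
    """Try each (name, mailer) pair in order; return the name that accepted."""
    errors: List[str] = []
    for name, send in mailers:
        try:
            send(message)
            return name
        except DeliveryError as exc:
            # Remember why this pathway failed, then fall through to the next.
            errors.append(f"{name}: {exc}")
    raise DeliveryError("all mailers failed: " + "; ".join(errors))


# Stand-in mailers for demonstration only.
def broken_ses(message: str) -> None:
    raise DeliveryError("service terminated")


def working_mailgun(message: str) -> None:
    return None


accepted_by = send_with_failover(
    "hello", [("ses", broken_ses), ("mailgun", working_mailgun)]
)
```

In this sketch the message stays queued (the final `DeliveryError` propagates) only when every configured pathway has refused it, which matches the "delayed, not lost" behavior described in this task.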

This could also let us run a mailserver ourselves that accepts a small fraction of the traffic (say, 1% of outbound mail). This is likely a very large amount of work, but the stakes would be lower if only a tiny amount of mail egressed through it and it was mostly intended as a backstop if both SES and Mailgun failed simultaneously. It would also let us assess the challenges in a fairly safe way and get a better idea of what we're looking at. If it turns out to be incredibly hard we could just give up, but we currently have to do a lot of monitoring anyway and are subject to arbitrary service terminations and API changes.
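One way to send only a small share of mail through a self-hosted server is a weighted random choice of the first mailer to try, keeping the others as fallbacks. The Python sketch below is an assumption about how that could look, not a description of any shipped behavior; `WEIGHTS` and `choose_mailer_order` are hypothetical names, and the 49.5/49.5/1 split is just the "1%" figure from the text.

```python
import random
from typing import Dict, List

# Hypothetical weights: roughly 1% of mail tries the self-hosted server first.
WEIGHTS: Dict[str, float] = {"ses": 49.5, "mailgun": 49.5, "self-hosted": 1.0}


def choose_mailer_order(weights: Dict[str, float], rng: random.Random) -> List[str]:
    """Pick the first mailer by weight; the remaining mailers act as fallbacks."""
    names = list(weights)
    first = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return [first] + [n for n in names if n != first]


# Passing the random generator in explicitly keeps the choice testable.
order = choose_mailer_order(WEIGHTS, random.Random())
```

Because every mailer still appears somewhere in the returned order, a message that draws the self-hosted server first can still fall back to SES or Mailgun if that server refuses it.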

The general shape of this change would be roughly similar to the shape of D17384, although it is likely far simpler.

We have yet to actually lose mail in any of these cases. Mail has only ever been delayed by a few hours.


Related work:

  • Before we pursue this, we should fix remaining bounces to unverified addresses, in T12491.
  • The patch in D17831 is not correct. In particular, user realnames may contain quotes, and the new syntax won't escape them correctly. In fact, your real name can be Hector" <xbox@example.gov>( which will generate the outbound address "hector (Hector" <xbox@example.gov>" <hector@aol.com>. Who knows how this will be interpreted.
  • The MailgunAdapter still has the wrong behavior when it composes a "Reply-To" address in addReplyTo() (no quoting).
  • T12404 (write our own tools for SMTP because PHPMailer managed to write SMTP tools with an RCE vulnerability) is also closely related here.
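To make the quoting problem with real names concrete, here is a small Python sketch contrasting naive string concatenation with an escaping formatter. It uses Python's standard `email.utils.formataddr` purely as an illustration of correct quoting; Phabricator's adapters are PHP, and `naive_address` and `safe_address` are hypothetical helper names. The "username (Real Name)" display format is assumed from the example above.

```python
from email.utils import formataddr


def naive_address(username: str, realname: str, address: str) -> str:
    """Unsafe: wraps the display name in quotes without escaping anything."""
    return '"%s (%s)" <%s>' % (username, realname, address)


def safe_address(username: str, realname: str, address: str) -> str:
    """Escapes quotes and backslashes in the display name before quoting it."""
    return formataddr(("%s (%s)" % (username, realname), address))


# The hostile real name from the task description.
hostile = 'Hector" <xbox@example.gov>('

unsafe = naive_address("hector", hostile, "hector@aol.com")
safe = safe_address("hector", hostile, "hector@aol.com")
```

In the naive result the embedded quote closes the display name early, so the injected `<xbox@example.gov>` sits outside any quoted string. In the escaped result the quote is backslash-escaped and the whole display name remains a single quoted string.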

This would also make monitoring more difficult. The naive thing to monitor is queue size, but if mail has two outbound pathways and one fails, queue size will remain small. We would likely need to separately log and monitor failures per outbound pathway. See also T7338.
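A minimal sketch of per-pathway failure tracking, in Python, might look like the following. The class name `PathwayStats`, its methods, and the threshold values are all hypothetical choices for this example; the point is only that failures are counted per outbound pathway rather than inferred from queue size.

```python
from collections import Counter
from typing import List


class PathwayStats:
    """Counts delivery attempts and failures separately for each pathway."""

    def __init__(self) -> None:
        self.attempts: Counter = Counter()
        self.failures: Counter = Counter()

    def record(self, pathway: str, ok: bool) -> None:
        self.attempts[pathway] += 1
        if not ok:
            self.failures[pathway] += 1

    def failure_rate(self, pathway: str) -> float:
        attempts = self.attempts[pathway]
        return self.failures[pathway] / attempts if attempts else 0.0

    def unhealthy(self, threshold: float = 0.5, min_attempts: int = 10) -> List[str]:
        """Pathways with enough attempts whose failure rate meets the threshold."""
        return sorted(
            name
            for name, count in self.attempts.items()
            if count >= min_attempts and self.failure_rate(name) >= threshold
        )


stats = PathwayStats()
for _ in range(10):
    stats.record("ses", ok=False)      # SES is rejecting everything
    stats.record("mailgun", ok=True)   # Mailgun absorbs the load
```

Here the queue would stay small because Mailgun accepts every message, yet `stats.unhealthy()` still flags the SES pathway.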

Event Timeline

  • PhabricatorMailSetupCheck should be removed, but some of the SES stuff should stick around.

We now have Postmark configured as a primary mailer for this domain with inbound (via MX record) and outbound (via cluster.mailers) failover to Mailgun. 📫

(I've muted this task so I shouldn't get mail about it.)

epriestley claimed this task.

This hasn't blown up in 24 hours and is about to promote, so anything else can be handled in followups.

epriestley mentioned this in Unknown Object (Phriction Wiki Document). Feb 10 2018, 1:02 AM