Phacility cluster mail deliverability issue
Closed, Resolved (Public)

Description

Earlier today, our current upstream mail provider suddenly disabled our account without notice or explanation. As a result, outbound mail is no longer being delivered. We've followed up with them (at 11:06 AM PST) but haven't received a reply yet.

In the meantime, I'm looking at transitioning outbound mail to a different provider (say, one that doesn't terminate service for long-term paying customers without contacting them). We didn't anticipate our provider just turning us off without warning, so we didn't have a redundant provider ready. Amazon SES is the most promising alternative, but its default service limits are too low to accommodate our current outbound volume. I submitted a support request to raise these limits (at 11:49 AM PST) but haven't received a reply yet.

So we're waiting for either provider to get back to us and give us an outbound pathway. Outbound mail is currently queued in Phabricator, and will all be delivered once an outbound pathway is available.

There are minor technical barriers to delivering via SES in the cluster. I'm working to remove these while we wait for a reply so we can transition to SES immediately if that pathway unblocks first.

Event Timeline

Our original upstream provider enabled the account again, also without explanation. Outbound queues are flushing now, but will take some time to drain completely because of backoff behavior in the queue: messages that failed repeatedly during the outage are not retried until their next scheduled attempt.
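
To illustrate why a queue with retry backoff does not flush the moment the provider comes back, here is a rough sketch of a generic exponential backoff schedule. This is not Phabricator's actual queue logic: the function names, the 60-second base delay, and the one-hour cap are all assumptions chosen for illustration.

```python
# Illustrative only: a generic exponential backoff schedule.
# BASE_DELAY and MAX_DELAY are hypothetical values, not Phabricator's.
BASE_DELAY = 60      # seconds to wait after the first failure (assumed)
MAX_DELAY = 3600     # upper bound on any single wait (assumed)


def next_retry_delay(failures, base=BASE_DELAY, cap=MAX_DELAY):
    """Seconds to wait after `failures` consecutive failed attempts."""
    return min(cap, base * 2 ** (failures - 1))


def retry_times(outage_seconds, base=BASE_DELAY, cap=MAX_DELAY):
    """Attempt times (seconds after the first attempt) for a message first
    tried at t=0, up to and including the first attempt that lands at or
    after the end of the outage."""
    t = 0
    failures = 0
    attempts = [0]
    while t < outage_seconds:
        # This attempt happened during the outage, so it failed.
        failures += 1
        t += next_retry_delay(failures, base, cap)
        attempts.append(t)
    return attempts


def delivery_lag(outage_seconds, base=BASE_DELAY, cap=MAX_DELAY):
    """How long after the outage ends the message is finally attempted."""
    return retry_times(outage_seconds, base, cap)[-1] - outage_seconds


if __name__ == "__main__":
    # A message queued at the start of a 10-minute outage.
    print(retry_times(600))
    print(delivery_lag(600))
```

Under these assumed numbers, a message first attempted at the start of a 10-minute outage is next retried 15 minutes after it was queued, so it goes out about 5 minutes after service is restored. Messages with more accumulated failures wait longer, which is why the queue drains gradually.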

I'm pursuing an explanation with our provider about the root cause of this issue.

I've issued all instances a 24-hour credit for the service disruption. This will be reflected on the next invoice you receive.

epriestley lowered the priority of this task from Unbreak Now! to Normal. (Mar 23 2016, 10:20 PM)

The upstream provider gave me a not-quite-English canned response:

"The reason for the disablement was due to our automated system seeing emails with spam fingerprints on them for your sending and receiving. We had whitelisted the domain fingerprint to prevent this exact issue moving forward."

As for not being notified that our account was disabled:

"Our Notifications are currently stuck in a bug of not getting sent out, we are working on this at this time."

I'm not particularly thrilled with this response and plan to move us to SES.

I requested SES limits of about 2X what we currently need, and AWS granted about 10X, so things look good so far on that front.

I'm going to roll this forward into T12677 since SES managed to one-up this by a healthy margin in T12237.