
Not receiving emails
Closed, Resolved, Public


Our Phab server went down yesterday, and since it came back up no one has been receiving notification emails.



Event Timeline

scottybollinger set the priority of this task to Needs Triage.
scottybollinger updated the task description.
scottybollinger added a subscriber: scottybollinger.

I think the immediate issue is now resolved. I'll look at this in more detail and post a follow-up a little later.

Thanks! We are receiving emails now.

epriestley claimed this task.

This looks like it was primarily caused by operator error. I stopped some services on the paired repo host in order to recover db001 in connection with T8764 yesterday. This worked, and let me revive the host immediately rather than needing to restart it, but I then failed to completely restart services later.

Because we cache setup issue state and I didn't cycle any hosts in the web tier, this also didn't get caught by setup warnings in spot testing of instances.

The short-term fix was just to cycle the host properly, which restored services.

I think the general longer-term fix here is more granular monitoring of instance state, discussed in T7338, to make this kind of error impossible to miss. It's desirable that I was able to intervene and perform a nonstandard service termination, but monitoring should have made it clear that services hadn't fully cycled and needed to. I'll make some notes there about specific things which would have been helpful in detecting this during the incident response.
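As a rough illustration of the kind of check that would have flagged this, here is a minimal sketch that compares the services each host is expected to run against the ones actually running. It is not how T7338 proposes to do it. The host name `repo001`, the service names, and the `find_incomplete_cycles` helper are all hypothetical, and the `running` data is assumed to come from whatever per-host status source is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostReport:
    host: str
    missing: frozenset  # expected services that are not running


def find_incomplete_cycles(expected, running):
    """Return one report per host whose running services don't cover the expected set.

    Both arguments map a host name to an iterable of service names.
    A host absent from `running` is treated as running nothing.
    """
    reports = []
    for host in sorted(expected):
        missing = frozenset(expected[host]) - frozenset(running.get(host, ()))
        if missing:
            reports.append(HostReport(host, missing))
    return reports


if __name__ == "__main__":
    # Hypothetical data: service names are illustrative only.
    expected = {"repo001": {"daemons", "sshd", "httpd"}, "db001": {"mysqld"}}
    running = {"repo001": {"sshd"}, "db001": {"mysqld"}}
    for report in find_incomplete_cycles(expected, running):
        print(f"{report.host}: not running: {', '.join(sorted(report.missing))}")
```

Running a check like this on a schedule, rather than only at setup time, would avoid depending on cached setup issue state to surface a partially cycled host.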

Not directly related, but a more formal support queue via Nuance (T8783) with SMS integration (T920) would also have helped here, since I was available but not at my desk for the 35 minutes between the report and my response. These are on the roadmap (and SMS pretty much works right now, so maybe we can get that hooked up sooner) but probably at least a month or two out.