**Summary**: Between May 13th at 6PM PST [1] and May 14th at 6PM PST, the bastion host (which allows administrative access to the cluster) stopped responding. The `admin` tier also connects through the bastion in order to perform tasks on cluster hosts (backups, restarts, other provisioning). These tasks hung in the queue indefinitely, failing to complete and blocking some other, otherwise completable tasks (sending invite/reset email) behind them. Access to instances was not affected, except that users were unable to perform password resets.
To resolve this issue, I provisioned a new bastion host and restarted the daemons. This worked as designed, and the task queue completed successfully once the issue was resolved.
**Problems**: Briefly:
- (T8204) The bastion host shouldn't have stopped responding.
- (T8206) Provisioning a new bastion host took way too long (about 55 minutes).
- (T8205) We should see if we can reduce the impact of the bastion hanging.
**Bastion Host Stopped**: See T8204. The bastion was a `micro` instance, and this is now the third (maybe fourth?) micro instance I've seen that abruptly stopped responding to any traffic. In particular, the server accepts connections and then hangs indefinitely. From the AWS console, these hosts continue to report that they're healthy/reachable, have no CPU usage, have plenty of free CPU credits, etc, so I don't believe this is a throttling issue (and my expectation is that throttling degrades performance, rather than stopping the instance entirely). The bastion host had no appreciable workload.
Restarting the instance from the console also failed, although the console started reporting a health check failure some time after I performed the restart.
I've found some other references to similar problems by Googling, but nothing as concrete or high-impact as what I've experienced -- I think I have yet to have a `micro` instance make it for more than about 6 months with any interesting or useful workload without running into an abrupt total instance failure.
Since these failures aren't predictable or reproducible and the hosts aren't reachable after a failure occurs, I don't know how we'd go about fixing, anticipating or preventing them.
To move forward, I plan to stop using `micro` hosts in any role in the cluster. We currently have `micro` hosts deployed in `bastion` and `vault` roles (both of these roles have no meaningful workload and just proxy SSH connections).
I'll move the `bastion` to less-haunted hardware immediately and move `vault` once I'm confident the other changes here are stable.
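Because the AWS console kept reporting these hosts as healthy, one way to detect this failure mode is a probe that goes one step past "the port accepts connections" and waits for the SSH identification string. The following is a minimal sketch, not something deployed in the cluster: the function name, status strings, and timeout values are all hypothetical, and it assumes the server sends its `SSH-` banner as the first data on the connection (which stock OpenSSH does, as far as I know). I also can't confirm it would have caught this particular failure, since the failed hosts can't be examined.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Classify an SSH endpoint as 'ok', 'hung', 'unexpected',
    'connect-timeout', or 'unreachable'.

    'hung' means the TCP connection was accepted but no data arrived
    within `timeout` seconds, i.e. "accepts connections and then hangs".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect-timeout"
    except OSError:
        return "unreachable"

    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            return "hung"
        except OSError:
            return "unreachable"

    # Assumption: the identification string is the first thing sent.
    if data.startswith(b"SSH-"):
        return "ok"
    return "unexpected"
```

A monitoring job could call `probe_ssh_banner("bastion.example.com")` (a placeholder hostname) on a schedule and alert on anything other than `"ok"`.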
**Provisioning**: See T8206. It took me about 55 minutes to get from identifying the bastion failure to resolving the issue. Some problems I encountered along the way:
- Bastion account provisioning is not a stateful effect of deployment. This was the major issue, and meant that I spent a lot of time rebuilding accounts on the new bastion. Currently, access to the bastion is authorized per-account with `bin/remote authorize`. In retrospect, this is an architectural mistake; instead, the authorized account state should be encoded in the deployment process and newly deployed bastions should come up with the right authorizations. We're consistent about this approach with most other aspects of cluster deployment.
- Because I knew provisioning was not stateful for account access and redeployment might be involved, I wasted time trying to revive the bastion. I should have started cycling it immediately once I had good confidence that the host was the issue. Generally, I think it's right to have a strong bias toward swapping hardware, and we should work to remove or avoid concerns which make this course of action less attractive.
- Bastion deployment encountered a minor issue with the interaction between deployment steps. This should have been caught and resolved at an earlier time, not during a cluster incident.
- Swapping bastions relies on two DNS records which had 300s TTLs. One was inconvenient to rebind.
- Doing this kind of work in the EC2 console is slow, particularly because opening two windows at the same time (one with Route 53 and one with EC2, so you can copy IP addresses from one to the other) logs you out.
To resolve these issues, I plan to: make authorization part of deployment state; change deployment procedure to deploy the bastion regularly (we should automatically deploy all services regularly soon, but this may be premature); and adjust DNS to better anticipate this situation.
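As an illustration of "authorization as deployment state", the sketch below renders an `authorized_keys` file deterministically from a declarative account list, so a freshly deployed bastion would come up with the same authorizations as the one it replaces. This is only a sketch under assumptions: the `accounts` data shape, the function name, and the idea of a single generated `authorized_keys` file are hypothetical, and it says nothing about how `bin/remote authorize` stores authorizations today.

```python
def render_authorized_keys(accounts):
    """Render authorized_keys content from a mapping of
    account name -> list of public key strings.

    Output is sorted and de-duplicated so that redeploying with the
    same input always produces byte-identical content.
    """
    lines = ["# Generated by deployment; do not edit by hand."]
    for name in sorted(accounts):
        # Lines starting with '#' are comments in authorized_keys.
        lines.append(f"# account: {name}")
        for key in sorted({k.strip() for k in accounts[name]}):
            lines.append(key)
    return "\n".join(lines) + "\n"


# Hypothetical input; the key material here is placeholder text.
example = {
    "bob": ["ssh-ed25519 AAAA_B bob@laptop"],
    "alice": ["ssh-ed25519 AAAA_A2", "ssh-ed25519 AAAA_A1"],
}
print(render_authorized_keys(example), end="")
```

Keeping the rendering step a pure function of checked-in state means that swapping the bastion needs no manual account rebuilding, which removes the main reason to hesitate before cycling hardware.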
**Impact Reduction**: See T8205. The major issue here is that `backup` tasks (and other similar tasks which route through the bastion) don't have any timeouts, so they gummed up the queue when they started hanging indefinitely.
- We could put some time limit on them, but eventually large instances may legitimately take a long time to back up.
- We could try using the SSH `ConnectTimeout`, but I'm not sure this would have fixed things (hosts in this failure state might be accepting connections and then hanging indefinitely, so they'd never hit the `ConnectTimeout`). I'm not sure I can reproduce the issue in order to test it, either.
- We could require active health checks during long-running commands, but this is a bit complicated.
- We can generally increase queue visibility.
I don't have a specific plan for what I want to do here yet and need to look at the options a bit more.
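One middle ground between a fixed time limit and full health checks is an inactivity timeout: let a command run as long as it keeps producing output, and kill it only once it goes silent. The sketch below is one possible shape for that, not a decision. The function name, return values, and polling interval are hypothetical, and it assumes the wrapped command prints progress regularly; a legitimately slow but silent backup would be killed by it.

```python
import subprocess
import threading
import time


def run_with_idle_timeout(cmd, idle_timeout, hard_timeout=None):
    """Run `cmd`, killing it if it prints nothing for `idle_timeout`
    seconds, or (optionally) if it runs longer than `hard_timeout`.

    Returns (returncode, output_lines, reason), where reason is
    'exited', 'idle', or 'hard'.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    last_output = [time.monotonic()]

    def reader():
        # Every line of output counts as a sign of life.
        for line in proc.stdout:
            lines.append(line)
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    start = time.monotonic()
    reason = "exited"
    while proc.poll() is None:
        now = time.monotonic()
        if now - last_output[0] > idle_timeout:
            reason = "idle"
        elif hard_timeout is not None and now - start > hard_timeout:
            reason = "hard"
        if reason != "exited":
            proc.kill()
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=1.0)
    return proc.returncode, lines, reason
```

A queue worker could wrap its SSH invocation in this function and mark the task as failed (and retryable) when the reason is `"idle"`, so that one hung host doesn't hold up unrelated tasks such as email.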
[1] It is likely that this failure was much closer to the end of the range than the beginning; it's just somewhat hard to pin down from logs. See T7338 for general monitoring.