Apr 20 2022
There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while it was actively maintained.)
This hasn't caused any more problems in about four years, so I'm not too worried about it.
Dec 19 2021
After T13630:
Feb 26 2021
Enable the SSH and HTTP application logs on the web, repo and admin tiers.
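For reference, a minimal sketch of what enabling these looks like with stock Phabricator tooling (the log paths are placeholders, not actual production values):

```
# Enable the HTTP access log and the SSH protocol log on a host in the
# relevant tier. Paths are placeholders; use whatever the tier's logging
# convention is.
./bin/config set log.access.path '/var/log/phabricator/access.log'
./bin/config set log.ssh.path '/var/log/phabricator/ssh.log'
```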
May 26 2020
Continued in T13542.
The major offender here (services per instance) was fixed by updating caching, and I destroyed all the old services. This is perhaps spiritually continued in T13542.
Continued in T13542.
Continued in T13542. I wrote a Terraform/CloudFormation-style service in PHP over the last couple of days.
Continued in T13542.
Feb 3 2020
Both of these hosts restarted cleanly.
AWS is also rebooting web007.
Jan 30 2020
Jan 15 2020
This went through cleanly.
Nov 26 2019
There is no way to bin/host query against the set of instances using a particular repository shard service.
Nov 25 2019
Update Almanac definitions for all instances not on the paired db023 shard.
PHI1566 is resolved narrowly. These cleanup steps still need to happen.
(Note that updating addresses with bin/host query leaves the service address cache dirty; this is the "mutable structure cache" read via PhabricatorRepository->getAlmanacServiceRefs(). The update should be followed with bin/cache purge --caches general.)
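As a concrete sketch of that sequence (the bin/host query arguments are elided, since they depend on the specific address change being made):

```
# Update the service addresses (arguments elided; they depend on the change),
# then flush the general cache so the stale Almanac service refs in the
# mutable structure cache get rebuilt.
./bin/host query ...
./bin/cache purge --caches general
```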
I'll flesh this out more later, but the move away from db123 = repo123 shard pairing, plus bin/host query using mysql, makes it difficult to directly query instances using a particular repository service.
A minor issue, arising from improved validation elsewhere, that should be looked at during service sync:
I'm deploying the new host now. We just crossed a release so I'm going to manually restore it to 72f82abe07 once it comes up (see also T13359). Then, I'll resynchronize instance services for active instances.
Instance termination completed after about 20 minutes and all the volumes detached. Since the original instance can be recycled, I'm going to reattach and restart it, and throw away the replacement host.
Normal volume detachment is just spinning, which isn't exactly surprising. I'm going to give it a few minutes and then force detachment.
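For reference, the force detach is roughly this (the volume ID is a placeholder):

```
# Force the detachment if the normal detach keeps spinning (placeholder
# volume ID). A forced detach risks data loss if the volume is still in
# active use, so give the normal detach a few minutes first.
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
```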
To deal with this narrowly, I'm going to:
Aug 11 2019
Aug 1 2019
Jul 30 2019
Jul 24 2019
Adjusting log_warnings = 2 in production (to get connection aborts into the error log) is also possibly desirable. However, the background level of aborted connections (general network flakiness, server restarts during deploys, wait_timeout on very long-running daemons?) may be high enough that this is more noise than signal.
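If we do this, the change itself is small; a sketch, assuming MySQL 5.x (where log_warnings is the relevant variable; 8.0 replaced it with log_error_verbosity):

```
# Raise log_warnings so aborted connections show up in the error log.
# This variable is dynamic; persist it under [mysqld] in my.cnf so it
# survives restarts.
mysql -e "SET GLOBAL log_warnings = 2;"
mysql -e "SHOW VARIABLES LIKE 'log_warnings';"
```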
Jul 23 2019
We could also consider these things:
So the actual actionable items here are:
Jul 22 2019
Bumping max_allowed_packet to 1G in the server config resolved things. The export process then spent a long time doing a bin/files migration (which could use a progress bar, maybe) and is now doing a dump (which could too, although I'm less sure of how we'd build one).
```
190722 18:55:55 [Warning] Aborted connection 6 to db: '<instance>_differential' user: 'root' host: 'localhost' (Got a packet bigger than 'max_allowed_packet' bytes)
```
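For reference, the server-side change amounts to this (shown as a live adjustment; the persistent version is the same value under [mysqld] in the server config):

```
# Allow packets up to 1G so the oversized rows can be imported. SET GLOBAL
# only affects new connections; set max_allowed_packet = 1G under [mysqld]
# to make it permanent.
mysql -e "SET GLOBAL max_allowed_packet = 1073741824;"
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
```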
I adjusted innodb_log_file_size to 1GB and attempted the import again, but ran into the same issue.
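For reference, innodb_log_file_size is not a dynamic variable, so the adjustment is a config edit plus a clean restart; roughly:

```
# Check the current redo log size, then set innodb_log_file_size = 1G under
# [mysqld] and restart cleanly. Modern MySQL/MariaDB rebuild the redo logs
# at the new size on startup; older versions need the old ib_logfile* removed
# first. Service name varies by host setup.
mysql -e "SHOW VARIABLES LIKE 'innodb_log_file_size';"
sudo systemctl restart mysql
```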
The "age of the last checkpoint" error appears to primarily implicate innodb_log_file_size, which is currently set to the default value (5MB):
I'll also double check wait_timeout and interactive_timeout...
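(Checking those is just:)

```
# Quick look at the connection timeouts while in here.
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout', 'interactive_timeout');"
```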
Aha! The MySQL error log actually appears to have something useful: