There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while actively maintained.)
Apr 20 2022
This hasn't caused any more problems in like 4 years, so I guess it's kind of whatever.
Dec 19 2021
After T13630:
Feb 26 2021
Enable the SSH and HTTP application logs on the web, repo and admin tiers.
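As a minimal sketch of what enabling these might look like, assuming the standard log.ssh.path / log.access.path options are what's meant here; the log paths are illustrative:

  # Run from the phabricator/ directory on each web, repo, and admin host.
  ./bin/config set log.ssh.path /var/log/phabricator/ssh.log
  ./bin/config set log.access.path /var/log/phabricator/access.log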
May 26 2020
Continued in T13542.
The major offender here (services per instance) was fixed by updating caching, and I destroyed all the old services. This is perhaps spiritually continued in T13542.
Continued in T13542.
Continued in T13542. I wrote a Terraform/CloudFormation-style service in PHP over the last couple of days.
Continued in T13542.
Feb 3 2020
Both of these hosts restarted cleanly.
AWS is also rebooting web007.
Jan 30 2020
Jan 15 2020
This went through cleanly.
Nov 26 2019
there is no way to bin/host query against the set of instances using a particular repository shard service
Nov 25 2019
Update Almanac definitions for all instances not on the paired db023 shard.
PHI1566 is resolved narrowly. These cleanup steps still need to happen.
(Updating addresses with bin/host query leaves the service address cache dirty (the "mutable structure cache" via PhabricatorRepository->getAlmanacServiceRefs()) so it should be followed with bin/cache purge --caches general.)
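A sketch of that cleanup sequence (the address update itself is elided since it's instance-specific; the install path and hostnames here are assumptions):

  # After updating service addresses, purge the general cache on affected hosts
  # so stale Almanac service refs aren't read back out of the structure cache.
  # Install path and hostnames are assumptions; adjust to the real layout.
  for host in repo001.phacility.net db018.phacility.net; do
    ssh "$host" '/core/lib/phabricator/bin/cache purge --caches general'
  done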
I'll flesh this out more later, but the move away from db123 = repo123 shard pairing, plus bin/host query using mysql, makes it difficult to directly query instances using a particular repository service.
Minor issue (arising from improved validation elsewhere) that should be looked at during service sync:
I'm deploying the new host now. We just crossed a release so I'm going to manually restore it to 72f82abe07 once it comes up (see also T13359). Then, I'll resynchronize instance services for active instances.
Instance termination completed after about 20 minutes and all the volumes detached. Since the original instance can be recycled, I'm going to reattach and restart it, and throw away the replacement host.
Normal volume detachment is just spinning, which isn't exactly surprising. I'm going to give it a few minutes and then force detachment.
To deal with this narrowly, I'm going to:
Aug 11 2019
Aug 1 2019
Jul 30 2019
Jul 24 2019
Adjusting log_warnings = 2 in production (to get connection aborts into the error log) is also possibly desirable, although the background level of aborted connections (general network flakiness, server restarts during deploy, wait_timeout on very long-running daemons?) may be high enough that this is more noise than signal.
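A sketch of how that adjustment could be applied, assuming a MySQL 5.x-era server where log_warnings is still a dynamic variable:

  # Raise the warning log level at runtime so aborted connections are logged:
  mysql -e "SET GLOBAL log_warnings = 2;"

  # Persist it across restarts by adding it to the server config, e.g.:
  #   [mysqld]
  #   log_warnings = 2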
Jul 23 2019
We could also consider these things:
So the actual actionable stuff here is:
Jul 22 2019
Bumping max_allowed_packet to 1G in the server config resolved things. The export process then spent a long time doing a bin/files migration (which could use a progress bar, maybe) and is now doing a dump (which could too, although I'm less sure of how we'd build one).
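Pulling the two server-side adjustments from these comments together, a sketch (values match the ones mentioned above):

  # Check the current values:
  mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet'; SHOW VARIABLES LIKE 'innodb_log_file_size';"

  # max_allowed_packet can be raised at runtime (1073741824 bytes = 1G):
  mysql -e "SET GLOBAL max_allowed_packet = 1073741824;"

  # innodb_log_file_size is not dynamic: set both values in the server config
  # ([mysqld] section: max_allowed_packet = 1G, innodb_log_file_size = 1G) and
  # restart mysqld. Older MySQL versions also require removing the stale
  # ib_logfile* files after a clean shutdown before the new size takes effect.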
190722 18:55:55 [Warning] Aborted connection 6 to db: '<instance>_differential' user: 'root' host: 'localhost' (Got a packet bigger than 'max_allowed_packet' bytes)
I adjusted innodb_log_file_size to 1GB and attempted the import again, but ran into the same issue.
The "age of the last checkpoint" error appears to primarily implicate innodb_log_file_size, which is currently set to the default value (5MB):
I'll also double check wait_timeout and interactive_timeout...
Aha! The MySQL error log actually appears to have something useful:
Run it with source ...;
Unzip the dump before running it.
Look at the unzipped dump and see if line 13935 is bad in some obvious way.
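Concretely, a sketch of those steps with a hypothetical dump filename:

  gunzip backup.sql.gz            # unzip the dump before running it
  sed -n '13935p' backup.sql      # inspect the line the error points at

  # Then run it from a mysql shell with "source" rather than piping it in:
  #   mysql> source backup.sql;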
(Whatever the resolution is here might also motivate tailoring our restore/import instructions, since this error is pretty opaque and the next steps aren't obvious.)
(Internally, see also PHI1329.)
Jul 17 2019
Mar 1 2019
Another possible approach is to use -o LogLevel=ERROR. This gets us into trouble if there are useful INFO messages other than "permanently added X to list of known hosts", but presumably all the important stuff is raised at ERROR or better.
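For example, a sketch of that invocation (user, host, and command are hypothetical):

  # Suppress routine INFO-level chatter like "Permanently added ... to the
  # list of known hosts" while still surfacing real errors:
  ssh -o LogLevel=ERROR builder@build123.phacility.net true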
For Drydock/Harbormaster, the only real way forward I see here is:
Jan 30 2019
Success! D20046 worked to fix the "profiler not sticking across form posts" issue on secure. 🐈
Jan 28 2019
Yeah, I think the issue is:
The "keep the profiler on across form submissions" code isn't working on secure.phabricator.com, even though it works locally and __profile__=page appears on the "Request" tab.
Nov 3 2018
I think everything here is now fully cycled, synchronized, and cleaned up.
Taking care of these now. I expect everything to be pretty routine.
Oct 22 2018
Plus: db018.phacility.net, repo001.phacility.net, db024.phacility.net.
Oct 19 2018
One more of these just came in for repo003.
Oct 8 2018
I think this is all done but want to let things run against bastion007 for a bit before I tear down bastion005.
I also needed to copy the old master.key from bastion005 to bastion007 in /core/lib/keystore/.
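A sketch of that copy using scp's remote-to-remote mode (the -3 flag routes the transfer through the local control host; the fully-qualified hostnames are assumptions):

  scp -3 bastion005.phacility.net:/core/lib/keystore/master.key \
         bastion007.phacility.net:/core/lib/keystore/master.key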
I turned bastion.phacility.net and bastion-external.phacility.net into CNAME records and pointed them at the new bastions.
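To spot-check that the new records are live (a sketch):

  dig +short CNAME bastion.phacility.net
  dig +short CNAME bastion-external.phacility.net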
There's a minor deadlock on bastion deployment with the current scripts: during deploy, we run deploy-key to copy the deploy key from the bastion to the target host, so that we don't need to put the entire keystore on normal cluster nodes, and so that we don't need to have the keystore on the control host (staff laptop) outside the cluster.
Oct 6 2018
I cycled all the hosts except bastion. saux001 needs to be vetted a bit (it handles "Land Revision" from the web UI) but it isn't critical if it needs a bit more work.
I'm going to get these underway once the deploy finishes.
Oct 1 2018
"Use the API" seemed to work OK. Of those instances, only bastion005 is at all unusual.
Sep 11 2018
This is now live.
Deploying the changes to web now.
Sep 10 2018
I've issued all instances a 24-hour service credit for the disruption. This should be reflected on your next invoice.
Here's the request rate leading up to the rate limiting:
D19653 (above) changes the per-"Host" rate limit to require "X-Forwarded-For" be present in the request. This should exempt ELB requests from these limits.
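To illustrate the distinction (host, path, and addresses here are purely illustrative): an ELB health check hits the web host directly and carries no "X-Forwarded-For", so it is no longer counted, while a proxied client request does carry the header and is:

  # Direct health-check-style request: no X-Forwarded-For, exempt from the limit.
  curl -sS http://web007.phacility.net/status/

  # Proxied client-style request: header present, counted against the limit.
  curl -sS -H 'X-Forwarded-For: 203.0.113.10' http://web007.phacility.net/status/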
... [Mon Sep 10 20:48:43.928021 2018] [:error] [pid 21570] [client 172.30.0.171:16516] Array\n(\n [f] => \n [h] => 172.30.0.60\n)\n ...
... in production today as a next step.
This should have the pleasant side effect of letting us drop the goofy hard-coded internal rate limiting IP list.
There are four rate limits, and I don't currently have enough information to figure out which one triggered. The rate limits are:
Sep 5 2018
Aug 25 2018
That one seemed straightforward.
Doing admin001 now.
Kicking secure001 now.
(The fact that it isn't covered is itself covered by T12879.)
I think the only thing on secure or admin which isn't properly covered by deploy automation is the crontab on secure001:
I'm going to do admin001 and secure001 today.
Aug 18 2018
Think I got through the easy ones without any issues. I suspect admin and secure may be a little more involved so I'm going to leave the cat in the bag for the moment.