Apr 20 2022
There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while it was actively maintained.)
This hasn't caused any more problems in about four years, so I'm not too worried about it.
Dec 19 2021
After T13630:
Feb 26 2021
Enable the SSH and HTTP application logs on the web, repo and admin tiers.
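For reference, a minimal sketch of what enabling these looks like with stock Phabricator tooling (the log paths are placeholders, not actual production values):

```
# Enable the HTTP access log and the SSH protocol log on a host in the
# relevant tier. Paths are placeholders; use whatever the tier's logging
# convention is.
./bin/config set log.access.path '/var/log/phabricator/access.log'
./bin/config set log.ssh.path '/var/log/phabricator/ssh.log'
```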
May 26 2020
Continued in T13542.
The major offender here (services per instance) was fixed by updating caching, and I destroyed all the old services. This is perhaps spiritually continued in T13542.
Continued in T13542.
Continued in T13542. I wrote a Terraform/CloudFormation-style service in PHP over the last couple of days.
Continued in T13542.
Feb 3 2020
Both of these hosts restarted cleanly.
AWS is also rebooting web007.
Jan 30 2020
Jan 15 2020
This went through cleanly.
Nov 26 2019
There is no way to bin/host query against the set of instances using a particular repository shard service.
Nov 25 2019
Update Almanac definitions for all instances not on the paired db023 shard.
PHI1566 is resolved narrowly. These cleanup steps still need to happen.
(Note that updating addresses with bin/host query leaves the service address cache dirty; this is the "mutable structure cache" read via PhabricatorRepository->getAlmanacServiceRefs(). The update should be followed with bin/cache purge --caches general.)
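As a concrete sketch of that sequence (the bin/host query arguments are elided, since they depend on the specific address change being made):

```
# Update the service addresses (arguments elided; they depend on the change),
# then flush the general cache so the stale Almanac service refs in the
# mutable structure cache get rebuilt.
./bin/host query ...
./bin/cache purge --caches general
```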
I'll flesh this out more later, but the move away from db123 = repo123 shard pairing, plus bin/host query using mysql, makes it difficult to directly query instances using a particular repository service.
A minor issue, arising from improved validation elsewhere, that should be looked at during service sync:
I'm deploying the new host now. We just crossed a release so I'm going to manually restore it to 72f82abe07 once it comes up (see also T13359). Then, I'll resynchronize instance services for active instances.
Instance termination completed after about 20 minutes and all the volumes detached. Since the original instance can be recycled, I'm going to reattach and restart it, and throw away the replacement host.
Normal volume detachment is just spinning, which isn't exactly surprising. I'm going to give it a few minutes and then force detachment.
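For reference, the force detach is roughly this (the volume ID is a placeholder):

```
# Force the detachment if the normal detach keeps spinning (placeholder
# volume ID). A forced detach risks data loss if the volume is still in
# active use, so give the normal detach a few minutes first.
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
```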
To deal with this narrowly, I'm going to:
Aug 11 2019
Aug 1 2019
Jul 30 2019
Jul 24 2019
Adjusting log_warnings = 2 in production (to get connection aborts into the error log) is also possibly desirable. However, the background level of aborted connections (general network flakiness, server restarts during deploys, wait_timeout on very long-running daemons?) may be high enough that this is more noise than signal.
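If we do this, the change itself is small; a sketch, assuming MySQL 5.x (where log_warnings is the relevant variable; 8.0 replaced it with log_error_verbosity):

```
# Raise log_warnings so aborted connections show up in the error log.
# This variable is dynamic; persist it under [mysqld] in my.cnf so it
# survives restarts.
mysql -e "SET GLOBAL log_warnings = 2;"
mysql -e "SHOW VARIABLES LIKE 'log_warnings';"
```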
Jul 23 2019
We could also consider these things:
So the actual actionable items here are:
Jul 22 2019
Bumping max_allowed_packet to 1G in the server config resolved things. The export process then spent a long time doing a bin/files migration (which could use a progress bar, maybe) and is now doing a dump (which could too, although I'm less sure of how we'd build one).
```
190722 18:55:55 [Warning] Aborted connection 6 to db: '<instance>_differential' user: 'root' host: 'localhost' (Got a packet bigger than 'max_allowed_packet' bytes)
```
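For reference, the server-side change amounts to this (shown as a live adjustment; the persistent version is the same value under [mysqld] in the server config):

```
# Allow packets up to 1G so the oversized rows can be imported. SET GLOBAL
# only affects new connections; set max_allowed_packet = 1G under [mysqld]
# to make it permanent.
mysql -e "SET GLOBAL max_allowed_packet = 1073741824;"
mysql -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
```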
I adjusted innodb_log_file_size to 1GB and attempted the import again, but ran into the same issue.
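For reference, innodb_log_file_size is not a dynamic variable, so the adjustment is a config edit plus a clean restart; roughly:

```
# Check the current redo log size, then set innodb_log_file_size = 1G under
# [mysqld] and restart cleanly. Modern MySQL/MariaDB rebuild the redo logs
# at the new size on startup; older versions need the old ib_logfile* removed
# first. Service name varies by host setup.
mysql -e "SHOW VARIABLES LIKE 'innodb_log_file_size';"
sudo systemctl restart mysql
```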
The "age of the last checkpoint" error appears to primarily implicate innodb_log_file_size, which is currently set to the default value (5MB):
I'll also double check wait_timeout and interactive_timeout...
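(Checking those is just:)

```
# Quick look at the connection timeouts while in here.
mysql -e "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout', 'interactive_timeout');"
```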
Aha! The MySQL error log actually appears to have something useful: