
Ops
Release · Active · Public


Recent Activity

Apr 20 2022

epriestley closed T10847: 30GB Phacility instance caused a series of cascading failures which left web services unreachable as Resolved.

There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while actively maintained.)

Apr 20 2022, 10:43 PM · Ops, Phacility
epriestley closed T12610: Audit behavior of LB healthchecks against *.phacility.com and secure.phabricator.com as Wontfix.

This hasn't caused any more problems in like 4 years, so I guess it's kind of whatever.

Apr 20 2022, 10:30 PM · Ops, Phacility

Dec 19 2021

epriestley closed T12847: A Pathway Towards Private Clusters as Wontfix.

After T13630:

Dec 19 2021, 8:39 PM · Plans, Ops, Phacility

Feb 26 2021

epriestley placed T12611: Write Phabricator HTTP and SSH logs in the production cluster up for grabs.

Enable the SSH and HTTP application logs on the web, repo and admin tiers.

Feb 26 2021, 10:57 PM · Phacility, Ops
epriestley added a revision to T12611: Write Phabricator HTTP and SSH logs in the production cluster: Restricted Differential Revision.
Feb 26 2021, 10:49 PM · Phacility, Ops
epriestley added a revision to T12611: Write Phabricator HTTP and SSH logs in the production cluster: Restricted Differential Revision.
Feb 26 2021, 10:48 PM · Phacility, Ops

May 26 2020

epriestley closed T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction as Resolved.

Continued in T13542.

May 26 2020, 8:13 PM · Plans, Ops, Infrastructure, Phacility
epriestley closed T12801: Simplify Almanac services in the Phacility production cluster as Resolved.

The major offender here (services per instance) was fixed by updating caching, and I destroyed all the old services. This is perhaps spiritually continued in T13542.

May 26 2020, 8:11 PM · Almanac, Ops, Phacility
epriestley closed T12999: Replace cluster magnetic volumes with SSD volumes as Resolved.

Continued in T13542.

May 26 2020, 8:10 PM · Phacility, Ops
epriestley closed T12856: Evaluate various "infrastructure-as-code" products as Resolved.

Continued in T13542. I wrote a Terraform/CloudFormation-style service in PHP over the last couple of days.

May 26 2020, 8:07 PM · Ops, Phacility
epriestley closed T12816: Setup NAT for the primary Phacility cluster as Resolved.

Continued in T13542.

May 26 2020, 8:03 PM · Ops, Phacility

Feb 3 2020

epriestley closed T13483: (2020 Week 5) Restart AWS host db001 (i-3a9b99f2) as Resolved.

Both of these hosts restarted cleanly.

Feb 3 2020, 2:19 PM · Phacility, Ops
epriestley added a comment to T13483: (2020 Week 5) Restart AWS host db001 (i-3a9b99f2).

AWS is also rebooting web007.

Feb 3 2020, 1:22 PM · Phacility, Ops

Jan 30 2020

epriestley triaged T13483: (2020 Week 5) Restart AWS host db001 (i-3a9b99f2) as Normal priority.
Jan 30 2020, 6:36 PM · Phacility, Ops

Jan 15 2020

epriestley closed T13477: (2020 Week 3) Restart AWS host db025 (i-05bc80634586ef7a0) as Resolved.

This went through cleanly.

Jan 15 2020, 4:19 PM · Ops, Phacility
epriestley triaged T13477: (2020 Week 3) Restart AWS host db025 (i-05bc80634586ef7a0) as Low priority.
Jan 15 2020, 2:36 AM · Ops, Phacility

Nov 26 2019

epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

there is no way to bin/host query against the set of instances using a particular repository shard service

Nov 26 2019, 12:14 AM · Phacility, Ops
epriestley added a revision to T13466: AWS instance termination may fail/hang indefinitely: Restricted Differential Revision.
Nov 26 2019, 12:09 AM · Phacility, Ops

Nov 25 2019

epriestley added a revision to T13466: AWS instance termination may fail/hang indefinitely: Restricted Differential Revision.
Nov 25 2019, 11:57 PM · Phacility, Ops
epriestley added a revision to T13466: AWS instance termination may fail/hang indefinitely: Restricted Differential Revision.
Nov 25 2019, 10:46 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

Update Almanac definitions for all instances not on the paired db023 shard.

Nov 25 2019, 5:01 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

PHI1566 is resolved narrowly. These cleanup steps still need to happen.

Nov 25 2019, 4:42 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

(Updating addresses with bin/host query leaves the service address cache dirty (the "mutable structure cache" via PhabricatorRepository->getAlmanacServiceRefs()) so it should be followed with bin/cache purge --caches general.)

Nov 25 2019, 4:29 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

I'll flesh this out more later, but the move away from db123 = repo123 shard pairing, plus bin/host query using mysql makes it difficult to directly query instances using a particular repository service.

Nov 25 2019, 4:15 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

Minor issue that should be looked at during service sync arising from improved validation elsewhere:

Nov 25 2019, 4:06 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

I'm deploying the new host now. We just crossed a release so I'm going to manually restore it to 72f82abe07 once it comes up (see also T13359). Then, I'll resynchronize instance services for active instances.

Nov 25 2019, 3:57 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

Instance termination completed after about 20 minutes and all the volumes detached. Since the original instance can be recycled, I'm going to reattach and restart it, and throw away the replacement host.

Nov 25 2019, 3:51 PM · Phacility, Ops
epriestley added a project to T13466: AWS instance termination may fail/hang indefinitely: Phacility.
Nov 25 2019, 3:48 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

Normal volume detachment is just spinning, which isn't exactly surprising. I'm going to give it a few minutes and then force detachment.

Nov 25 2019, 3:47 PM · Phacility, Ops
epriestley added a comment to T13466: AWS instance termination may fail/hang indefinitely.

To deal with this narrowly, I'm going to:

Nov 25 2019, 3:44 PM · Phacility, Ops
epriestley triaged T13466: AWS instance termination may fail/hang indefinitely as Low priority.
Nov 25 2019, 3:42 PM · Phacility, Ops

Aug 11 2019

epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: Restricted Differential Revision.
Aug 11 2019, 4:19 PM · Ops, Restricted Project, Phacility
epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: Restricted Differential Revision.
Aug 11 2019, 4:10 PM · Ops, Restricted Project, Phacility

Aug 1 2019

epriestley triaged T13359: Phacility deploy workflow should not conflate versions-for-deployment with "latest stable release" as Low priority.
Aug 1 2019, 5:48 PM · Phacility, Ops

Jul 30 2019

epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: Restricted Differential Revision.
Jul 30 2019, 6:20 PM · Ops, Restricted Project, Phacility
epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: Restricted Differential Revision.
Jul 30 2019, 6:18 PM · Ops, Restricted Project, Phacility

Jul 24 2019

epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

Adjusting log_warnings = 2 in production (to get connection aborts into the error log) is also possibly desirable, although the background level of aborted connections (general network flakiness, server restarts during deploy, wait_timeout on very long-running daemons?) may be high enough that this is more noise than signal.

Jul 24 2019, 2:03 PM · Ops, Restricted Project, Phacility

Jul 23 2019

epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

We could also consider these things:

Jul 23 2019, 1:10 PM · Ops, Restricted Project, Phacility
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

So actual actionable stuff here is:

Jul 23 2019, 1:04 PM · Ops, Restricted Project, Phacility
epriestley renamed T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting from During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" (?) to During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.
Jul 23 2019, 12:59 PM · Ops, Restricted Project, Phacility

Jul 22 2019

epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

Bumping max_allowed_packet to 1G in the server config resolved things. The export process then spent a long time doing a bin/files migration (which could use a progress bar, maybe) and is now doing a dump (which could too, although I'm less sure of how we'd build one).

Jul 22 2019, 11:52 PM · Ops, Restricted Project, Phacility
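
As a rough illustration of the fix described in the comment above, the sketch below reads a server-side max_allowed_packet value from option-file text and checks whether a statement of a given size would fit under it. The helper names (parse_size, server_packet_limit, fits) and EXAMPLE_CNF are hypothetical and are not part of Phabricator or MySQL. It assumes the setting lives in a [mysqld] section and uses MySQL's K/M/G suffixes (powers of 1024). It compares raw byte counts only and ignores protocol overhead.

```python
import configparser

# MySQL option files accept K, M, and G suffixes as powers of 1024.
_SUFFIX = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a MySQL-style size such as '1G' or '16M' to bytes."""
    value = value.strip().upper()
    if value and value[-1] in _SUFFIX:
        return int(value[:-1]) * _SUFFIX[value[-1]]
    return int(value)


def server_packet_limit(cnf_text):
    """Read max_allowed_packet from the [mysqld] section of option-file text."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    return parse_size(parser.get("mysqld", "max_allowed_packet"))


def fits(statement_bytes, cnf_text):
    """Return True if a statement of this size is within the server limit."""
    return statement_bytes <= server_packet_limit(cnf_text)


# Hypothetical option-file fragment mirroring the 1G value from the comment.
EXAMPLE_CNF = """
[mysqld]
max_allowed_packet = 1G
"""

if __name__ == "__main__":
    print(server_packet_limit(EXAMPLE_CNF))
    print(fits(64 * 1024 ** 2, EXAMPLE_CNF))
```

The check is deliberately made against the server section: as the task title notes, raising the client setting alone did not prevent the failure.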
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

190722 18:55:55 [Warning] Aborted connection 6 to db: '<instance>_differential' user: 'root' host: 'localhost' (Got a packet bigger than 'max_allowed_packet' bytes)

Jul 22 2019, 6:59 PM · Ops, Restricted Project, Phacility
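
A warning of the shape quoted in the comment above can be picked out of an error log mechanically. The following is a minimal sketch, assuming log lines follow exactly that layout. The ABORTED pattern and the parse_aborted helper are hypothetical names for illustration, and other MySQL versions may format these warnings differently.

```python
import re

# Matches "Aborted connection" warnings laid out like the quoted log line.
ABORTED = re.compile(
    r"Aborted connection (?P<id>\d+) to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)' "
    r"\((?P<reason>.*)\)\s*$"
)


def parse_aborted(line):
    """Return the fields of an aborted-connection warning, or None."""
    match = ABORTED.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["id"] = int(info["id"])
    # Flag the specific cause discussed in this task.
    info["packet_too_large"] = "max_allowed_packet" in info["reason"]
    return info


# The sample is the log line quoted in the comment above.
sample = (
    "190722 18:55:55 [Warning] Aborted connection 6 to db: "
    "'<instance>_differential' user: 'root' host: 'localhost' "
    "(Got a packet bigger than 'max_allowed_packet' bytes)"
)

if __name__ == "__main__":
    print(parse_aborted(sample))
```

Filtering on packet_too_large separates this failure from other aborted connections (timeouts, restarts) that share the same log prefix.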
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

I adjusted innodb_log_file_size to 1GB and attempted the import again, but ran into the same issue.

Jul 22 2019, 6:16 PM · Ops, Restricted Project, Phacility
epriestley renamed T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting from During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" to During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" (?).
Jul 22 2019, 6:13 PM · Ops, Restricted Project, Phacility
epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: D20677: Provide import/restore guidance for "max_allowed_packet" and "innodb_log_file_size".
Jul 22 2019, 5:05 PM · Ops, Restricted Project, Phacility
epriestley added a revision to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting: Restricted Differential Revision.
Jul 22 2019, 4:38 PM · Ops, Restricted Project, Phacility
epriestley renamed T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting from During MySQL import, server may "go away" on large dumps? to During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size".
Jul 22 2019, 4:27 PM · Ops, Restricted Project, Phacility
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

The "age of the last checkpoint" error appears to primarily implicate innodb_log_file_size, which is currently set to the default value (5MB):

Jul 22 2019, 4:26 PM · Ops, Restricted Project, Phacility
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

I'll also double check wait_timeout and interactive_timeout...

Jul 22 2019, 4:01 PM · Ops, Restricted Project, Phacility
epriestley added a comment to T13347: During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.

Aha! The MySQL error log actually appears to have something useful:

Jul 22 2019, 3:55 PM · Ops, Restricted Project, Phacility