Apr 9 2018

amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

In my mind, the only real problem here is...

Apr 9 2018, 6:54 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

Possibly something dumb like "mount an EFS volume on /mnt/logs/ on every host" is another AWS-only approach. That feels really dumb but maybe it's only somewhat dumb?

Apr 9 2018, 6:35 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

I'm fine with CloudWatch logs, I didn't realize it had a logging thing. I'm currently not terribly warm on investing a ton in CloudWatch alarms or log analysis, but if we can get logs reliably streaming into some storage service so they can't fill up the disk I think that's quite valuable on its own. We can always replace that with something we control directly in the future if we want to invest more here. I can't imagine ever really caring about old logs no matter what happens in the future.

Apr 9 2018, 6:30 PM · Plans, Ops, Infrastructure, Phacility
amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

I definitely don't want to send logs to a third-party service.

Apr 9 2018, 6:26 PM · Plans, Ops, Infrastructure, Phacility
amckinley claimed T12857: Temporary directory fullness can cause daemon issues?.
Apr 9 2018, 6:08 PM · Diffusion, Ops, Daemons, Phacility
amckinley claimed T12611: Write Phabricator HTTP and SSH logs in the production cluster.
Apr 9 2018, 6:08 PM · Phacility, Ops
amckinley claimed T12999: Replace cluster magnetic volumes with SSD volumes.
Apr 9 2018, 6:06 PM · Phacility, Ops
amckinley claimed T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.
Apr 9 2018, 6:05 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a comment to T12847: A Pathway Towards Private Clusters.

In the modern era: I think we generally understand what private clusters will look like now, but I'd like to take a much more iterative approach to getting there than we have in the past. I had this concern above (circa June 2017):

Apr 9 2018, 6:03 PM · Plans, Ops, Phacility
amckinley moved T12847: A Pathway Towards Private Clusters from Backlog to Soon on the Plans board.
Apr 9 2018, 5:31 PM · Plans, Ops, Phacility
amckinley added a project to T12847: A Pathway Towards Private Clusters: Plans.
Apr 9 2018, 5:31 PM · Plans, Ops, Phacility
epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19316: Remove dead "Service Lock" code from Almanac.
Apr 9 2018, 5:11 PM · Conduit, Almanac, Ops, Phacility

Mar 30 2018

epriestley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.
Mar 30 2018, 10:57 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

I'd like to analyze data on the $5 tier with an eye to removing it. Anecdotally, this doesn't seem to be doing anything good for us ... It also feels like we have a lot of $5 test instances.

Mar 30 2018, 10:50 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

The UI and allocation logic for shards need tweaking.

Mar 30 2018, 10:26 PM · Plans, Ops, Infrastructure, Phacility
epriestley closed T12830: Disentangle "repoXYZ = dbXYZ" in the Phacility cluster as Resolved by committing Restricted Diffusion Commit.
Mar 30 2018, 10:25 PM · Ops, Phacility
epriestley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.
Mar 30 2018, 10:22 PM · Plans, Ops, Infrastructure, Phacility
epriestley added a revision to T12830: Disentangle "repoXYZ = dbXYZ" in the Phacility cluster: Restricted Differential Revision.
Mar 30 2018, 10:22 PM · Ops, Phacility

Mar 28 2018

epriestley closed T12917: Move domain name registration and SSL to AWS as Resolved.

After clicking 17,000 emails I successfully transferred everything to AWS, with some minor caveats:

Mar 28 2018, 5:04 PM · Ops, Phacility
epriestley added a comment to T12917: Move domain name registration and SSL to AWS.

I moved SSL and registration for phurl.io to AWS in T13113 so this is just a transfer issue now.

Mar 28 2018, 3:28 PM · Ops, Phacility
epriestley added a comment to T13113: phurl.io SSL certificate has expired.

oh wow it actually worked 😱

Mar 28 2018, 3:27 PM · Phacility, Ops
epriestley added a comment to T13113: phurl.io SSL certificate has expired.

AWS also supports DNS-based authorization now, which reduces the need for all the MX juggling.

Mar 28 2018, 3:22 PM · Phacility, Ops
epriestley closed T13113: phurl.io SSL certificate has expired as Resolved.

I was able to MX phurl.io and get an SSL authorization link working. I moved phurl.io SSL to AWS ACM so this (certificate expiration) shouldn't happen again.

Mar 28 2018, 3:19 PM · Phacility, Ops
epriestley added projects to T13113: phurl.io SSL certificate has expired: Ops, Phacility.
Mar 28 2018, 3:01 PM · Phacility, Ops

Mar 19 2018

epriestley closed T12988: Remove flag "--master" from bin/remote as Resolved.

I removed this in rCORE36d2ef5dffe441ba1175e362bb73f0e43a9f70a2.

Mar 19 2018, 2:55 PM · Ops, Phacility

Mar 9 2018

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

When phage is executed against a large number of hosts, a bunch of the processes fail immediately. This is likely easy to fix, either some sort of macOS ulimit or one of the sshd knobs. This isn't the limiting factor in any operational activity today but will be some day.
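As a rough illustration of the "ulimit" theory in the comment above, the sketch below estimates how many parallel connections fit under a per-process open-file limit. It is not part of phage; the function names, the figure of three descriptors per connection, and the reserved headroom are all assumptions made for the example, and the real cause may be an sshd setting instead.

```python
import resource


def max_parallel_connections(soft_limit, fds_per_connection=3, reserved=16):
    """Estimate how many child connections fit under an open-file limit.

    fds_per_connection and reserved are assumed values for illustration,
    not measurements of phage or ssh.
    """
    usable = max(soft_limit - reserved, 0)
    return usable // fds_per_connection


def current_soft_limit():
    """Return this process's soft RLIMIT_NOFILE (Unix only).

    Returns None when the limit is reported as unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft
```

With a soft limit of 256, for instance, `max_parallel_connections(256)` gives 80 under these assumptions, so a run against a few hundred hosts at once could plausibly exhaust descriptors.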

Mar 9 2018, 8:30 PM · Plans, Ops, Infrastructure, Phacility

Mar 5 2018

epriestley moved T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction from Backlog to Tentative on the Plans board.
Mar 5 2018, 3:03 PM · Plans, Ops, Infrastructure, Phacility

Feb 25 2018

epriestley added a project to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Plans.
Feb 25 2018, 3:28 PM · Plans, Ops, Infrastructure, Phacility

Feb 14 2018

epriestley renamed T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction from Plans: Phacility cluster infrastructure improvements to Plans: Phacility cluster caching, renaming, and rebalance/compaction.
Feb 14 2018, 2:20 PM · Plans, Ops, Infrastructure, Phacility
epriestley closed T12217: Reduce the hardware cost of Phacility free instances as Invalid.

We no longer offer free instances and I don't currently plan to offer them again, so this is moot.

Feb 14 2018, 2:19 PM · Ops, Phacility
epriestley closed T12217: Reduce the hardware cost of Phacility free instances, a subtask of T12218: Reduce the operational cost of a larger Phacility cluster, as Invalid.
Feb 14 2018, 2:19 PM · Ops, Phacility
epriestley closed T12218: Reduce the operational cost of a larger Phacility cluster as Resolved.

We no longer offer free instances so tier growth is slower, and I plan to compact the tier in the nearish term.

Feb 14 2018, 2:18 PM · Ops, Phacility
epriestley closed T12298: Allow daemon pools to autoscale down to 0 processes, a subtask of T12217: Reduce the hardware cost of Phacility free instances, as Wontfix.
Feb 14 2018, 2:12 PM · Ops, Phacility
epriestley closed T12298: Allow daemon pools to autoscale down to 0 processes as Wontfix.

We no longer offer free instances so I don't currently plan to pursue this.

Feb 14 2018, 2:12 PM · Daemons, Ops, Phacility
epriestley triaged T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction as Normal priority.
Feb 14 2018, 2:10 PM · Plans, Ops, Infrastructure, Phacility

Feb 6 2018

epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

I'd either want to pay some service to deal with this or run an open source server that I was confident it was possible to fix, but one option is definitely to run Postfix/Dovecot/Exim/whatever the kids use these days.

Feb 6 2018, 11:58 AM · Phacility, Ops
avivey added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

If all you're going for are emails, why not spin up your own Exchange server (or whatever the kids use these days)? With maybe forwarding to an epriestley-phacility@gmail.com so you remain compliant?

Feb 6 2018, 2:01 AM · Phacility, Ops

Feb 5 2018

epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

🎉🎉🎉 I RECEIVED AN EMAIL AND CLICKED A LINK CONTAINED INSIDE IT 🎉🎉🎉

Feb 5 2018, 4:21 PM · Phacility, Ops
epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

Ah, it looks like Google Groups spam filtered some of the Twitter verification mail!

Feb 5 2018, 4:18 PM · Phacility, Ops
epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

Some of this seems to be that messages sent from my @phacility.com address are sometimes eaten entirely (?) or just not delivered to me (?), presumably because I'm a recipient. I still haven't been able to get the confirmation link from Twitter, but it's possible that's on Twitter's end.

Feb 5 2018, 4:15 PM · Phacility, Ops
epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

I figured some of this might be Safari vs Chrome, even though there's no obvious indication that Safari is having issues (e.g., I didn't catch any JS errors in the console).

Feb 5 2018, 3:59 PM · Phacility, Ops
epriestley added a comment to T13062: Trying to manage anything in Gsuite is kind of not great?.

There's an hourglass icon in the header menu. I don't know what this is for. When I click this, I get a "Loading..." menu which never loads. I reloaded the page, too. There's nothing pertinent in the console error log.

Feb 5 2018, 3:53 PM · Phacility, Ops
epriestley triaged T13062: Trying to manage anything in Gsuite is kind of not great? as Wishlist priority.
Feb 5 2018, 3:51 PM · Phacility, Ops

Jan 31 2018

epriestley closed T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs as Resolved.

See T13056 for followup.

Jan 31 2018, 2:40 PM · Transactions, Ops, Phacility

Jan 30 2018

epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

I just let that run for a while but it finished at some point:

Jan 30 2018, 8:43 PM · Transactions, Ops, Phacility
epriestley closed T10655: Phacility cluster mail deliverability issue as Resolved.
Jan 30 2018, 7:14 PM · Ops, Phacility
epriestley added a comment to T10655: Phacility cluster mail deliverability issue.

I'm going to roll this forward into T12677 since SES managed to one-up this by a healthy margin in T12237.

Jan 30 2018, 7:14 PM · Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.
...
 OPTIMIZE  Optimizing table "<instance>_audit"."audit_transaction"...
  DONE  Compacted table by 139 GB in 910,219ms.
...
Jan 30 2018, 12:53 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

The compaction completed overnight. I'm optimizing the tables now.

Jan 30 2018, 12:32 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

Pool is full again, repo is upgrading, edges are compacting on the instance shard.

Jan 30 2018, 3:23 AM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

web004 is deploying now.

Jan 30 2018, 3:16 AM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

web004 died abruptly so I'm going to fix that and deploy these changes at the same time.

Jan 30 2018, 3:12 AM · Transactions, Ops, Phacility

Jan 29 2018

epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

Even on our fairly normal data, the effect was a little bit more dramatic than I'd expected:

Jan 29 2018, 11:04 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

I'm going to optimize + probe secure001 now and see if any of the tables above shrunk. I'm expecting a very modest effect combined with zero user visible changes in the UI despite throwing away a bunch of data.

Jan 29 2018, 8:42 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

Took like 3-ish minutes and did this:

Jan 29 2018, 8:40 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

I double checked that our backups are working.

Jan 29 2018, 8:35 PM · Transactions, Ops, Phacility
epriestley removed a project from T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: 🐳.
Jan 29 2018, 7:41 PM · Transactions, Ops, Phacility
epriestley added a project to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: 🐳.

Editing some edges on the new code as a sanity check before I compact things.

Jan 29 2018, 7:41 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

(Pushing this to secure, stuff might be funky for a minute while I gently massage the database.)

Jan 29 2018, 7:36 PM · Transactions, Ops, Phacility

Jan 27 2018

epriestley added a revision to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: D18949: Add a bit of test coverage for bulky vs compact edge data representations.
Jan 27 2018, 3:33 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

My plan is to pick those to stable, then compact-edges here on secure, then compact-edges on the affected 130GB instance. There's some value in doing this sooner rather than later because the backups for 130GB of edge data are having some issues. The instance is a free test instance so this isn't a huge concern, but I'd sleep better if it was running smoothly. If you don't run compact-edges I think the worst those changes could really do is cause some kind of temporary display bug with new transactions, so the risk should be pretty small.

Jan 27 2018, 2:52 PM · Transactions, Ops, Phacility
epriestley added a revision to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: D18948: Add `bin/garbage compact-edges` to compact edges into the new format.
Jan 27 2018, 2:19 PM · Transactions, Ops, Phacility
epriestley added a revision to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: D18947: Write edge transactions in a more compact way.
Jan 27 2018, 1:56 PM · Transactions, Ops, Phacility
epriestley added a revision to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs: D18946: Wrap edge transaction readers in a translation layer.
Jan 27 2018, 1:30 PM · Transactions, Ops, Phacility
epriestley added a comment to T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.

Bad news: data still has one reader/writer in the Asana-to-Revision linking implementation. So we can't completely get rid of that yet.

Jan 27 2018, 1:25 PM · Transactions, Ops, Phacility
epriestley updated the task description for T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.
Jan 27 2018, 1:21 PM · Transactions, Ops, Phacility
epriestley updated the task description for T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs.
Jan 27 2018, 1:03 PM · Transactions, Ops, Phacility
epriestley triaged T13051: Transaction edge storage is inefficient / an instance has 130GB of commit cross-refs as Normal priority.
Jan 27 2018, 12:59 PM · Transactions, Ops, Phacility

Jan 26 2018

epriestley closed T13050: 2018 Week 4 pre-release deploy issues on secure.phabricator.com as Resolved.

this can probably be figured out by examining the 2.16.0 release notes

Jan 26 2018, 9:43 PM · Phacility, Ops
epriestley added a revision to T13050: 2018 Week 4 pre-release deploy issues on secure.phabricator.com: D18945: Move the fix for Git 2.16.0 from the "Mercurial" part of the code to the "Git" part of the code.
Jan 26 2018, 9:43 PM · Phacility, Ops
epriestley added a revision to T13050: 2018 Week 4 pre-release deploy issues on secure.phabricator.com: D18944: Pass "." to `git grep` to satisfy "all paths" for Git 2.16.0.
Jan 26 2018, 9:37 PM · Phacility, Ops
epriestley triaged T13050: 2018 Week 4 pre-release deploy issues on secure.phabricator.com as Normal priority.
Jan 26 2018, 9:33 PM · Phacility, Ops

Nov 28 2017

epriestley added a revision to T12646: Reduce the impact of "admin" backups on instances: Restricted Differential Revision.
Nov 28 2017, 3:51 PM · Phacility, Ops
epriestley added a revision to T12297: Make Conduit API calls on `admin.phacility.com` reasonably easy to profile: Restricted Differential Revision.
Nov 28 2017, 3:51 PM · XHProf, Ops, Phacility
epriestley added a revision to T12801: Simplify Almanac services in the Phacility production cluster: Restricted Differential Revision.
Nov 28 2017, 3:51 PM · Almanac, Ops, Phacility

Nov 17 2017

amckinley added a comment to T13017: *.phacility.com TLS certificate expiration.

Fixed! Thanks again for the report.

Nov 17 2017, 5:18 PM · Phacility, Ops
amckinley added a comment to T13017: *.phacility.com TLS certificate expiration.

Thanks; good catch. Fixing now.

Nov 17 2017, 4:04 PM · Phacility, Ops
avivey added a comment to T13017: *.phacility.com TLS certificate expiration.

https://phacility.com/ is still giving an SSL error; I think it's net::ERR_CERT_COMMON_NAME_INVALID (maybe *.x doesn't cover x?)
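The guess in the comment above matches how wildcard certificate names are normally matched: the `*` stands in for exactly one leftmost label, so `*.phacility.com` does not cover the bare `phacility.com`. A minimal sketch of that rule follows; `wildcard_matches` is a hypothetical helper written for this example, not the full matching logic of any browser or TLS library.

```python
def wildcard_matches(pattern, hostname):
    """Check a hostname against a certificate name such as "*.example.com".

    Simplified rule: a "*" is only honored as the entire leftmost label
    and stands in for exactly one label.
    """
    p = pattern.lower().rstrip(".").split(".")
    h = hostname.lower().rstrip(".").split(".")
    # A wildcard never adds or removes labels, so the counts must agree.
    if len(p) != len(h):
        return False
    if p[0] == "*":
        return h[0] != "" and p[1:] == h[1:]
    return p == h
```

Under this rule `admin.phacility.com` matches `*.phacility.com` while `phacility.com` does not, so a certificate has to list the bare domain as an additional name to cover it.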

Nov 17 2017, 1:58 PM · Phacility, Ops

Nov 16 2017

amckinley shifted T13017: *.phacility.com TLS certificate expiration from the Restricted Space space to the S1 Core space.
Nov 16 2017, 8:50 PM · Phacility, Ops
amckinley created T13017: *.phacility.com TLS certificate expiration.
Nov 16 2017, 8:42 PM · Phacility, Ops
mgood added a comment to T11815: Rotate *.phacility.com SSL certificate.

Ok, I received a follow-up from support and it's working again.

Nov 16 2017, 8:17 PM · Phacility, Ops
mgood added a comment to T11815: Rotate *.phacility.com SSL certificate.

This doesn't appear to be working. My team is reporting lots of problems with certificate errors with our hosted instance. It was working for me in Chrome as long as I was signed in, but opening a private tab I was redirected to admin.phacility.com and received an error there that the cert had expired.

Nov 16 2017, 8:06 PM · Phacility, Ops

Nov 10 2017

epriestley renamed T13012: Mercurial "--config" and "--debugger" command injection vulnerability from Mercurial "--config" command injection vulnerability to Mercurial "--config" and "--debugger" command injection vulnerability.
Nov 10 2017, 3:42 PM · Mercurial, Security

Nov 9 2017

epriestley added a comment to T13012: Mercurial "--config" and "--debugger" command injection vulnerability.

The other major thing I tried was throwing exceptions when the values for %s, etc., contained --debugger or --config while constructing Mercurial commands.
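For readers unfamiliar with the approach described above, here is a minimal sketch of rejecting dangerous values while a command is being assembled. It is not the actual Phabricator patch (which is PHP and not shown in this feed); `check_hg_argument` and `build_hg_command` are hypothetical names, and the check is a plain substring test chosen only to mirror the wording "contained --debugger or --config".

```python
FORBIDDEN_FLAGS = ("--config", "--debugger")


def check_hg_argument(value):
    """Raise if a value to be placed in an hg command contains a forbidden flag."""
    for flag in FORBIDDEN_FLAGS:
        if flag in value:
            raise ValueError(
                "Refusing to build Mercurial command: argument contains %r" % flag
            )
    return value


def build_hg_command(subcommand, *values):
    """Assemble an argv list for hg, validating every caller-supplied value."""
    return ["hg", subcommand] + [check_hg_argument(v) for v in values]
```

A call like `build_hg_command("log", "-r", "tip")` returns the argv list, while passing a value such as `--config=alias.log=!sh` raises before anything is executed. A substring test is deliberately blunt: it also rejects harmless values that merely contain those strings.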

Nov 9 2017, 4:08 PM · Mercurial, Security
epriestley added a comment to T13012: Mercurial "--config" and "--debugger" command injection vulnerability.

I have a not-so-great patch for this ready when Mercurial makes a decision. This isn't great, but it's the best I could come up with after trying a few things.

Nov 9 2017, 3:43 PM · Mercurial, Security
epriestley added a comment to T13012: Mercurial "--config" and "--debugger" command injection vulnerability.

In reply:

Nov 9 2017, 3:36 PM · Mercurial, Security
epriestley added a comment to T13012: Mercurial "--config" and "--debugger" command injection vulnerability.

From @durin42, via Mercurial Security:

Nov 9 2017, 3:35 PM · Mercurial, Security
epriestley created T13012: Mercurial "--config" and "--debugger" command injection vulnerability.
Nov 9 2017, 2:50 PM · Mercurial, Security

Nov 6 2017

pouyana added a comment to T12856: Evaluate various "infrastructure-as-code" products.

For anybody still interested: the project was in PHP, so I had to rewrite it in Go so it can be used elsewhere.

Nov 6 2017, 1:18 PM · Ops, Phacility

Oct 23 2017

epriestley closed T13009: MySQL may take several seconds after restart to begin listening on domain socket as Resolved.

I pushed secure004 with that patch and it worked fine, although mysql came back up fast enough that we didn't have to wait. Since I don't have a way to actually trigger this condition, I'm going to assume this is resolved until we have evidence otherwise.

Oct 23 2017, 8:11 PM · Ops, Phacility
epriestley closed T13008: Process slot exhaustion in Phacility web tier as Resolved.

This has been stable for about a week now.

Oct 23 2017, 6:24 PM · Ops, Phacility
epriestley added a comment to T13003: `admin.phacility.com` is receiving a huge volume of "leafweb" traffic.

This traffic eventually stopped on October 20th.

Oct 23 2017, 6:24 PM · Ops, Phacility
epriestley closed T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table as Resolved.

The pruned (at 0.15 threshold), optimized ngram index for the original affected instance is only ~13GB, which is entirely manageable with the changes to backups, and we haven't run into other issues.

Oct 23 2017, 6:21 PM · Ops, Phacility
epriestley added a comment to T13009: MySQL may take several seconds after restart to begin listening on domain socket.

We hit one or more of these (db024) last week so D18725 should fix it.

Oct 23 2017, 6:09 PM · Ops, Phacility
epriestley added a revision to T13009: MySQL may take several seconds after restart to begin listening on domain socket: Restricted Differential Revision.
Oct 23 2017, 6:03 PM · Ops, Phacility

Oct 22 2017

holmboe added a comment to T12856: Evaluate various "infrastructure-as-code" products.

I wrote a small dynamic inventory client (based on the Almanac/Passphrase Conduit API). It is not open source, but I can open source it if there is a need.

Oct 22 2017, 5:05 PM · Ops, Phacility

Oct 17 2017

epriestley added a revision to T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table: D18710: Parameterize the common ngrams threshold.
Oct 17 2017, 9:00 PM · Ops, Phacility

Oct 16 2017

epriestley lowered the priority of T13008: Process slot exhaustion in Phacility web tier from Unbreak Now! to Normal.

I'm going to call this tentatively done and leave it open for a week or two to see if there's any fallout.

Oct 16 2017, 3:55 PM · Ops, Phacility
epriestley added a comment to T13008: Process slot exhaustion in Phacility web tier.

Something is making a lot of requests

Oct 16 2017, 3:54 PM · Ops, Phacility
epriestley added a comment to T13008: Process slot exhaustion in Phacility web tier.

The behavior of mod_reqtimeout is weird in production. Some of this is the LB, but some of it I don't have a good explanation for.

Oct 16 2017, 3:38 PM · Ops, Phacility