In my mind, the only real problem here is...
Apr 9 2018
Possibly something dumb like "mount an EFS volume on /mnt/logs/ on every host" is another AWS-only approach. That feels really dumb but maybe it's only somewhat dumb?
I'm fine with CloudWatch Logs; I didn't realize it had a logging thing. I'm currently not terribly warm on investing a ton in CloudWatch alarms or log analysis, but if we can get logs reliably streaming into some storage service so they can't fill up the disk, I think that's quite valuable on its own. We can always replace that with something we control directly in the future if we want to invest more here. I can't imagine ever really caring about old logs no matter what happens in the future.
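For a rough idea of the shape of that streaming, here's a minimal sketch with boto3 (the group, stream, and file path are made-up placeholders, not our actual config; in practice an agent would do this):

```python
# Sketch: ship log lines to CloudWatch Logs with boto3. Group, stream,
# and file path are hypothetical placeholders.
import time

import boto3

logs = boto3.client("logs")
GROUP = "/phacility/hosts"  # hypothetical log group
STREAM = "web001"           # hypothetical per-host stream

# Create the group and stream on first run; ignore "already exists".
for call, kwargs in (
    (logs.create_log_group, {"logGroupName": GROUP}),
    (logs.create_log_stream, {"logGroupName": GROUP, "logStreamName": STREAM}),
):
    try:
        call(**kwargs)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass

def ship(lines):
    """Send a batch of log lines as CloudWatch log events."""
    if not lines:
        return
    # A production shipper also has to respect the API's batch size
    # limits and (in older API versions) manage sequence tokens.
    logs.put_log_events(
        logGroupName=GROUP,
        logStreamName=STREAM,
        logEvents=[
            {"timestamp": int(time.time() * 1000), "message": ln.rstrip("\n")}
            for ln in lines
        ],
    )

with open("/var/log/daemons.log") as f:
    ship(f.readlines())
```

The useful property is that once lines land in CloudWatch they can't fill up a host's disk, which is the whole goal here.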
I definitely don't want to send logs to a third-party service.
In the modern era: I think we generally understand what private clusters will look like now, but I'd like to take a much more iterative approach to getting there than we have in the past. I had this concern above (circa June 2017):
Mar 30 2018
I'd like to analyze data on the $5 tier with an eye to removing it. Anecdotally, this doesn't seem to be doing anything good for us ... It also feels like we have a lot of $5 test instances.
The UI and allocation logic for shards need tweaking.
Mar 28 2018
After clicking 17,000 emails I successfully transferred everything to AWS, with some minor caveats:
I moved SSL and registration for phurl.io to AWS in T13113 so this is just a transfer issue now.
oh wow it actually worked 😱
AWS also supports DNS-based authorization now, which reduces the need for all the MX juggling.
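For reference, DNS validation looks roughly like this with boto3 (domain names mirror the ones above; this is a sketch, not the exact commands used):

```python
# Sketch: request an ACM certificate with DNS validation, so renewal
# doesn't depend on receiving mail at the domain.
import boto3

acm = boto3.client("acm", region_name="us-east-1")

arn = acm.request_certificate(
    DomainName="phurl.io",
    SubjectAlternativeNames=["*.phurl.io"],
    ValidationMethod="DNS",
)["CertificateArn"]

# ACM hands back a CNAME to publish in DNS (it can take a few seconds
# to appear on the certificate); once it resolves, the certificate is
# issued and renews automatically.
cert = acm.describe_certificate(CertificateArn=arn)["Certificate"]
for option in cert["DomainValidationOptions"]:
    record = option.get("ResourceRecord")
    if record:
        print(record["Name"], record["Type"], record["Value"])
```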
I was able to MX phurl.io and get an SSL authorization link working. I moved phurl.io SSL to AWS ACM so this (certificate expiration) shouldn't happen again.
Mar 19 2018
I removed this in rCORE36d2ef5dffe441ba1175e362bb73f0e43a9f70a2.
Mar 9 2018
When phage is executed against a large number of hosts, a bunch of the processes fail immediately. This is likely easy to fix: either some sort of macOS ulimit or one of the sshd knobs. This isn't the limiting factor in any operational activity today, but it will be some day.
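I haven't dug into phage's internals for this, but the usual client-side mitigation is just to cap fan-out so we never exceed local file descriptor limits or the remote sshd's connection backlog. A sketch (hostnames and command are made up):

```python
# Sketch: fan a command out to many hosts while capping concurrent ssh
# subprocesses, so the client stays under `ulimit -n` and the remote
# sshd's MaxStartups backlog.
import asyncio

LIMIT = 64  # concurrent connections; tune against local/remote limits

async def run_on_host(sem, host, command):
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            "ssh", host, command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, _ = await proc.communicate()
        return host, proc.returncode, out

async def main(hosts, command):
    sem = asyncio.Semaphore(LIMIT)
    for host, code, _ in await asyncio.gather(
        *(run_on_host(sem, host, command) for host in hosts)
    ):
        print(f"{host}: exit {code}")

asyncio.run(main([f"db{n:03d}" for n in range(1, 129)], "uptime"))
```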
Mar 5 2018
Feb 25 2018
Feb 14 2018
We no longer offer free instances and I don't currently plan to offer them again, so this is moot.
We no longer offer free instances so tier growth is slower, and I plan to compact the tier in the nearish term.
We no longer offer free instances so I don't currently plan to pursue this.
Feb 6 2018
I'd either want to pay some service to deal with this or run an open source server that I was confident I could fix, but one option is definitely to run Postfix/Dovecot/Exim/whatever the kids use these days.
If all you're going for are emails, why not spin up your own Exchange server (or whatever the kids use these days)? Maybe with forwarding to an epriestley-phacility@gmail.com so you remain compliant?
Feb 5 2018
🎉🎉🎉 I RECEIVED AN EMAIL AND CLICKED A LINK CONTAINED INSIDE IT 🎉🎉🎉
Ah, it looks like Google Groups spam filtered some of the Twitter verification mail!
Some of this seems to be that messages sent from my @phacility.com address are sometimes eaten entirely (?) or just not delivered to me (?), presumably because I'm a recipient. I still haven't been able to get the confirmation link from Twitter, but it's possible that's on Twitter's end.
I figured some of this might be Safari vs Chrome, even though there's no obvious indication that Safari is having issues (e.g., I didn't catch any JS errors in the console).
There's an hourglass icon in the header menu. I don't know what this is for. When I click this, I get a "Loading..." menu which never loads. I reloaded the page, too. There's nothing pertinent in the console error log.
Jan 31 2018
See T13056 for followup.
Jan 30 2018
I just let that run for a while but it finished at some point:
```
...
OPTIMIZE  Optimizing table "<instance>_audit"."audit_transaction"...
    DONE  Compacted table by 139 GB in 910,219ms.
...
```
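Mechanically, that pass is roughly a loop of OPTIMIZE TABLE statements. A sketch, with pymysql and placeholder connection details (the database name stands in for the "<instance>_audit" above):

```python
# Sketch: optimize each table in a database, timing each pass.
# Connection details and the database name are placeholders.
import time

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="")
with conn.cursor() as cursor:
    cursor.execute("SHOW TABLES IN `instance_audit`")
    tables = [row[0] for row in cursor.fetchall()]
    for table in tables:
        start = time.time()
        # For InnoDB with innodb_file_per_table, OPTIMIZE rebuilds the
        # table and returns the space freed by deleted rows to the OS.
        cursor.execute(f"OPTIMIZE TABLE `instance_audit`.`{table}`")
        cursor.fetchall()
        print(f"Optimized {table} in {int((time.time() - start) * 1000):,}ms")
conn.close()
```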
The compaction completed overnight. I'm optimizing the tables now.
Pool is full again, repo is upgrading, edges are compacting on the instance shard.
web004 is deploying now.
web004 died abruptly so I'm going to fix that and deploy these changes at the same time.
Jan 29 2018
Even on our fairly normal data, the effect was a little bit more dramatic than I'd expected:
I'm going to optimize + probe secure001 now and see if any of the tables above shrunk. I'm expecting a very modest effect combined with zero user visible changes in the UI despite throwing away a bunch of data.
Took like 3-ish minutes and did this:
I double checked that our backups are working.
Editing some edges on the new code as a sanity check before I compact things.
(Pushing this to secure, stuff might be funky for a minute while I gently massage the database.)
Jan 27 2018
My plan is to pick those to stable, then compact-edges here on secure, then compact-edges on the affected 130GB instance. There's some value in doing this sooner rather than later because the backups for 130GB of edge data are having some issues. The instance is a free test instance so this isn't a huge concern, but I'd sleep better if it was running smoothly. If you don't run compact-edges I think the worst those changes could really do is cause some kind of temporary display bug with new transactions, so the risk should be pretty small.
Bad news: data still has one reader/writer in the Asana-to-Revision linking implementation. So we can't completely get rid of that yet.
Jan 26 2018
this can probably be figured out by examining the 2.16.0 release notes
Nov 28 2017
Nov 17 2017
Fixed! Thanks again for the report.
Thanks; good catch. Fixing now.
https://phacility.com/ is still giving an SSL error - it's net::ERR_CERT_COMMON_NAME_INVALID, I think. (Maybe *.x doesn't cover x?)
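(The hunch is right: a wildcard matches exactly one label, so *.phacility.com does not cover the bare phacility.com. A quick way to check what a served certificate actually covers, as a sketch:)

```python
# Sketch: list the DNS names on the certificate a server presents, to
# confirm whether the bare domain is covered by a wildcard cert.
import socket
import ssl

def cert_names(host, port=443):
    """Return the DNS names on the certificate the server presents."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return [name for kind, name in cert.get("subjectAltName", ()) if kind == "DNS"]

try:
    print(cert_names("phacility.com"))
except (ssl.SSLError, ssl.CertificateError) as err:
    # A name mismatch fails the handshake itself, which is the
    # browser-visible ERR_CERT_COMMON_NAME_INVALID symptom.
    print("handshake failed:", err)
```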
Nov 16 2017
Ok, I received a follow-up from support and it's working again.
This doesn't appear to be working. My team is reporting lots of certificate errors with our hosted instance. It was working for me in Chrome as long as I was signed in, but when I opened a private tab I was redirected to admin.phacility.com and received an error there saying the cert had expired.
Nov 10 2017
Nov 9 2017
The other major thing I tried was throwing exceptions when the values for %s, etc., contained --debugger or --config while constructing Mercurial commands.
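In rough terms, that approach looks like this (a Python sketch for illustration, since the real code is PHP; function names here are made up):

```python
# Sketch of the approach described above: refuse to interpolate values
# that Mercurial would parse as flags (like --debugger or --config)
# when building a command line.
import subprocess

class CommandInjectionError(Exception):
    pass

def hg_arg(value):
    """Reject user-controlled values that look like hg options."""
    if value.startswith("-"):
        raise CommandInjectionError(
            f"Refusing to pass flag-like argument to hg: {value!r}"
        )
    return value

def hg_log(repo_path, rev):
    # Where a subcommand supports it, also terminating the option list
    # with "--" gives a second layer of defense.
    return subprocess.run(
        ["hg", "log", "--rev", hg_arg(rev)],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
```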
I have a not-so-great patch for this ready to go once Mercurial makes a decision; it isn't great, but it's the best I could come up with after trying a few things.
In reply:
From @durin42, via Mercurial Security:
Nov 6 2017
For anybody still interested: the project was in PHP; I had to rewrite it in Go so it can be used elsewhere.
Oct 23 2017
I pushed secure004 with that patch and it worked fine, although mysql came back up fast enough that we didn't have to wait. Since I don't have a way to actually trigger this condition, I'm going to assume this is resolved until we have evidence otherwise.
This has been stable for about a week now.
This traffic eventually stopped on October 20th.
The pruned (at 0.15 threshold), optimized ngram index for the original affected instance is only ~13GB, which is entirely manageable with the changes to backups, and we haven't run into other issues.
We hit one or more of these (db024) last week, so D18725 should fix it.
Oct 22 2017
In T12856#231964, @pouyana wrote: I wrote a small dynamic inventory client (based on the Almanac/Passphrase Conduit API). It is not open source, but I can open source it if there is a need.
Oct 17 2017
Oct 16 2017
I'm going to call this tentatively done and leave it open for a week or two to see if there's any fallout.
Something is making a lot of requests
The behavior of mod_reqtimeout is weird in production. Some of this is the LB, but some of it I don't have a good explanation for.
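One way to separate the two is a deliberately slow client: dribble a request header a byte at a time and see when, and at which layer, the connection gets dropped. A sketch (host and port are placeholders; run it against the LB and against a backend directly and compare):

```python
# Sketch: a deliberately slow client for probing request timeouts.
# Sending one byte per second is far below mod_reqtimeout's default
# minimum rate, so a timeout-enforcing layer should drop us.
import socket
import time

HOST, PORT = "127.0.0.1", 80
REQUEST = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"

sock = socket.create_connection((HOST, PORT))
start = time.time()
try:
    for byte in REQUEST:
        sock.sendall(bytes([byte]))
        time.sleep(1)
    print("request accepted:", sock.recv(256)[:64])
except (BrokenPipeError, ConnectionResetError):
    print(f"connection closed after {time.time() - start:.1f}s")
finally:
    sock.close()
```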