I'm going to stop/start at least some of these now.
Aug 18 2018
Aug 15 2018
Oh, right, "meetings". I've heard of those!
In T13062#235217, @epriestley wrote:
Aug 13 2018
Hahah you beat me to it!
I think T13183 is the exact same hosts. :)
In T13185#240300, @epriestley wrote:
Same as T13183?
Same as T13183?
Aug 10 2018
That appears to have fixed it, so this was just T12171-redux.
```
[secure004] Error: You need to install the Node.js "ws" module for websocket support. See "Notifications User Guide: Setup and Configuration" in the documentation for instructions.
SyntaxError: Use of const in strict mode.
```
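The trailing SyntaxError suggests the host's Node predates 4.x, old enough that V8 rejected `const` inside strict mode, so it couldn't even parse the "ws" module. On any modern Node this is valid:

```javascript
"use strict";

// Pre-4.x Node threw `SyntaxError: Use of const in strict mode` on the
// next line; Node 4+ parses and runs it without complaint.
const greeting = "ok";
console.log(greeting); // prints "ok"
```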
Aug 9 2018
One specific thing we hit in the context of T13180: sbuild001 stopped and started with a new public IP, and the Almanac device record needed to be updated with the new address.
This was more fun than expected. Notes for future historians:
Aug 8 2018
Doing this now.
Sure, will do this afternoon.
@amckinley, do you want to take this one whenever you get a chance? This seemed to work the last time:
Aug 1 2018
Jul 21 2018
It seems like that went through cleanly. I just did Stop + Start + bin/remote deploy on the affected hosts. I then launched a test instance with placement allocations on two of the affected services; it came up cleanly.
Beginning the stop/start stuff now.
I'm planning to stop/start these instances during the maintenance window today since getting the rebalance into production in the next five days seems wildly optimistic.
Jul 20 2018
See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.
Jul 19 2018
EC2 volume ddata005.phacility.net filled up, causing problems for instances hosted on db005 and leading to PHI771. I'll dig back into the CloudWatch monitoring I set up a few months ago and make the db hosts report storage metrics the same way the repo hosts already do.
Jul 17 2018
Apr 20 2018
As a general update, since a couple of installs are keeping an eye on this: I have a big chunk of the "use instance.state for everything" diff written, but some of the caching logic is tricky and I haven't had a solid stretch of uninterrupted time to stare at it in the last couple of days, so I don't expect to get it out in time for this week's release.
Apr 18 2018
Apr 13 2018
We can also get a "Something something won't allocate a terminal because something something pseudo-tty." warning without -q. I'm not 100% confident that's the exact text of the error.
Apr 12 2018
I'm leaning toward:
Apr 11 2018
Everything here should pretty much work except:
Apr 10 2018
I think all the T13076 stuff is blocked on me for now, unless you're feeling especially ambitious about trying to get bin/host download working on 2GB+ files (T12907 / D19011). That might be a bit of a mess of a task, though, since I think I have a lot of secret cURL knowledge from over the years that is only somewhat documented in HTTPSFuture.
Backups should run continuously (starting 12 hours after the instance launches, then every 24 hours after that):
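That cadence can be sketched quickly (the launch time below is made up purely for illustration; this is not the actual scheduler code):

```python
from datetime import datetime, timedelta

def backup_times(launch, count=3):
    """First backup runs 12 hours after instance launch, then every 24 hours."""
    first = launch + timedelta(hours=12)
    return [first + timedelta(hours=24 * i) for i in range(count)]

# Hypothetical launch at 09:00: backups land at 21:00 that day, then daily.
for t in backup_times(datetime(2018, 4, 10, 9, 0)):
    print(t)
```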
I'm going to go through the volumes type-by-type instead of host-by-host, starting with the backup volumes (because those should be fine to detach as long as the backup isn't running). It looks like backups run daily at 2300, so that should give me plenty of time.
Hmmm, then maybe the narrowest way to solve this is just to set up disk-space alarms. That's something we'd want independently of how we eventually build the One True Logging System. The CloudWatch agent collects disk stats out of the box and takes a list of mountpoints to track. Details here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html
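A minimal sketch of the relevant part of the agent config, with made-up mountpoints (see the linked docs for the authoritative format and full option list):

```
{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": ["used_percent", "inodes_free"],
        "resources": ["/", "/mnt/data"],
        "metrics_collection_interval": 300
      }
    }
  }
}
```

A CloudWatch alarm on `used_percent` then gives us the "disk is filling up" page without building any log pipeline first.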
Before I can get rid of AlmanacDeviceTransaction::TYPE_INTERFACE, we have two meaningful callsites in rSERVICES and one unit test in rSAAS to clean up.
There's a bit of a mess with AlmanacInterface and AlmanacDevice. Currently, AlmanacInterface does not use transactions, and is edited purely as a side effect of INTERFACE transactions applying to AlmanacDevice. I'm going to change how this works so that AlmanacInterface is a normal transactional object and can use the same rules and infrastructure as everything else.
I think "run weekly with the deploy" already works pretty well for normal rotation of logs on local disk. In a perfect world, logs would buffer on disk only briefly before being shipped over the network to a dedicated log service, so we wouldn't need cron there either.
Apr 9 2018
How often does the size directive in a logrotate script get evaluated?
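(For reference: `size` is only evaluated each time logrotate itself runs, typically once a day from cron, not continuously, so a sudden burst can still blow past the threshold between runs.) A minimal sketch, with a hypothetical log path:

```
/var/log/aphlict.log {
    size 100M
    rotate 4
    compress
    missingok
    notifempty
}
```

Running logrotate more often (e.g. hourly) tightens the window, but only shipping logs off-host actually removes the failure mode.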
Yeah, I think "solve it forever" is probably a first-party Sentry-like application. This just seems like a problem that should have a 15-minute "solve it for now" solution: something that gives us "a thing that looks like a disk but never fills up" and will hold for a year or two until we build "Phentry".
That makes sense. I was coming at it more from the direction of "if we're going to touch logs, let's solve it forever". How about we make everything log locally through syslog and use logrotate to avoid filling disks? Then we can come back and point syslog at something more elaborate later without having to touch everything that currently generates logs.
I believe there are zero or nearly-zero actionable errors in the log today and that if we had a perfect Sentry-like system it wouldn't actually help us identify or fix any errors -- we'd just ignore it after a week like we currently ignore all the other production error logs.
There's some flavor of `bin/remote ssh <some-host> -- tail -f <some-log-location>`.
The "logs fill up disks" problem is more "sudden bursts of unusual error/activity that increase log rates 1000x+ and fill up disks", not "logs don't rotate / normal log volumes fill up disks". Most logs do already rotate today, I think with the exception of the aphlict log, which is a bit of a weird case.
`phage tail <some-instance-name> <some-log-type>`
But I don't think this is too valuable if it doesn't solve the "logs fill up disks" problem -- does the agent rotate/reap the logs?