T13182 was a resurgence of this exact issue, but I'm currently banking on the fix being a move to Ubuntu 16 with T13076 before this breaks again, given that it was stable for ~18 months.

Aug 10 2018, 5:26 PM · Ops, Phacility, Aphlict

epriestley closed T13182: Something in the Ubuntu environment changed, affecting node / "ws" as Resolved.

That appears to have fixed it, so this was just T12171-redux.

Aug 10 2018, 5:26 PM · Phacility, Ops

epriestley added a comment to T13182: Something in the Ubuntu environment changed, affecting node / "ws".

[secure004] Error: You need to install the Node.js "ws" module for websocket support. See "Notifications User Guide: Setup and Configuration" in the documentation for instructions. SyntaxError: Use of const in strict mode.

Aug 10 2018, 4:54 PM · Phacility, Ops

epriestley triaged T13182: Something in the Ubuntu environment changed, affecting node / "ws" as Low priority.

Aug 10 2018, 4:50 PM · Phacility, Ops

Aug 9 2018

epriestley added a comment to T13121: Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands.

One specific thing we hit in the context of T13180: sbuild001 stopped and started with a new public IP, and the Almanac device record needed to be updated with the new address.

Aug 9 2018, 12:05 AM · Phacility, Ops

amckinley closed T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018 as Resolved.

This was more fun than expected. Notes for future historians:

Aug 9 2018, 12:04 AM · Phacility, Ops

Aug 8 2018

amckinley added a comment to T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018.

Doing this now.

Aug 8 2018, 10:58 PM · Phacility, Ops

amckinley added a comment to T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018.

Sure, will do this afternoon.

Aug 8 2018, 4:40 PM · Phacility, Ops

epriestley assigned T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018 to amckinley.

@amckinley, do you want to take this one whenever you get a chance? This seemed to work the last time:

Aug 8 2018, 3:41 PM · Phacility, Ops

Aug 1 2018

epriestley triaged T13180: AWS is rebooting sbuild001.phacility.net on August 15, 2018 as Low priority.

Aug 1 2018, 1:09 PM · Phacility, Ops

Jul 21 2018

epriestley closed T13167: AWS is rebooting several production hosts (July 2018) as Resolved.

It seems like that went through cleanly. I just did Stop + Start + bin/remote deploy on the affected hosts. I then launched a test instance with placement allocations on two of the affected services; it came up cleanly.

Jul 21 2018, 10:50 AM · Phacility, Ops

epriestley added a comment to T13167: AWS is rebooting several production hosts (July 2018).

Beginning the stop/start stuff now.

Jul 21 2018, 10:24 AM · Phacility, Ops

epriestley added a comment to T13167: AWS is rebooting several production hosts (July 2018).

I'm planning to stop/start these instances during the maintenance window today since getting the rebalance into production in the next five days seems wildly optimistic.

Jul 21 2018, 9:27 AM · Phacility, Ops

Jul 20 2018

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): Restricted Differential Revision.

Jul 20 2018, 10:00 PM · Phacility, Ops

epriestley added a comment to T13167: AWS is rebooting several production hosts (July 2018).

See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.

Jul 20 2018, 9:52 PM · Phacility, Ops

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): Restricted Differential Revision.

Jul 20 2018, 5:31 PM · Phacility, Ops

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): Restricted Differential Revision.

Jul 20 2018, 5:30 PM · Phacility, Ops

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): D19521: Allow callers to choose which directory a "TempFile" is created in.

Jul 20 2018, 5:28 PM · Phacility, Ops

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): Restricted Differential Revision.

Jul 20 2018, 5:03 PM · Phacility, Ops

epriestley added a revision to T13167: AWS is rebooting several production hosts (July 2018): Restricted Differential Revision.

Jul 20 2018, 4:54 PM · Phacility, Ops

epriestley renamed T13167: AWS is rebooting several production hosts (July 2018) from AWS is rebooting every host (July 2018) to AWS is rebooting several production hosts (July 2018).

Jul 20 2018, 4:52 PM · Phacility, Ops

Jul 19 2018

amckinley added a comment to T12857: Temporary directory fullness can cause daemon issues?.

EC2 volume ddata005.phacility.net filled up, causing problems for instances hosted on db005, leading to PHI771. I'll dig back into the CloudWatch monitoring stuff I set up a few months ago and make the db hosts report storage metrics the same way the repo hosts already do.
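As a rough illustration of the kind of storage metric described above, this Python sketch computes the used-space percentage for a mount point and compares it to a threshold. The function names, the 90% threshold, and the path are assumptions for illustration only; this is not the reporting code from the CloudWatch revisions referenced in this feed.

```python
import shutil


def disk_used_percent(path):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path, threshold_percent=90.0):
    """Return True when usage meets or exceeds the (hypothetical) alarm threshold."""
    return disk_used_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "/" is a placeholder; a real host would check its data volume's mount point.
    print("%.1f%% used" % disk_used_percent("/"))
```

A value like this could be published as a custom metric on a schedule, so a filling volume raises an alarm before it affects hosted instances.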

Jul 19 2018, 11:42 PM · Diffusion, Ops, Daemons, Phacility

Jul 17 2018

epriestley triaged T13167: AWS is rebooting several production hosts (July 2018) as Normal priority.

Jul 17 2018, 10:20 PM · Phacility, Ops

Apr 20 2018

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

As a general update here since a couple installs are keeping an eye on it, I have a big chunk of the "use instance.state for everything" diff written, but some of the caching logic is tricky and I haven't had a solid chunk of uninterrupted time to stare at it in the last couple days so I don't expect to get it out in time for the release this week.

Apr 20 2018, 9:10 PM · Plans, Ops, Infrastructure, Phacility

amckinley added a revision to T12857: Temporary directory fullness can cause daemon issues?: Restricted Differential Revision.

Apr 20 2018, 7:26 PM · Diffusion, Ops, Daemons, Phacility

amckinley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.

Apr 20 2018, 7:26 PM · Plans, Ops, Infrastructure, Phacility

Apr 18 2018

epriestley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.

Apr 18 2018, 4:37 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.

Apr 18 2018, 3:45 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: Restricted Differential Revision.

Apr 18 2018, 3:43 PM · Plans, Ops, Infrastructure, Phacility

Apr 13 2018

amckinley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: D19368: Add isClusterDevice to Almanac query.

Apr 13 2018, 6:47 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a comment to T13121: Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands.

We can also get a "Something something won't allocate a terminal because something something pseudo-tty." warning without -q. I'm 100% confident this is the exact text of the error.

Apr 13 2018, 11:56 AM · Phacility, Ops

Apr 12 2018

amckinley added a revision to T12857: Temporary directory fullness can cause daemon issues?: D19363: Initial CloudWatch metric reporting support.

Apr 12 2018, 11:25 PM · Diffusion, Ops, Daemons, Phacility

amckinley added a revision to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction: D19363: Initial CloudWatch metric reporting support.

Apr 12 2018, 11:25 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a comment to T13121: Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands.

I'm leaning toward:

Apr 12 2018, 12:11 AM · Phacility, Ops

epriestley triaged T13121: Remove "-q" from SSH commands executed by `bin/remote` and similar cluster commands as Low priority.

Apr 12 2018, 12:08 AM · Phacility, Ops

Apr 11 2018

epriestley closed T12414: Implement Almanac edit endpoints in Conduit, a subtask of T12218: Reduce the operational cost of a larger Phacility cluster, as Resolved.

Apr 11 2018, 5:54 PM · Ops, Phacility

epriestley closed T12414: Implement Almanac edit endpoints in Conduit as Resolved.

I think that's pretty much everything. There will be a little followup work in T10883 and maybe T13076 / T13120.

Apr 11 2018, 5:54 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19343: Allow Almanac properties to be set and deleted via Conduit.

Apr 11 2018, 4:28 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19342: Make various small quality-of-life improvements for Almanac properties.

Apr 11 2018, 3:41 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19341: Allow Almanac Bindings to be enabled/disabled via API and support the "properties" attachment.

Apr 11 2018, 2:17 PM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T12414: Implement Almanac edit endpoints in Conduit.

Everything here should pretty much work except:

Apr 11 2018, 2:08 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19340: Provide "almanac.binding.search" and "almanac.binding.edit".

Apr 11 2018, 1:40 PM · Conduit, Almanac, Ops, Phacility

Apr 10 2018

epriestley added a comment to T12999: Replace cluster magnetic volumes with SSD volumes.

I think all the T13076 stuff is blocked on me for now unless you're feeling especially ambitious about trying to get bin/host download working on 2GB+ files (T12907 / D19011). That might be a bit of a mess of a task though since I think I have a lot of secret cURL knowledge from over the years that is only somewhat-documented in HTTPSFuture.

Apr 10 2018, 10:47 PM · Phacility, Ops

amckinley added a comment to T12999: Replace cluster magnetic volumes with SSD volumes.

Oh, I'm happy to kick this down the road until we go through the Big Compaction. I just saw this task as a dependency for T13076, which is on this week's planning board and figured I'd tackle it. Is there a different task you think I should work on instead to move T13076 forward?

Apr 10 2018, 7:47 PM · Phacility, Ops

amckinley reassigned T12414: Implement Almanac edit endpoints in Conduit from amckinley to epriestley.

Apr 10 2018, 7:32 PM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T12999: Replace cluster magnetic volumes with SSD volumes.

Backups should run continuously (starting 12 hours after the instance launches, then every 24 hours after that):

Apr 10 2018, 7:30 PM · Phacility, Ops

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19338: Implement "almanac.interface.search" and "almanac.interface.edit".

Apr 10 2018, 7:22 PM · Conduit, Almanac, Ops, Phacility

amckinley added a comment to T12999: Replace cluster magnetic volumes with SSD volumes.

I'm going to go through the volumes type-by-type instead of host-by-host, starting with the backup volumes (because those should be fine to detach as long as the backup isn't running). It looks like backups run daily at 2300, so that should give me plenty of time.

Apr 10 2018, 7:05 PM · Phacility, Ops

amckinley moved T12847: A Pathway Towards Private Clusters from Soon to Future on the Plans board.

Apr 10 2018, 6:45 PM · Plans, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19337: Add "almanac.namespace.edit" and "almanac.namespace.search" API methods.

Apr 10 2018, 6:41 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19336: Use a more conventional spelling of "Almanac" for "almanac.service.edit" class.

Apr 10 2018, 6:22 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19335: Add "almanac.network.edit" and "almanac.network.search" API methods.

Apr 10 2018, 6:19 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19334: Modularize Almanac property transactions.

Apr 10 2018, 6:03 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: Restricted Differential Revision.

Apr 10 2018, 5:10 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: Restricted Differential Revision.

Apr 10 2018, 5:08 PM · Conduit, Almanac, Ops, Phacility

amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

Hmmm, then maybe the narrowest way to solve this is to just set up disk space alarms. That's a thing we'd want independent of how we eventually build the One True Logging System. The CloudWatch agent collects disk stats out of the box, and takes a list of mountpoints to track. Details here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html
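A minimal sketch of what such an agent configuration might look like, generated from Python so the mount point list stays in one place. The `disk` section with `measurement`, `resources`, and `metrics_collection_interval` follows the agent configuration format as described in the linked documentation, but check the field names against that page before relying on them. The helper name, the mount points, and the 60-second interval are assumptions.

```python
import json


def build_disk_metrics_config(mountpoints, interval_seconds=60):
    """Build a CloudWatch agent config dict that tracks disk usage on the given mount points."""
    return {
        "metrics": {
            "metrics_collected": {
                "disk": {
                    # Report the percentage of each listed filesystem that is in use.
                    "measurement": ["used_percent"],
                    "resources": list(mountpoints),
                    "metrics_collection_interval": interval_seconds,
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical mount points; substitute the volumes a given host actually mounts.
    config = build_disk_metrics_config(["/", "/mnt/data"])
    print(json.dumps(config, indent=2))
```

The alarm itself (threshold, notification target) would be defined separately in CloudWatch against the resulting metric.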

Apr 10 2018, 4:09 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19329: Modularize transactions for Almanac Device.

Apr 10 2018, 3:43 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19328: Remove TYPE_INTERFACE transaction from Almanac Device.

Apr 10 2018, 3:27 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: Restricted Differential Revision.

Apr 10 2018, 2:24 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: Restricted Differential Revision.

Apr 10 2018, 1:38 PM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T12414: Implement Almanac edit endpoints in Conduit.

Before I can get rid of AlmanacDeviceTransaction::TYPE_INTERFACE, we have two meaningful callsites in rSERVICES and one unit test in rSAAS to clean up.

Apr 10 2018, 1:30 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19325: Use Interface transactions, not Device transactions, to destroy Interfaces.

Apr 10 2018, 1:28 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19324: Edit Interfaces in Almanac with EditEngine.

Apr 10 2018, 1:17 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19323: Add skeleton code for Almanac Interfaces to have real transactions.

Apr 10 2018, 12:57 PM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T12414: Implement Almanac edit endpoints in Conduit.

There's a bit of a mess with AlmanacInterface and AlmanacDevice. Currently, AlmanacInterface does not use transactions, and is edited purely as a side effect of INTERFACE transactions applying to AlmanacDevice. I'm going to change how this works so that AlmanacInterface is a normal transactional object and can use the same rules and infrastructure as everything else.

Apr 10 2018, 12:34 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19322: Modularize Almanac Network transactions.

Apr 10 2018, 12:21 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19321: Modularize Almanac Binding transactions.

Apr 10 2018, 12:15 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19320: Modularize Almanac Namespace transactions.

Apr 10 2018, 11:51 AM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

I think "run weekly with the deploy" already works pretty well for normal log rotation with logs on local disk, and in a perfect world logs would be buffering on disk only briefly before being sent over the network to a dedicated log service so we wouldn't need cron there either.

Apr 10 2018, 11:36 AM · Plans, Ops, Infrastructure, Phacility

Apr 9 2018

amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

How often does the size directive in a logrotate script get evaluated?

Apr 9 2018, 11:48 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19318: Allow "almanac.service.edit" to create services.

Apr 9 2018, 9:20 PM · Conduit, Almanac, Ops, Phacility

epriestley added a revision to T12414: Implement Almanac edit endpoints in Conduit: D19317: Partially modularize AlmanacService transactions.

Apr 9 2018, 9:10 PM · Conduit, Almanac, Ops, Phacility

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

Yeah, I think "solve it forever" is probably "first-party Sentry-like application", this just seems like a problem that should have a 15-minute "solve it for now" solution in the form of something that gives us "a thing that looks like a disk but never fills up" that will hold for a year or two until we build "Phentry".

Apr 9 2018, 9:01 PM · Plans, Ops, Infrastructure, Phacility

amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

That makes sense. I was coming at it more from the direction of “if we’re going to touch logs, let’s solve it forever”. How about we make everything log locally through syslog and use logrotate to avoid filling disks? Then we can come back and point syslog at something more elaborate later without having to touch all the things that currently generate logs.
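The idea of capping log size so a burst cannot fill the disk can be sketched with Python's standard-library `RotatingFileHandler`. This is a stand-in technique chosen for illustration, not syslog or logrotate, and the logger name, size limit, and backup count are assumptions.

```python
import logging
import logging.handlers
import os
import tempfile


def make_capped_logger(log_path, max_bytes=1024, backup_count=2):
    """Create a logger whose files rotate by size, so total disk use stays bounded."""
    logger = logging.getLogger("capped-example-" + log_path)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of all files in `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = make_capped_logger(os.path.join(tmp, "example.log"))
        # Simulate a burst of messages; old files are discarded as new ones rotate in.
        for i in range(1000):
            log.info("burst line %d", i)
        for handler in log.handlers:
            handler.close()
        print(total_log_bytes(tmp))
```

Unlike a periodically run rotation job, this check happens on every write, so even a sudden burst stays within roughly `max_bytes * (backup_count + 1)` bytes, at the cost of discarding older lines.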

Apr 9 2018, 8:40 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

I believe there are zero or nearly-zero actionable errors in the log today and that if we had a perfect Sentry-like system it wouldn't actually help us identify or fix any errors -- we'd just ignore it after a week like we currently ignore all the other production error logs.

Apr 9 2018, 7:49 PM · Plans, Ops, Infrastructure, Phacility

amckinley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

There's some flavor of bin/remote ssh <some-host> -- tail -f <some-log_location>.

Apr 9 2018, 7:33 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.

The "logs fill up disks" problem is more "sudden bursts of unusual error/activity that cause log rates to increase 1000x+ fill up disks", not "logs don't rotate / normal log volumes fill up disks". Most logs do already rotate today, I think with the exception of the aphlict log being a bit of a weird case, maybe.

Apr 9 2018, 7:14 PM · Plans, Ops, Infrastructure, Phacility

epriestley added a comment to T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction.