
Plans: Phacility cluster caching, renaming, and rebalance/compaction
Closed, Resolved · Public

Description

The current instance metadata caching strategy isn't great. The new instances.state method is intended to replace it, but it is not yet in production.

  • See T11413. See PHI63. See PHI107. See PHI128. These are instance rename requests, mostly blocked on the move to instances.state.

Maybe see T12801 alongside this. See also T12646. This may moot T12297.

I'd like to rebalance and compact the tier. We have a large amount of extra hardware from when we supported free instances, and some of it has fallen off the reserved instance list, so this is pretty much a money fire. There are two major issues:

  • (D19011, etc) Downloading 2GB+ files via arc doesn't work since we store the whole response body in a big string.
  • (T12830) The UI and allocation logic for shards need tweaking.

See T12999. This likely makes sense alongside rebalance/compaction.

See T12988. This is toolset cleanup.

See T12608. It would be nice to complete this key cycling.

See T12917. It would be nice to complete these domain transfers. This is blocked by needing wildcard MX. I have personal wildcard MX and may just use this. See also T13062.

I'd like to analyze data on the $5 tier with an eye to removing it. Anecdotally, this doesn't seem to be doing anything good for us, but it creates a couple of headaches: the huge $50/month -> $220/month jump at 11 users is enough to prompt instances to export and self-host. It also feels like we have a lot of $5 test instances.

When phage is executed against a large number of hosts, a bunch of the processes fail immediately. This is likely easy to fix, either some sort of macOS ulimit or one of the sshd knobs. This isn't the limiting factor in any operational activity today but will be some day.

See T12611. See T12857. It would be nice to get logging off hosts and into S3 or some centralized log store. The major advantage is that this simplifies log volume management and reduces headaches associated with disks filling up (and impacting other things) when logging gets active. It also potentially makes log analysis easier in the future. It's not clear what the best pathway forward here is. Some logs (notably, the apache log) can't be captured/directed directly from PHP, although we could pipe them into some PHP helper. Although I'm generally uneasy about building infrastructure which depends on Phabricator working, I'm less uneasy about doing this for logs (which aren't actually very critical), so I think sending everything through Phabricator is at least on the table.

Some other logging considerations:

  • Hosts generate non-instance logs, so we can't always pipe everything to a local instance.
  • S3 can't append.
  • There's some tentative interest from instances in getting some logs via HTTP push, although I think this was several different use cases which are perhaps better addressed in other ways.
  • I definitely don't want to send logs to a third-party service.
  • This bleeds into error monitoring.

I'd like to just do something dumb here, but attaching 50 log volumes by hand feels a little too dumb, and the path to automating that has a lot of separate concerns which I don't want to delve into if I can avoid it: this is the "stabilize everything before we automate provisioning" plan.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
rPHU libphutil: D19363
rP Phabricator: D19368

Related Objects

Mentioned In
T13542: Rebalance Phacility instances into a private subnet
T13242: 2019 Week 5 Bonus Content
T13207: Cycle More AWS Hosts (October 2018)
T12171: Node is now very exciting to install on Ubuntu 14?
T13182: Something in the Ubuntu environment changed, affecting node / "ws"
T13164: Plans: 2018 Week 31 - 33 Bonus Content
T13156: Plans: Improve Phacility UI for managing instance managers and cards
T13168: epriestley new laptop / not reading documentation setup issues
T13167: AWS is rebooting several production hosts (July 2018)
T13124: Plans: 2018 Week 16 Bonus Content
T12414: Implement Almanac edit endpoints in Conduit
T12999: Replace cluster magnetic volumes with SSD volumes
T12847: A Pathway Towards Private Clusters
T13088: Plans: Harbormaster UI usability and interconnectedness
T12218: Reduce the operational cost of a larger Phacility cluster
Mentioned Here
T13542: Rebalance Phacility instances into a private subnet
T8204: Stop using EC2 micro instances
T7811: Phacility Cluster: Apache exited on repo001.phacility.net
T13100: PCRE segfaults readily with default "pcre.backtrack_limit" and "pcre.recursion_limit" values
D19011: Provide a streaming HTTP response parser
T11413: Support renaming Phacility instances
T12297: Make Conduit API calls on `admin.phacility.com` reasonably easy to profile
T12611: Write Phabricator HTTP and SSH logs in the production cluster
T12646: Reduce the impact of "admin" backups on instances
T12801: Simplify Almanac services in the Phacility production cluster
T12830: Disentangle "repoXYZ = dbXYZ" in the Phacility cluster
T12857: Temporary directory fullness can cause daemon issues?
T12917: Move domain name registration and SSL to AWS
T12988: Remove flag "--master" from bin/remote
T12999: Replace cluster magnetic volumes with SSD volumes
T13062: Trying to manage anything in Gsuite is kind of not great?

Event Timeline

epriestley triaged this task as Normal priority. Feb 14 2018, 2:10 PM
epriestley created this task.
epriestley renamed this task from "Plans: Phacility cluster infrastructure improvements" to "Plans: Phacility cluster caching, renaming, and rebalance/compaction". Feb 14 2018, 2:18 PM
epriestley added a commit: Restricted Diffusion Commit. Mar 9 2018, 7:57 PM

When phage is executed against a large number of hosts, a bunch of the processes fail immediately. This is likely easy to fix, either some sort of macOS ulimit or one of the sshd knobs. This isn't the limiting factor in any operational activity today but will be some day.

This turned out to be the default MaxStartups setting in default.conf for sshd, which limited inbound connections to the bastion. I raised it to 1024, and phage can now hit all 50 repo+db hosts in about 4 seconds.
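
For reference, the fix amounts to a single sshd directive; a minimal sketch, assuming the default.conf mentioned above is part of the bastion's sshd configuration:

# Raise the cap on concurrent unauthenticated connections so parallel phage
# workers aren't dropped at the bastion (the stock OpenSSH default is 10:30:100).
MaxStartups 1024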

epriestley added a revision: Restricted Differential Revision. Mar 30 2018, 10:22 PM
epriestley added a commit: Restricted Diffusion Commit. Mar 30 2018, 10:25 PM

The UI and allocation logic for shards need tweaking.

D19275 fully decouples the "dbX = repoX" shard allocation logic so the tiers may have different sizes and they can be rebalanced independently without causing the balancing algorithm to do anything weird in the general case.

I'd like to analyze data on the $5 tier with an eye to removing it. Anecdotally, this doesn't seem to be doing anything good for us ... It also feels like we have a lot of $5 test instances.

Although it has only been live for about 8 months, only two instances have ever upgraded from the "starter" plan to the "standard" plan, and both look like users testing things or making mistakes.

I think this plan is a net negative: it adds more friction to instance creation and creates a huge pricing jump from 10 to 11 users where your bill goes from $50/month to $220/month. When instances reasonably resolve this by self-hosting, it creates a support headache.

One very clever instance has resolved this by disabling their 10th user, inviting an 11th user, then re-enabling the 10th user to get around the limit. I intentionally allowed this when building the user limit since preventing it was fairly difficult, but I am pleased to see that users are industrious and ethically ambitious enough to take advantage of it. This is hard to scale too far -- to invite a 20th user you need to disable 10 users, invite the 20th user, then re-enable 10 users -- but could perhaps be scripted with enough motivation.

I'm going to disable this plan for new instances.

epriestley added a revision: Restricted Differential Revision. Mar 30 2018, 10:57 PM
epriestley added a commit: Restricted Diffusion Commit. Mar 30 2018, 11:00 PM

I definitely don't want to send logs to a third-party service.

Does that include AWS CloudWatch? I've used it previously, and it's at least very easy to set up. We just install their little logging agent, point its config file at our logs, and stuff magically appears in the AWS console (and is accessible via their API). There's no magic "dump logs to S3 all the time" button that I can find, which is surprising, but we can do a cron job that kicks off export jobs using this API.

I think the only point of moving stuff out of CloudWatch and into S3 is that we could then pay less for CloudWatch by using S3 for retention, and set CloudWatch's retention to something aggressive like a week. If we didn't want to use S3 for retention, we could also redirect CloudWatch logs into Kinesis Firehose or whatever else and use that as our log aggregation layer.
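
As a sketch of the cron-driven export idea (the bucket and log group names here are hypothetical), the job could be little more than a daily CreateExportTask call:

# Export yesterday's events from one log group to S3; CloudWatch Logs expects
# millisecond timestamps, hence the trailing "000".
aws logs create-export-task \
  --log-group-name "/phacility/apache-error" \
  --from $(date -d "yesterday 00:00" +%s)000 \
  --to $(date -d "today 00:00" +%s)000 \
  --destination "phacility-log-archive" \
  --destination-prefix "apache-error"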

I think the real question is how much we like the existing CloudWatch alarms I set up last year, and whether we want to put more investment into that. Installing the local CloudWatch agent would also let us collect more custom metrics (daemon queue depth and DB stats immediately leap to mind) and alarm on them. CloudWatch also lets you alarm on log keywords (count("Segmentation Fault") > X/hour) or log volume.

I'm fine with CloudWatch logs; I didn't realize it had a logging thing. I'm currently not terribly warm on investing a ton in CloudWatch alarms or log analysis, but if we can get logs reliably streaming into some storage service so they can't fill up the disk, I think that's quite valuable on its own. We can always replace that with something we control directly in the future if we want to invest more here. I can't imagine ever really caring about old logs no matter what happens in the future.

(But I don't think this is too valuable if it doesn't solve the "logs fill up disks" problem -- does the agent rotate/reap the logs? In my mind, the only real problem here is "logs fill up disks", and all the analysis stuff is like a vague future thing we might want to look at some day.)

Possibly something dumb like "mount an EFS volume on /mnt/logs/ on every host" is another AWS-only approach. That feels really dumb but maybe it's only somewhat dumb?

In my mind, the only real problem here is...

I'd also really like a command like phage tail <some-instance-name> <some-log-type> that can run anywhere (or at least run on the bastion).

That feels really dumb but maybe it's only somewhat dumb?

I'm philosophically opposed to relying on locally-mounted disks for logs in general (except for storing backlog when the network-based log infrastructure is slow). I picture log lines as desperately wanting to be free, itching to escape before some disaster befalls the host they were generated on. I know EBS volumes don't exactly go down with the ship, but if a host dies and we want to check the logs, it's kind of a hassle to force-detach the volume and remount it somewhere else (and hope that the lines we're looking for got flushed to disk before the machine died).

How are we rotating log files right now? Are we relying on whatever apache/sshd/etc do by default, or do we use logrotate, or what? I'd like to live in a world where everything goes through a centralized syslog/CloudWatch-esque layer, and we write the logs to local disk as an afterthought.

But I don't think this is too valuable if it doesn't solve the "logs fill up disks" problem -- does the agent rotate/reap the logs?

It looks like the "old" version of the CloudWatch agent can rotate the logs, but the "new/unified" agent doesn't have those options? Either way, if we're going to continue to rely on local disk log files, we should just make a logrotate config that deletes logs older than a few days:

The rotate keyword allows you to specify how many old, rotated log files are kept before logrotate deletes them. The rotate keyword requires an integer to specify how many old log files are kept.

Example:

/var/log/myapp/*.log {
  # rotate once a file grows past 10k (checked each time logrotate runs)
  size 10k
  weekly
  # keep 8 rotated files, then delete the oldest
  rotate 8
}

phage tail <some-instance-name> <some-log-type>

There's some flavor of bin/remote ssh <some-host> -- tail -f <some-log-location>. We could buff this, but I basically never actually care about examining logs in a vacuum when doing ops stuff today. (If I do, I'm usually looking at ps auxwww on the host first anyway, and can just use tail normally.)

I'd be more eager to invest time in a real solution here (aggregate, parse, alert) if logs were routinely valuable, but they don't seem all that important in accomplishing any particular goal today. T13100 did have a "segfault" in the logs, but I could never have figured it out without the corresponding email report, so even if we had an alert on it we'd likely have just ignored it until the email report showed up.

How are we rotating log files right now?

They're (mostly) rotated by bin/host upgrade when hosts are deployed.

I disabled logrotate for httpd in T7811 because, under memory pressure, it would kill apache dead at some random time while I was asleep.

The "logs fill up disks" problem is more "sudden bursts of unusual error/activity that cause log rates to increase 1000x+ fill up disks", not "logs don't rotate / normal log volumes fill up disks". Most logs do already rotate today, I think with the exception of the aphlict log being a bit of a weird case, maybe.

There's some flavor of bin/remote ssh <some-host> -- tail -f <some-log_location>.

That's true, but what I'm envisioning wouldn't rely on the host in question actually being alive. phage tail would be making API calls against CloudWatch (or the equivalent), and you wouldn't have to know the location of the logfile. This all goes back to my general philosophy of "every time you SSH to a host to do something routine, that represents a failure of tooling".
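
As a sketch of what phage tail could wrap (the per-instance log group naming is purely hypothetical), the underlying call is just a CloudWatch Logs query:

# Fetch recent events for one instance's log without touching the host itself;
# "/phacility/<instance>/<log-type>" is an assumed naming scheme.
aws logs filter-log-events \
  --log-group-name "/phacility/turtle/apache-error" \
  --limit 100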

As far as the value of (aggregate, parse, alert) goes, I've had great luck at other organizations wiring up stuff like Sentry to an application's error logs, so you get a nice dashboard with info like "hey, these 300 errors have been happening routinely in the background forever and you can ignore them, but THIS error just started firing at the rate of 1500/second and the timing correlates with the most recent deploy". That's outside the scope of this "get logs squared away" task, but it's a great example of how we might proactively catch something like T13100.

I believe there are zero or nearly-zero actionable errors in the log today and that if we had a perfect Sentry-like system it wouldn't actually help us identify or fix any errors -- we'd just ignore it after a week like we currently ignore all the other production error logs.

T13100 didn't produce an actionable error that we could have proactively caught (it just said "apache segfaulted" with no identifying information) and I think that basically nothing that anyone ever reports via Support is something that we could have caught and fixed by examining the error logs (I can't recall any cases offhand, at least).

We've also never actually lost a host (in a way such that the logs became inaccessible) other than micro instances in T8204, so I think making logs durable after the death of a host is currently low-value.

I'm totally onboard with the value of this stuff in theory, I just don't think it would have any meaningful impact on customers or operations or support today.

I don't think this is a very hard problem (the total log volume is very low, and all our logging could trivially be handled on one host), and we could build a really great log aggregation system in a few days that did exactly what we wanted and worked perfectly for our environment (e.g., totally accurate aggregation of PHP stacks, awareness of metadata like instances and users and shards, etc). I just think it's probably not a very useful thing to spend time on since it doesn't actually solve any customer or operational problems except "bursty log volumes fill up disks", and ideally we can solve that in a few minutes instead of a few days.

That makes sense. I was coming at it more from the direction of “if we’re going to touch logs, let’s solve it forever”. How about we make everything log locally through syslog and use logrotate to avoid filling disks? Then we can come back and point syslog at something more elaborate later without having to touch all the things that currently generate logs.

Yeah, I think "solve it forever" is probably "first-party Sentry-like application". This just seems like a problem that should have a 15-minute "solve it for now" solution, in the form of something that gives us "a thing that looks like a disk but never fills up" that will hold for a year or two until we build "Phentry".

How often does the size directive in a logrotate script get evaluated? If logrotate is run by cron and the default is "daily" in /etc/cron.daily/logrotate, we'd probably need to change that too, and then we'd have new dependencies on both cron and logrotate on all production hosts (and syslog if we're routing through that). Not the end of the world but feels like maybe more of a step sideways than forward.

How often does the size directive in a logrotate script get evaluated?

Daily by default, as you speculated. We can change that to run once an hour or whatever. The default syslog has its own rotation implementation that also runs daily by default (as configured in /etc/cron.daily/sysklogd).
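
For example (paths per the stock Ubuntu layout; adjust as needed), checking "size" hourly is just a matter of moving the distribution's existing cron script:

# Run logrotate hourly instead of daily so the "size" directive is evaluated
# more often; the logrotate configs themselves stay unchanged.
mv /etc/cron.daily/logrotate /etc/cron.hourly/logrotate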

Personally, I feel pretty comfortable depending on cron running as promised, but if we aren't comfortable with that, what would the alternative be? It sounds like we'd need a persistent daemon running everywhere, which feels more fragile.

To go in a slightly different direction, having read through the post-mortem in T7811, I'm also somewhat in favor of using systemd, which has only gotten more ubiquitous and well-recommended since 2015.
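
As an illustration (the drop-in path is hypothetical and assumes apache runs under the stock apache2 unit), systemd would have softened the T7811 failure mode by restarting apache automatically when it died:

# /etc/systemd/system/apache2.service.d/restart.conf
[Service]
Restart=always
RestartSec=5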

I think "run weekly with the deploy" already works pretty well for normal log rotation with logs on local disk, and in a perfect world logs would be buffering on disk only briefly before being sent over the network to dedicated log service so we wouldn't need cron there either.

I'm hesitant to add a bunch of new moving parts purely as a stopgap. If we can't reasonably configure "a thing that looks a lot like an infinitely-sized disk", I'd tend to favor something like php -f append-if-smaller-than.php path/to/error.log 100MB which reads stdin and appends it to $1, but only if $1 is smaller than $2. That adds a handful of lines of code, no new services/cron/configuration, and stops the "bursty logs cascade into other types of failures" problem.
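
A rough sketch of that helper, assuming the usage shown above (illustrative, not existing tooling):

<?php
// append-if-smaller-than.php -- sketch of the stopgap described above.
// Usage: some-command | php -f append-if-smaller-than.php path/to/error.log 100MB
// Appends stdin to the target file, but only while the file is smaller than
// the limit, so a burst of log volume can't cascade into disk fullness.

function parse_size($spec) {
  if (!preg_match('/^(\d+)\s*([KMG]?B?)$/i', trim($spec), $matches)) {
    fwrite(STDERR, "Unparseable size: {$spec}\n");
    exit(1);
  }
  $units = array(
    '' => 1, 'B' => 1,
    'K' => 1024, 'KB' => 1024,
    'M' => 1048576, 'MB' => 1048576,
    'G' => 1073741824, 'GB' => 1073741824,
  );
  return (int)$matches[1] * $units[strtoupper($matches[2])];
}

if ($argc !== 3) {
  fwrite(STDERR, "Usage: php -f append-if-smaller-than.php <file> <max-size>\n");
  exit(1);
}

$path = $argv[1];
$limit = parse_size($argv[2]);
$out = fopen($path, 'a');

while (($line = fgets(STDIN)) !== false) {
  clearstatcache(true, $path);
  if (filesize($path) < $limit) {
    fwrite($out, $line);
  }
  // Over the limit: drop lines until rotation shrinks the file again.
}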

This also isn't an especially urgent problem, and it's possible that we just accept the status quo (log to local disk; rotate weekly with deployment; occasionally deal with some mess from disk fullness) and then make the jump to a "proper log service" sooner than we otherwise would have if we'd had an intermediate "infinitely-sized append-only disk" available. If we're on this path, I'd hold this project until after private clusters are deployed, though.

Hmmm, then maybe the narrowest way to solve this is to just set up disk space alarms. That's a thing we'd want independent of how we eventually build the One True Logging System. The CloudWatch agent collects disk stats out of the box, and takes a list of mountpoints to track. Details here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html
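
A minimal slice of the agent config for that (the mountpoints and interval here are illustrative; the full schema is in the linked docs):

{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/", "/mnt/repo"],
        "metrics_collection_interval": 60
      }
    }
  }
}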

epriestley added a revision: Restricted Differential Revision. Apr 18 2018, 3:43 PM
epriestley added a revision: Restricted Differential Revision.
epriestley added a revision: Restricted Differential Revision. Apr 18 2018, 4:37 PM
epriestley added a commit: Restricted Diffusion Commit. Apr 20 2018, 12:25 AM
epriestley added a commit: Restricted Diffusion Commit.
amckinley added a revision: Restricted Differential Revision. Apr 20 2018, 7:26 PM
epriestley added a commit: Restricted Diffusion Commit. Apr 20 2018, 9:03 PM
amckinley added a commit: Restricted Diffusion Commit. Apr 20 2018, 9:08 PM

As a general update here, since a couple of installs are keeping an eye on it: I have a big chunk of the "use instances.state for everything" diff written, but some of the caching logic is tricky and I haven't had a solid chunk of uninterrupted time to stare at it in the last couple of days, so I don't expect to get it out in time for the release this week.

D19383 is going out, and at least gets instances.state into production, albeit in a very secondary role. It also looks like some metrics improvements are shipping.

I somewhat-recently realized that the trackpad and keyboard on my laptop haven't worn out: they've stopped working well because the battery has swollen and physically expanded the case, so the tolerances on most of the keys and the clicking action on the pad don't line up anymore. When you look at it askew, you can see that the whole case has a big bulge in the middle. Thus, I've been spending some time setting up a new development environment on a current-generation MacBook in case my current one literally explodes. This isn't quite set up yet, but I have it mostly working after some work on it this week, which included some infrastructure stuff like moving my personal Phabricator install to AWS domain registration, etc.