The current instance metadata caching strategy isn't great. The new instances.state method is intended to replace it, but is not yet in production.
- See T11413. See PHI63. See PHI107. See PHI128. These are instance rename requests mostly blocked by moving to instances.state.
Maybe see T12801 alongside this. See also T12646. This may moot T12297.
I'd like to rebalance and compact the tier. We have a large amount of extra hardware from when we supported free instances, and some of it has fallen off the reserved instance list, so this is pretty much a money fire. There are two major issues:
- (D19011, etc) Downloading 2GB+ files via arc doesn't work since we store the whole response body in a big string.
- (T12830) The UI and allocation logic for shards need tweaking.
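To illustrate the 2GB+ download problem: the usual way out of "the whole response body in a big string" is to copy the body to disk in fixed-size chunks. This is a minimal Python sketch of that idea, not the actual arc change in D19011; `stream_to_file`, `CHUNK_SIZE`, and the file-like `response` argument are all names assumed for the example.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def stream_to_file(response, dest_path, chunk_size=CHUNK_SIZE):
    """Copy a file-like response body to disk one chunk at a time.

    `response` is any object with a read(n) method (hypothetical here).
    Memory use is bounded by chunk_size, not by the size of the body.
    Returns the number of bytes written.
    """
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                # An empty read means the body is exhausted.
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

In Python, the object returned by `urllib.request.urlopen` could be passed as `response`, since it supports `read(n)`. The point is only that nothing in the loop ever holds more than one chunk.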
See T12999. This likely makes sense alongside rebalance/compaction.
See T12988. This is toolset cleanup.
See T12608. It would be nice to complete this key cycling.
See T12917. It would be nice to complete these domain transfers. This is blocked by needing wildcard MX. I have personal wildcard MX and may just use this. See also T13062.
I'd like to analyze data on the $5 tier with an eye to removing it. Anecdotally, this doesn't seem to be doing anything good for us, and it creates headaches: the huge $50/month -> $220/month jump at 11 users is enough to prompt instances to export and self-host. It also feels like we have a lot of $5 test instances.
When phage is executed against a large number of hosts, a bunch of the processes fail immediately. This is likely easy to fix: it's probably either some sort of macOS ulimit or one of the sshd knobs. This isn't the limiting factor in any operational activity today but will be some day.
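Until the underlying limit is identified, one generic workaround is to cap how many processes run at once. The sketch below shows that technique in Python; it is not how phage works, and `run_on_hosts`, `build_command`, and `max_parallel` are hypothetical names. It sidesteps the limit rather than fixing it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_hosts(hosts, build_command, max_parallel=8):
    """Run one command per host with at most max_parallel running at once.

    build_command(host) is a hypothetical callable returning an argv list.
    Returns a dict mapping each host to its exit code.
    """
    def run_one(host):
        completed = subprocess.run(build_command(host), capture_output=True)
        return host, completed.returncode

    # The pool size is the concurrency cap: no more than max_parallel
    # child processes exist at any moment.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run_one, hosts))
```

If failures disappear below some `max_parallel` value, that value is a hint about which limit (local process/file descriptor limits, or connection limits on the sshd side) is being hit.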
See T12611. See T12857. It would be nice to get logging off hosts and into S3 or some centralized log store. The major advantage is that this simplifies log volume management and reduces headaches associated with disks filling up (and impacting other things) when logging gets active. It also potentially makes log analysis easier in the future. It's not clear what the best pathway forward here is. Some logs (notably, the Apache log) can't be captured or redirected directly from PHP, although we could pipe them into some PHP helper. Although I'm generally uneasy about building infrastructure which depends on Phabricator working, I'm less uneasy about doing this for logs (which aren't actually very critical), so I think sending everything through Phabricator is at least on the table.
Some other logging considerations:
- Hosts generate non-instance logs, so we can't always pipe everything to a local instance.
- S3 can't append.
- There's some tentative interest from instances in getting some logs via HTTP push, although I think this was several different use cases which are perhaps better addressed in other ways.
- I definitely don't want to send logs to a third-party service.
- This bleeds into error monitoring.
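Since S3 can't append, one common shape for this is to buffer lines locally and write each batch as its own immutable object under a sequenced key. The following is a minimal Python sketch of that pattern under stated assumptions: `ChunkedLogWriter` and `put_object` are hypothetical names, `put_object(key, body)` stands in for whatever upload call is used, and nothing here talks to S3.

```python
class ChunkedLogWriter:
    """Buffer log lines and flush them as separate, immutable objects.

    put_object(key, body) is a hypothetical upload callable; each flush
    writes a new key instead of appending to an existing one.
    """

    def __init__(self, put_object, prefix, max_bytes=1024 * 1024):
        self.put_object = put_object
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.sequence = 0

    def write(self, line):
        data = line.encode("utf-8") + b"\n"
        self.buffer.append(data)
        self.buffered_bytes += len(data)
        # Flush once the buffered batch reaches the size threshold.
        if self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Zero-padded sequence numbers keep keys in lexical order.
        key = "%s/%08d.log" % (self.prefix, self.sequence)
        self.put_object(key, b"".join(self.buffer))
        self.sequence += 1
        self.buffer = []
        self.buffered_bytes = 0
```

Reading the objects under a prefix in key order reconstructs the log. The sequence counter in this sketch resets when the process restarts, so a real version would need something sturdier in the key (a timestamp, for example), and lines still sit on the host until a flush happens.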
I'd like to just do something dumb here, but attaching 50 log volumes by hand feels a little too dumb and the path to automating that has a lot of separate concerns which I don't want to delve into if I can avoid it: this is the "stabilize everything before we automate provisioning" plan.