Free instances on Phacility are currently relatively expensive for us because we can't fit very many on each host. Although burning piles of money isn't great, this is mostly a problem today because it requires us to put a lot more hardware in production, not because of the direct financial cost. If we could put more instances per host, we could have fewer hosts and would not feel as much operational pressure to improve support for running a large tier. This would also reduce the cost-per-instance, but simplifying ops is a more urgent issue than lighting less money on fire.
Phabricator itself is scaling fine, but tooling support for operational tasks (like performing deployments and doing monitoring) on a larger cluster is underpowered today. Weekly deployments currently go to 44 hosts (16x db, 16x repo, 4x secure, 1x sbuild, 1x saux, 1x build, 1x bastion, 4x web) and all the tooling is one-to-one, so I end up running 100+ commands manually every Saturday morning to do a deployment. As far as I know I haven't made a major mistake yet, but I'm sure I will sooner or later, particularly because we'll be deploying to more like 100 hosts in a few months without changes to the number of instances per host.
I wrote the deployment tools used at Facebook for a tier size on the order of 200,000 hosts, so I'm not worried about scaling things up technically, but this is effort which needs to happen much sooner if we're supporting a rapidly expanding free tier than if we're only supporting paid instances.
There are two major scalability issues with the free tier:
- Instances are immortal: We currently never suspend free tier instances, and many users launch free instances now instead of test instances, then never return. This means that each free instance imposes a small operational burden forever (we must continue upgrading the schema, etc). We could reasonably put a very modest activity requirement on free instances like: if no one logs in for 30 days, send a warning email that the instance will be suspended if it remains unused. If no one logs in for an additional 30 days, suspend the instance.
- Instances run dedicated daemons: Each active instance currently runs its own daemon processes: 1x Overseer, 1x PullLocal, 1x Trigger, and 1x Taskmaster. The memory used by these processes is currently the scalability bottleneck for how many instances we can fit per host.
On one shard which was active when free instances launched, approximately 60% of all active instances have no web sessions established or updated in the last 30 days. I'd guess this underestimates the long-term impact of an activity requirement, because it includes instances which were launched in the last 30 days and haven't yet had 30 days to sit idle. But this would likely reduce the number of active instances by a minimum of 2x, and 4x seems plausible to me (previous changes to automatically suspend paid instances that don't actually pay us had a similarly dramatic effect on culling test instances).
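As a rough check on those numbers, here is a minimal sketch of the arithmetic, assuming every instance counted as inactive eventually gets suspended. The function name `reduction_factor` is made up for illustration.

```python
def reduction_factor(inactive_fraction: float) -> float:
    """Factor by which the active instance count shrinks if every
    inactive instance is suspended (an assumption of this sketch)."""
    if not 0 <= inactive_fraction < 1:
        raise ValueError("inactive_fraction must be in [0, 1)")
    return 1 / (1 - inactive_fraction)


# The ~60% figure observed on one shard, and the inactive fraction that
# a 4x reduction would imply.
observed = reduction_factor(0.60)
four_x = reduction_factor(0.75)
```

With 60% of instances inactive the count drops by about 2.5x; a 4x reduction corresponds to about 75% of instances being inactive.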
The requirement that users establish one web session every 60 days seems very reasonable (perhaps insufficiently onerous, even?). Instances seeing any kind of use should never see this warning. Although it is technically possible to use an instance purely as a repository store or purely over Conduit, which would still be activity but not show up on "web sessions" queries, these seem like extreme edge cases and it seems reasonable to require someone to load a page once every 60 days if they're using Phacility as a storage/API appliance.
This is pretty straightforward to implement, although not trivial because we need to send users a bunch of warning mail and track those warnings on instances. It also won't have much of an impact for a while: the free tier has only been live for ~75 days, and we need to wait 30 days after any initial implementation goes live before we start suspending instances, to give administrators fair warning. So even if this went to production on Saturday (Feb 11) it wouldn't start reducing active instance count until around March 11th.
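The warn-then-suspend rule could be sketched roughly like this. This is an illustration only, not existing Phacility code: the `Action` enum, `next_action`, and the idea of storing a single `warned_at` timestamp per instance are all assumptions, and `last_activity` stands for whatever activity estimate is actually available.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

WARN_AFTER = timedelta(days=30)             # idle time before the warning mail
SUSPEND_AFTER_WARNING = timedelta(days=30)  # further idle time before suspension


class Action(Enum):
    NONE = "none"
    SEND_WARNING = "send_warning"
    CLEAR_WARNING = "clear_warning"
    SUSPEND = "suspend"


def next_action(now: datetime,
                last_activity: datetime,
                warned_at: Optional[datetime]) -> Action:
    """Decide what a periodic job should do for one free instance."""
    # Someone came back after the warning: forget the warning.
    if warned_at is not None and last_activity > warned_at:
        return Action.CLEAR_WARNING
    # Recently active instances never see a warning.
    if now - last_activity < WARN_AFTER:
        return Action.NONE
    # Idle long enough and not yet warned: send the warning mail.
    if warned_at is None:
        return Action.SEND_WARNING
    # Warned and still idle for the full grace period: suspend.
    if now - warned_at >= SUSPEND_AFTER_WARNING:
        return Action.SUSPEND
    return Action.NONE
```

Keeping the decision a pure function of three timestamps means the only new state to track on an instance is when the warning was sent.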
We also don't actually record "last login" anywhere: instead, we refresh sessions if they're more than 20% used up. So we need to put a ~20% buffer on this stuff and can't currently show users a precise "last login date" (we could show "Last login: before X", calculated as expiration - 80% of session length).
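To make the "before X" calculation concrete, here is a minimal sketch. The function name and the 30-day session length in the example are assumptions for illustration; only the 20% refresh threshold comes from the description above.

```python
from datetime import datetime, timedelta

REFRESH_THRESHOLD = 0.20  # sessions are refreshed once more than 20% used up


def latest_possible_login(session_expires: datetime,
                          session_length: timedelta) -> datetime:
    """Upper bound on the last time the session was actually used.

    The last refresh happened at (expiration - session length). Any use
    more than 20% of a session length after that would have triggered
    another refresh, so the last use is no later than
    expiration - 80% of session length.
    """
    last_refresh = session_expires - session_length
    return last_refresh + session_length * REFRESH_THRESHOLD


# Hypothetical example: a 30-day session that expires on Jan 31.
bound = latest_possible_login(datetime(2020, 1, 31), timedelta(days=30))
```

In this example the bound works out to Jan 7, so the UI could show "Last login: before Jan 7", and the suspension logic would need to treat activity as possibly being up to 6 days older than that.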
I think we should pursue this in the short term unless we expect to turn off free tier registrations in less than a month.
Phabricator currently relies on each instance having multiple active daemon processes. In theory, we can merge these daemons into a "megadaemon" which runs daemon activity for multiple instances.
There are three daemons, and each daemon has different requirements.
Taskmaster Daemons: These send mail, update search indexes, import repositories, etc. For most instances, they are sitting idle most of the time because there are no tasks in queue. If we could have them run only when tasks are in queue, these processes would not need to be running most of the time.
However, they need to start quickly when there are tasks in queue. Although it might be technically acceptable, it's undesirable for mail or notifications or feed stories to lag by 15 seconds after you take an action. Currently, this latency is on the order of 1 second.
The overseer could conceivably poll the activetask table and start taskmasters only if the table was nonempty (put another way: allow the pool to autoscale down to 0 nodes). Better would be to have instances queue themselves into some "needs attention" list. A mega-daemon could then poll the list and start taskmaster pools as required. This requires a significant deviation from how daemons work and a nontrivial amount of upstream code which only supports free instancing in the cluster.
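A rough sketch of the "needs attention" idea, using in-memory structures in place of real database tables and processes. Everything here (`MegaDaemon`, `poll`, the `queue_depth` mapping) is hypothetical and only illustrates the control flow; a real version would read each instance's task queue and launch actual taskmaster processes.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class MegaDaemon:
    # instance name -> number of running taskmasters (stand-in for real pools)
    pools: Dict[str, int] = field(default_factory=dict)

    def poll(self, needs_attention: Set[str], queue_depth: Dict[str, int]) -> None:
        """One polling pass over the shared "needs attention" list."""
        # Drain the list, starting a pool for each instance with queued work.
        while needs_attention:
            instance = needs_attention.pop()
            if queue_depth.get(instance, 0) > 0:
                self.pools.setdefault(instance, 1)
        # Scale pools whose queues have emptied back down to zero.
        for instance in list(self.pools):
            if queue_depth.get(instance, 0) == 0:
                del self.pools[instance]


daemon = MegaDaemon()
attention = {"alpha"}               # "alpha" queued a task and flagged itself
depth = {"alpha": 3, "beta": 0}     # pending tasks per instance
daemon.poll(attention, depth)       # starts a pool for "alpha" only
```

The polling interval bounds the added latency, so keeping close to the current ~1 second behavior would mean polling the list roughly that often.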
Trigger Daemons: These run periodic events, garbage collection, Nuance imports, and Calendar imports. Currently, most instances only use GC, and would not be negatively impacted if these daemons ran periodically every few hours instead of continuously. Once Calendar and Nuance unprototype, these will potentially need to run continuously so they can poll datasources for Calendar imports and Nuance updates, and deliver event notifications promptly. Future applications, like a hypothetical Pager/Oncall application, may have additional notification requirements.
PullLocal Daemons: These fetch/import/update remote repositories. They must run more or less continuously for instances which observe remote repositories in order to have reasonable latency for discovering commits. Although they have long idle periods in many cases, inactive instances which observe third-party repositories that they don't own will see regular commit activity and have short sleep periods (for example, if test-lul-lol.phacility.com imports Phabricator itself, its daemon will be regularly active).
We could also consider reducing the service level for free instances (say: no observing repositories, no importing calendars, no importing Nuance sources), but I think that repository observation is important for onboarding, and I generally don't like having different service levels for different tiers. Beyond various user costs, they make everything more complicated to operate and administrate. If we do degrade service for the free tier, I'd rather do it through parameter tuning (e.g., make instances a little less responsive) than through explicitly removing features.
All of this is quite complicated and I don't have a technical plan for it yet. A conservative plan is probably to support autoscaling pool sizes down to 0, which gives us a ~4x improvement in instances per host without a tremendous amount of complexity. A more aggressive plan is to build "mega-daemons", which let us collapse multiple instances down into one daemon. This is more complex.
I'd like to try to develop a technical plan which has headroom for mega-daemons but routes us through scale-to-0 first, and hopefully have a fairly logical progression where one mostly builds on the other and we can send scale-to-0 into production first.