
Reduce the hardware cost of Phacility free instances
Closed, Invalid (Public)

Description

Free instances on Phacility are currently relatively expensive for us because we can't fit very many on each host. Although burning piles of money isn't great, this is mostly a problem today because it requires us to put a lot more hardware in production, not because of the direct financial cost. If we could put more instances per host, we could have fewer hosts and would not feel as much operational pressure to improve support for running a large tier. This would also reduce the cost-per-instance, but simplifying ops is a more urgent issue than lighting less money on fire.

Phabricator itself is scaling fine, but tooling support for operational tasks (like performing deployments and doing monitoring) on a larger cluster is underpowered today. Weekly deployments currently go to 44 hosts (16x db, 16x repo, 4x secure, 1x sbuild, 1x saux, 1x build, 1x bastion, 4x web) and all the tooling is one-to-one, so I end up running 100+ commands manually every Saturday morning to do a deployment. As far as I know I haven't made a major mistake yet, but I'm sure I will sooner or later, particularly because we'll be deploying to more like 100 hosts in a few months without changes to the number of instances per host.

I wrote the deployment tools used at Facebook for a tier size on the order of 200,000 hosts, so I'm not worried about scaling things up technically, but this effort needs to happen much sooner if we're supporting a rapidly expanding free tier than if we're only supporting paid instances.


There are two major scalability issues with the free tier:

  • Instances are immortal: We currently never suspend free tier instances, and many users launch free instances now instead of test instances, then never return. This means that each free instance imposes a small operational burden forever (we must continue upgrading the schema, etc). We could reasonably put a very modest activity requirement on free instances like: if no one logs in for 30 days, send a warning email that the instance will be suspended if it remains unused. If no one logs in for an additional 30 days, suspend the instance.
  • Instances run dedicated daemons: Each active instance currently runs its own daemon processes: 1x Overseer, 1x PullLocal, 1x Trigger, and 1x Taskmaster. The memory used by these processes is currently the scalability bottleneck for how many instances we can fit per host.

Activity Requirement

On one shard which was active when free instances launched, approximately 60% of all active instances have no web sessions established or updated in the last 30 days. I'd guess this underestimates the long-term impact of an activity requirement, because it includes instances which were launched in the last 30 days and haven't yet had 30 days to sit idle. But this would likely reduce the number of active instances by a minimum of 2x, and 4x seems plausible to me (previous changes to automatically suspend paid instances that don't actually pay us had a similarly dramatic effect on culling test instances).

The requirement that users establish one web session every 60 days seems very reasonable (perhaps insufficiently onerous, even?). Instances seeing any kind of use should never see this warning. Although it is technically possible to use an instance purely as a repository store or purely over Conduit, which would still be activity but not show up in "web sessions" queries, these seem like extreme edge cases, and it seems reasonable to require users to load a page once every 60 days if they're using Phacility as a storage/API appliance.

This is pretty straightforward to implement, although not trivial because we need to send users a bunch of warning mail and track those warnings on instances. It also won't have much of an impact for a while: the free tier has only been live for ~75 days, and we need to wait 30 days after any initial implementation goes live before we start suspending instances, to give administrators fair warning. So even if this went to production on Saturday (Feb 11) it wouldn't start reducing active instance count until around March 11th.

We also don't actually record "last login" anywhere: instead, we refresh sessions if they're more than 20% used up. So we need to put a ~20% buffer on this stuff and can't currently show users a precise "last login date" (we could show "Last login: before X", calculated as expiration - 80% of session length).
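
The bound itself is simple arithmetic. As a sketch (assuming a hypothetical 30-day session lifetime, not the real session configuration, and plain PHP rather than actual Phabricator code):

```php
<?php

// Back-of-the-envelope sketch of the "Last login: before X" bound described
// above, assuming a hypothetical 30-day session lifetime.
function last_login_before(int $session_expires_epoch): int {
  $session_length = 30 * 24 * 60 * 60;

  // Sessions are refreshed once they are more than 20% used up, so if this
  // session has not been refreshed, the last page load happened before the
  // point at which 80% of the lifetime still remained.
  $bound = $session_expires_epoch - (int)(0.8 * $session_length);

  // If the session is still inside its first 20%, the bound is just "now".
  return min($bound, time());
}

// Example: a session expiring 10 days from now implies the last login was
// more than 14 days ago.
echo date('Y-m-d', last_login_before(time() + 10 * 24 * 60 * 60)), "\n";
```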

I think we should pursue this in the short term unless we expect to turn off free tier registrations in less than a month.


MegaDaemons

Phabricator currently relies on each instance having multiple active daemon processes. In theory, we can merge these daemons into a "megadaemon" which runs daemon activity for multiple instances.

There are three daemons, and each daemon has different requirements.

Taskmaster Daemons: These send mail, update search indexes, import repositories, etc. For most instances, they are sitting idle most of the time because there are no tasks in queue. If we could have them run only when tasks are in queue, these processes would not need to be running most of the time.

However, they need to start quickly when there are tasks in queue. Although it might be technically acceptable, it's undesirable for mail or notifications or feed stories to lag by 15 seconds after you take an action. Currently, this latency is on the order of one second.

The overseer could conceivably poll the activetask table and start taskmasters only if the table was nonempty (put another way: allow the pool to autoscale down to 0 nodes). Better would be to have instances queue themselves into some "needs attention" list. A mega-daemon could then poll the list and start taskmaster pools as required. This requires a significant deviation from how daemons work, and a nontrivial amount of upstream code whose only purpose is to support free instancing in the cluster.
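
A minimal sketch of the first option, assuming a PDO handle on the instance's worker database and a caller-supplied routine that launches the pool (the table name is an assumption about the schema, and this is not real overseer code):

```php
<?php

// Minimal sketch of "autoscale the pool down to 0": a supervisor polls for
// queued tasks and only launches taskmasters when the queue is nonempty.
// Table name and the $start_pool callable are assumptions.
function wait_for_work(PDO $db, callable $start_pool): void {
  while (true) {
    $row = $db->query('SELECT 1 FROM worker_activetask LIMIT 1')
      ->fetchColumn();
    if ($row !== false) {
      $start_pool(); // e.g. fork a taskmaster pool for this instance
      return;        // hand off; the pool scales itself back down when idle
    }
    sleep(1);
  }
}
```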

Trigger Daemons: These run periodic events, garbage collection, nuance imports, and Calendar imports. Currently, most instances only use GC, and would not be negatively impacted if these daemons ran periodically every few hours instead of continuously. Once Calendar and Nuance unprototype, these will potentially need to run continuously so they can poll datasources for Calendar imports and Nuance updates, and deliver event notifications promptly. Future applications, like a hypothetical Pager/Oncall application, may have additional notification requirements.

PullLocal Daemons: These fetch/import/update remote repositories. They must run more or less continuously for instances which observe remote repositories in order to have reasonable latency for discovering commits. Although they have long idle periods in many cases, inactive instances which observe third-party repositories that they don't own will see regular commit activity and have short sleep periods (for example, if test-lul-lol.phacility.com imports Phabricator itself, its daemon will be regularly active).
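
For illustration, the general shape of that per-repository polling behavior looks something like the sketch below (the constants are invented for the example, not the daemon's actual tuning):

```php
<?php

// Illustrative only: the general shape of per-repository polling backoff.
// Poll frequently while commits are being discovered; back off toward a
// long sleep while the remote stays quiet.
function next_pull_delay(int $current_delay, bool $found_commits): int {
  if ($found_commits) {
    return 15; // active remote: poll again soon
  }
  // Quiet remote: back off geometrically, capped at a few hours.
  return min(max($current_delay, 15) * 2, 4 * 60 * 60);
}
```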

We could also consider reducing the service level for free instances (say: no observing repositories, no importing calendars, no importing Nuance sources), but I think that repository observation is important for onboarding, and I generally don't like having different service levels for different tiers. Beyond various user costs, they make everything more complicated to operate and administrate. If we do degrade service for the free tier, I'd rather do it through parameter tuning (e.g., make instances a little less responsive) than through explicitly removing features.

All of this is quite complicated and I don't have a technical plan for it yet. A conservative plan is probably to support autoscaling pool sizes down to 0, which gives us a ~4x improvement in instances per host without a tremendous amount of complexity. A more aggressive plan is to build "mega-daemons", which let us collapse multiple instances down into one daemon. This is more complex.

I'd like to try to develop a technical plan which has headroom for mega-daemons but routes us through scale-to-0 first, and hopefully have a fairly logical progression where one mostly builds on the other and we can send scale-to-0 into production first.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

Can't the same tricks used to make a single web server handle multiple instances be used for the daemons?

Also: I think it's reasonable that if no event has happened in some time (a week? a month?), the Taskmaster would take 15 seconds to come back when it's needed (i.e., allow auto-scaling to 0 if the last task was a long time ago, although I'm not familiar with how the overseer and daemon scaling actually work).

The PullLocal daemons have a back-off already, and theoretically when the back-off becomes high enough they could go offline and have the overseer / trigger run them again at a known date (Although this sounds like a lot of new code).

Can't the same tricks used to make a single web server handle multiple instances be used for the daemons?

Not directly. No webserver process ever handles more than one instance simultaneously: a request comes in, a new process starts, that new process figures out which instance it is for, then execution occurs normally. Once the page is done, the process state is thrown away.

Daemons don't work like that, and we'd need application changes to make them work like this. For example, PhabricatorEnv::getEnvConfig('...') doesn't make sense if several instances with potentially different configuration are executing in the same process. We'd need to do something like pass a $current_instance object around everywhere, and put all global state (caches, config, database connections, etc) on that. We'd also need to audit all our use of static and make sure we never put any potentially variant data into static properties. We're pretty good about this, but probably not perfect, and the cost of getting this wrong might be very high (policy or security bugs).

This is probably not impossible, but it's very involved and I don't think we stand to gain much from it. We're still single-threaded, so we'd only ever really be executing one instance at a time anyway, and the startup/shutdown overhead of a PHP process is not very large (~100ms) compared to the acceptable average latency on this stuff (1-2s). At least for the foreseeable future, I think we're best off avoiding any approach where one process executes actual application code for multiple instances.

That is, the multi-process approach looks like this:

  • Process A starts.
  • Process A does work for instance A.
  • Process A exits.
  • Process B starts.
  • Process B does work for instance B.
  • Process B exits.
  • Process C starts.
  • Process C does work for instance C.
  • Process C exits.

The single-process approach looks like this:

  • Process A starts.
  • Process A does work for instance A.
  • Process A switches context to instance B.
  • Process A does work for instance B.
  • Process A switches context to instance C.
  • Process A does work for instance C.

I think this distinction only matters if "exit + start" has a high CPU cost and we're CPU bound. But it doesn't cost much (relative to latencies and job sizes) and we're memory bound, not CPU bound, at least for now. Both approaches should basically have the same memory requirements (if anything, multi-process should be a little lower). But I think the engineering cost to get us to the context switch approach is way higher.

If we get something like a 100x improvement on the memory stuff we may end up CPU bound and want to pursue this. By reusing execution environments in the webserver we can potentially get performance improvements, too, and the two projects would be blocked by many of the same application-level changes (particularly, making sure that our use of static is always safe). But I'd guess this is a year or more away.
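
For what it's worth, the context-object version (from the earlier paragraph about PhabricatorEnv::getEnvConfig()) would look something like this hypothetical sketch; the class and method names here are invented for illustration, not actual Phabricator APIs:

```php
<?php

// Hypothetical sketch of the "pass a $current_instance object around"
// approach: all per-instance state hangs off an explicit context object
// instead of static/global state like PhabricatorEnv::getEnvConfig().
final class InstanceContext {
  public function __construct(
    private string $name,
    private array $config,
  ) {}

  public function getName(): string {
    return $this->name;
  }

  public function getConfig(string $key): mixed {
    // Caches and database connections would live here too, so state from
    // two instances can never leak through static properties.
    return $this->config[$key] ?? null;
  }
}

// Every entry point then takes the context explicitly instead of consulting
// the environment:
function send_instance_mail(InstanceContext $ctx, string $body): void {
  $from = $ctx->getConfig('metamta.default-address');
  // ... build and queue mail using only state reachable from $ctx ...
}
```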

Also: I think it's reasonable that if no event has happened in some time (a week? a month?), the Taskmaster would take 15 seconds to come back when it's needed (i.e., allow auto-scaling to 0 if the last task was a long time ago, although I'm not familiar with how the overseer and daemon scaling actually work).

If everything else worked and the only problem we had was that Taskmasters took a long time to start up again after a long downtime, I think we could solve this in about 30 minutes of work:

  • When an instance inserts a task, it also does an INSERT IGNORE INTO some_global_database.ready-instances (whichInstance, when) VALUES ("instance name", UNIX_TIMESTAMP()).
  • Whatever is waiting 15 seconds just does a SELECT from that table instead, and starts the taskmaster after 1 second instead of after 15 seconds (see the sketch below).
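
A minimal sketch of both halves, with names lightly adapted from the pseudo-SQL above (the schema is an assumption; nothing like it exists today):

```php
<?php

// Sketch of the "needs attention" queue: producers mark their instance as
// ready when they queue a task; a single consumer polls the global table
// and launches taskmasters. Schema and helper names are assumptions.

// Producer side: called by an instance right after it inserts a task.
function mark_instance_ready(PDO $db, string $instance): void {
  $db->prepare(
    'INSERT IGNORE INTO ready_instances (whichInstance, readyAt)
       VALUES (?, UNIX_TIMESTAMP())')
    ->execute([$instance]);
}

// Consumer side: whatever used to sleep for 15 seconds polls this table
// once a second instead.
function drain_ready_instances(PDO $db, callable $start_taskmasters): void {
  $instances = $db->query('SELECT whichInstance FROM ready_instances')
    ->fetchAll(PDO::FETCH_COLUMN);
  foreach ($instances as $instance) {
    $start_taskmasters($instance);
    $db->prepare('DELETE FROM ready_instances WHERE whichInstance = ?')
      ->execute([$instance]);
  }
}
```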

I think this behavior would be reasonable, but accepting it doesn't make the problem much easier, because it only saves a little bit of work.

The PullLocal daemons have a back-off already, and theoretically when the back-off becomes high enough they could go offline and have the overseer / trigger run them again at a known date

Yeah, the immediate issue is that autoscaling is entirely in the actual daemons right now, and the overseer doesn't know anything about how offline/restarting works or about what the daemons are doing.

The daemons either tell the overseer "I'm busy" or "I'm idle". The overseer starts more daemons if all the daemons are busy for a while and there's room, and it stops some daemons if the daemons are idle for a while and there's more than one daemon.

It can't stop the last daemon because it doesn't know anything about when they need to be restarted. It could just restart them periodically, but it can't wait too long (it's probably not OK if an active instance routinely takes 60 seconds to start sending mail). So just having a static "restart every so often" would hurt responsiveness a bit and not get us a ton of breathing room.

I'm planning to put a little bit more logic about the daemons into the overseer, so the daemon can say "I'm going to exit now, restart me in 60 seconds", or "run a query every second and restart me once it finds results". I don't love this, because it's very important that overseers are stable, and putting more application logic into them gives us more opportunities to break things, but I think it's not a huge change and will buy us about 4x more room with no meaningful loss in responsiveness. The rules also have to be a little fancier than this, because we should restart the Pull daemon after a while, or if a new repository is created, or if a diffusion.looksoon call comes in. But I think this isn't totally unreasonable, and if we can get that 4x headroom, that probably means we're no longer bottlenecked on RAM and can go fix some other problem instead.
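
Roughly, those hints might look like this (illustrative only; this is not the actual overseer protocol, and the class and method names are invented):

```php
<?php

// Illustrative sketch of the restart hints a daemon could hand the overseer
// before exiting.
final class RestartHint {
  private function __construct(
    public ?int $restartAfterSeconds,
    public ?string $wakeQuery,
  ) {}

  // "I'm going to exit now, restart me in 60 seconds."
  public static function afterDelay(int $seconds): self {
    return new self($seconds, null);
  }

  // "Run a query every second and restart me once it finds results."
  public static function whenQueryMatches(string $query): self {
    return new self(null, $query);
  }
}

// The overseer would also restart the pull daemon unconditionally when a new
// repository is created or a diffusion.looksoon call comes in.
```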

We're currently at ~82% cluster fullness which makes me slightly uncomfortable, but I'm not going to add more hardware quite yet since I think I can probably get "scale to 0" into production next week. If it's more involved than I think I'll likely put shards 017-020 into production in the next week or so.

So I've read this a few times and I'm still a little unclear on what happens if the "activity requirement" isn't met. It sounds like we'll just disable them like we do non-paying instances - which I don't think users expect and feels like a crummy experience. Would it be possible to "hibernate/sleep" the instance instead (no daemons run; the user has to wait for them to spin up to log back in if they come back to the instance 90 days later)?

I'm also fine with a soft approach of "this seems inactive, if you're not using it click here to disable your instance". I'd think some percentage of people would self-disable if presented with the option.

I suppose we could get pretty far by changing the language from "Disabled" to "Inactive". That is, send a warning after n days, then after 3 warnings set it to "Inactive", but give them a means to contact us or click a button to re-activate it?

My thinking is that v1 would work like this (a rough sketch follows the list):

  • If no users load any page on an instance for 30 days, we send administrators an email saying "it looks like you aren't using your instance so we're going to clean it up. Have any user load any page sometime in the next 15 days if you want to keep it".
  • If anyone logs in or performs any other session activity (any user loading any page is good enough), the timer resets.
  • If no one logs in for 45 days, we suspend the instance (just like how nonpayment works), with a second message ("no one has touched this instance in 45 days so we're suspending it, shoot us an email if you want to keep it. All your stuff will be destroyed 90 days from now").
  • 90 days after suspension, the instance data is destroyed.
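
A rough sketch of those thresholds as a single decision function (the state names are invented for illustration):

```php
<?php

// Rough sketch of the v1 lifecycle thresholds above, as a function from
// "days since any user loaded any page" to an action.
function free_instance_action(int $days_idle): string {
  if ($days_idle >= 45 + 90) {
    return 'destroy'; // 90 days after suspension, instance data is destroyed
  }
  if ($days_idle >= 45) {
    return 'suspend'; // suspended via the same mechanism as nonpayment
  }
  if ($days_idle >= 30) {
    return 'warn';    // mail administrators; they have 15 days to keep it
  }
  return 'active';    // any page load by any user resets the timer
}
```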

We can build a "hibernation" mode which awakens on access, but I think it's a substantial amount of work. For example, users might "wake" the instance by running arc diff, which means a Conduit call to an instance needs to be able to restart daemons on a different machine, and there's a bunch of grey area like CI/email webhooks where a request will come in but it doesn't necessarily indicate instance activity. When an instance wakes, it may also need to do an arbitrarily large amount of work, possibly including sending tons of emails. And if instances are merely hibernating, we need to keep upgrading them with migrations every week, which means that they make pushes slower by a large margin.

We could let users click a button to unsuspend the instance instead of emailing us, but I would guess that almost no one will email us, so even a self-serve unsuspend isn't worth building, much less a hibernate feature.

If that proves untrue, and there are lots of users who don't load a single page for 45 days but want to keep their instances, we could build self-serve unsuspend. If we're still seeing tons of users not load anything for 45 days, then manually unsuspend, then complain that they needed to wait a few minutes for the unsuspend to finish, we could look at building a "hibernate" mode so the unsuspend was more transparent, but I think the effort to automatically unsuspend in a transparent way on any access is enormous compared to the potential value of this feature.

We could make this disable opt-in, but I would guess that most users will not opt-in to disabling an instance. There is zero reason for users to click a "take away my free stuff" button.

We could make the 90-day destruction window longer (365 days, or indefinite) -- once suspended, instances only cost us disk storage -- but I would guess that essentially zero instances will ever request an unsuspend after 90 days. That would mean they went 135 days without any user loading any page on the instance or noticing that the instance didn't even work for the last three months.

Picking a random shard, 80% of the active instances have zero sessions in the last 30 days -- so no user has loaded a single page in at least a month. Most (maybe all?) of these have only one account. These include (anonymized) free instances like "alsdknfasf", "test2", "test2017", "testinstance9", "testforme", "mytest", "myinstance", "bbbbbbb", "ddeemmoo". Of the 20% of active instances which do have activity in the last 30 days, many are recently-launched free instances with only one user session, and I'd guess many of those users are never coming back and these instances will join the boneyard shortly.

I think this is pretty expected, since this (launching a bunch of nonsense test instances in the guise of real instances, then never returning) is exactly what most users did before we had test instances, at roughly similar rates. And there's zero reason for users to launch a test instance instead of a free instance now, since free instances are strictly better, except that we try to push them toward launching a test instance in the workflow by giving them a modal "test" choice up front. Even with this workflow hinting, a reasonable number of users see the "test instance" vs "standard instance" prompt and still use "standard instance" to launch a free instance with a name like "test992992".

I'd say 180 day destruction window, and mostly that's just for us to feel good about deleting stuff.

We can also do other stuff like "never archive an instance with 2+ users", "after an instance emails us to unsuspend, have a flag to mark it as exempt from cleanup in the future", "don't destroy data for 365 days, instead of 90, if an instance had hosted repositories", or other sorts of things like that, but I suspect measures like that will be largely unnecessary for years to come and that "no one loaded a page for 45 days" is already a nearly perfect signal for "no one cares about this instance".

Well, I think it's reasonable to not log in for a while; I keep dormant projects on GH the same way. I don't want them to disappear, but I prefer to keep them in the cloud.

I like your other ideas!

It sounds like even with a very conservative policy, we're going to drop 80% of the instances.

epriestley added a revision: Restricted Differential Revision.Mar 23 2017, 2:35 PM
epriestley added a commit: Restricted Diffusion Commit.Mar 23 2017, 5:53 PM
epriestley added a revision: Restricted Differential Revision.Apr 6 2017, 10:40 PM
epriestley added a commit: Restricted Diffusion Commit.Apr 6 2017, 10:42 PM

We no longer offer free instances and I don't currently plan to offer them again, so this is moot.