Change Details

Task documenting possible scaling challenges in the cluster.

Overview
======

| Concern | Severity | Urgency | Task | Notes |
|---|---|---|---|---|
| Disk Space | Medium | Very Soon | T7351 | Adding storage requires downtime; poor monitoring tools. |
| Daemons | High | Very Soon | T7352 | Daemon process count of low-activity instances may not be scalable. |
| Monitoring | Unknown | Very Soon | T7338 | Limited monitoring of (and alerting for) cluster health. |
| Push Durations | Medium | Soon | | Pushes stop the world and have costs proportional to instance count. |
| Shard Spreading | Medium | Soon | | Moving instances off hot shards is very manual. |
| Vault Tier | Medium | Future | | Scaling SSH LBs is easy but very manual. |
| Admin Services | High | Future | | Requires some work to scale. |
| Notification Tier | Very Low | Far Future | | Tier will scale easily but involves manual steps. |
| Multiple Repo Services | Low | Far Future | T2783 | Requires some work. |
| Repo Clusters | Low | Far Future | T4292 | Requires substantial work. |
| Database Clusters | Very Low | Far Future | | Requires substantial work. |
| Notification Clusters | Very Low | Far Future | T6915 | Requires some work. |

Disk Space
=====

Storage is cheap, but not so cheap that we can run everything on gigantically huge disks and ignore it. In particular, attaching a 1TB EBS to a c3.large instance results in approximately equal compute and storage costs for the host. If we just over-allocated we could easily end up with our hard costs dominated by storage, so it's desirable to match allocated storage at least roughly to actual storage needs.

The severity of running out of storage is very high (instance malfunction, and possibly data loss). We have very limited tools for monitoring storage use.

Demand for storage can be bursty: most storage is consumed when instances import repositories, so demand can jump suddenly if an instance with a large amount of repository data signs up.

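As a starting point for the storage monitoring gap described above, a free-space probe could look roughly like the sketch below. This is only an illustration: the names `DiskStatus`, `classify`, and `check_disk` and the 20% threshold are assumptions made for the example, not existing tooling, and a real check would still need to report into whatever alerting we end up with.

```python
import shutil
from typing import NamedTuple


class DiskStatus(NamedTuple):
    """Snapshot of free space for one mount point (hypothetical structure)."""
    path: str
    total_gb: float
    free_gb: float
    free_fraction: float
    low: bool


def classify(path, total_bytes, free_bytes, min_free_fraction=0.2):
    """Flag a volume as low when its free fraction drops below the threshold.

    The 0.2 default is an arbitrary assumption for illustration.
    """
    free_fraction = free_bytes / total_bytes if total_bytes else 0.0
    return DiskStatus(
        path=path,
        total_gb=total_bytes / 1e9,
        free_gb=free_bytes / 1e9,
        free_fraction=free_fraction,
        low=free_fraction < min_free_fraction,
    )


def check_disk(path, min_free_fraction=0.2):
    """Read actual usage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return classify(path, usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    status = check_disk(".")
    state = "LOW" if status.low else "ok"
    print(f"{status.path}: {status.free_gb:.1f} GB free of "
          f"{status.total_gb:.1f} GB ({state})")
```

Keeping `classify` separate from the filesystem call means the same threshold logic could be reused with numbers pulled from another source, for example lower-level AWS metrics.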

To increase the size of a disk, we need to detach it, copy it, and then reattach the new disk (there's no support for just setting the disk to a new larger size). This implies downtime for the associated device. Until we build repo and database clusters, this will also imply downtime for the associated service if the disk is an `rdata` or `ddata` class (we can swap backups without meaningfully interrupting service, and will not need to swap `adata` for many years). This process is currently completely manual. We have limited tooling to communicate these kinds of service interruptions.

This install has approximately 100MB of repositories and 2GB of data, which we accumulated pretty gradually over the course of several years. If most instances are roughly like this, the current storage scheme won't be problematic for a long time and we'll easily be able to predict storage needs, schedule maintenance, and stay ahead of this. I suspect many instances will be nothing like this, though: they may have huge amounts of repository data they want to import, and all sorts of use cases where they upload many, many gigs of things to Phabricator.

We also currently retain backups indefinitely, which is obviously unscalable.

Roughly:

- We need finite backup retention and automatic cycling of old backups.
- We need to improve monitoring. AWS gives us a bunch of lower-level monitoring, but we'll need to do at least some work to get disk space, and at least some work to bring it into Phabricator. I don't have a clear plan for how to best do this in a way that moves us directly toward where we want to be in the long run.
- We need an automated storage replacement workflow which can unmount, copy, and swap the drives.
- We need better feedback mechanisms so tools can mark services down and the web UI can reflect the down state.
- We could mitigate burstiness with advice ("please get in touch with us if you have more than X users"), tooling ("This repository service device has X GB of free space" when importing a new repository), or quotas (but I'd rather not deal with hard quotas).

Daemons
========

Each instance runs 6 + (2 * taskmasters) daemon processes all the time (currently 14), so 100 instances need more than 1,000 daemon processes. This is fine if these are large instances spread across a lot of boxes, but if we have a lot of small, inactive instances we may end up needing to allocate repo hosts just to have enough resources to leave daemons sleeping all the time (presumably, we'll run out of RAM first -- each process is about 30MB, so 1K processes need 30GB to stay out of swap).

The current architecture also means that we have to run a fixed number of taskmasters per instance. It would be nice to have a maximum count instead (maybe proportional to number of users on the instance) and be able to scale up when the queue is busy and down when it isn't.

Probably:

- Merge the trigger and GC daemons. There's no reason these need to be separate daemons, and the GC stuff is very similar to the trigger stuff.
- Rewrite the overseer to manage all the daemons, not just one. The big barrier here is that IPC between the overseer and daemons is currently very primitive and will need to be made more sophisticated.
- Let the overseer autoscale taskmasters.

That would give us 4 processes per inactive instance (overseer, pull, trigger, taskmaster) instead of 14. I don't see an easy way to further reduce this, but this is more in line with the other levels of resource usage I anticipate from average instances.

Monitoring
========

We have limited monitoring and alerting for cluster health.

Until the world is more stable and the cluster is larger I suspect random hardware failure will have very little impact on service availability, but at some point things will be automated and stable and surprising failures will account for more problems than they do today.

We have some problems in the short term (notably, disk space) that we do need monitoring for sooner.

Push Durations
===========

Pushing the database and repository tiers has a high, linear cost per instance, as we must upgrade each database and restart daemons for each instance. If we have a large number of instances, these costs could get out of control.

This is only really a problem because we have to stop the world to do upgrades. If migrations always had some commit which was safe to run against both the pre-migration and post-migration schema (in the complex cases, this would imply double writes) we could just bring instances down briefly during upgrades.

Shard Spreading
============

We don't have tooling for moving large instances off hot shards onto cold shards. This can be done manually, but would be slow/error-prone and interrupt service.

Vault Tier
=======

The list of `web` class machines for the `vault` hosts to use as backends is not generated automatically. It should be.

`vault` hosts need to be load balanced with DNS, which is more complicated and less powerful than ELB load balancing. I don't anticipate this being a problem for a very long time, though.

Admin Services
============

Some work is required to scale the `admin` device class across multiple hosts. Currently, load on this host per instance is very low, and we likely have years of headroom, but this may change if we start pushing a bunch of metrics to `admin`.

Notification Tier
============

This is very easy to scale, but we need a bit of work to bucket instances and send them to different NLBs.
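
One simple way to do the bucketing mentioned above would be a stable hash of the instance name. The sketch below is only an assumption about how that could work: `bucket_for_instance` and the balancer names are hypothetical, and nothing like this exists in the cluster tooling today.

```python
import hashlib


def bucket_for_instance(instance_name, balancers):
    """Pick a balancer for an instance using a stable hash of its name.

    `balancers` is an ordered list of balancer identifiers (hypothetical
    names here). The same instance name maps to the same balancer as long
    as the list does not change.
    """
    if not balancers:
        raise ValueError("at least one balancer is required")
    # sha256 is used instead of the built-in hash() because the built-in
    # string hash is randomized per process and would not be stable.
    digest = hashlib.sha256(instance_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(balancers)
    return balancers[index]


if __name__ == "__main__":
    nlbs = ["nlb001", "nlb002", "nlb003"]  # illustrative names only
    for name in ["example-one", "example-two", "example-three"]:
        print(name, "->", bucket_for_instance(name, nlbs))
```

The tradeoff with plain modulo bucketing is that adding or removing a balancer remaps most instances. If that turns out to matter, an explicit instance-to-balancer mapping or a consistent hashing scheme would move fewer instances.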