The Phacility cluster has grown since launch, and hardware growth (partly driven by free instances, discussed in T12217) has accelerated recently. Currently, the production cluster is ~50 total devices. T12217 discusses some ways in which we can slow hardware growth, but even if those are fully implemented, the cluster isn't getting any smaller.
T7346 is an old task (from around launch) discussing general scalability concerns. These are currently the three most urgent concerns:
Operational Tools: We don't have a hypershell, and all cluster interaction is 1:1. Deploying the cluster involves opening a whole pile of console windows and manually entering 100+ commands to deploy and verify everything. This is error-prone, and will only become more so as the cluster continues to grow.
Beyond all the typing being error-prone, there's no centralized deployment logging or general encapsulation of an operational session: there's no way to go back and review how a deployment went, which commands were run, which errors occurred, and so on.
The ideal fix is Phage ("Hypershell 2"), which we know to be a flexible tool at large scale, but it isn't cheap to build. It would give us hypershell capabilities as part of the core, though. Phage/Hypershell are also very cool.
A less-ideal fix is a bin/remote deploy-everything command which at least fixes all the typing. This is cheaper to build.
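As a rough sketch of what the cheaper fix might look like (the hostnames, commands, and structure here are all hypothetical, and a real version would run each command over SSH rather than locally): run the same commands on every device in parallel, and capture everything into a single reviewable session log:

```python
import concurrent.futures
import datetime
import subprocess


def run_on_host(host, command):
    # In production this would be roughly `ssh <host> <command>`; here we
    # shell out locally so the sketch is self-contained and runnable.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return {
        "host": host,
        "command": command,
        "exit_code": result.returncode,
        "output": result.stdout.strip(),
    }


def deploy_everything(hosts, commands):
    """Run every command on every host in parallel; return one session log."""
    session = {
        "started": datetime.datetime.utcnow().isoformat(),
        "results": [],
    }
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        futures = [
            pool.submit(run_on_host, host, command)
            for host in hosts
            for command in commands
        ]
        for future in concurrent.futures.as_completed(futures):
            session["results"].append(future.result())
    # Surface failures so a deployment can be audited after the fact.
    session["errors"] = [r for r in session["results"] if r["exit_code"] != 0]
    return session
```

Even this crude shape would replace the 100+ manually typed commands with one invocation and leave behind a record of what ran and what failed.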
Monitoring: Monitoring is currently very limited, and some monitoring signals (notably, free disk space and deployed versions) are on-demand only. There is no screen where you can see, at a glance, which disks across the entire cluster are near capacity, or whether any hosts are running the wrong software versions.
Some other things, like the instance allocator, could take advantage of this if it were centralized. The allocator currently considers shards "full" based only on how many active instances they have. It would be more powerful if it could also consider other signals, like drive fullness.
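A sketch of how shard selection might weigh multiple signals instead of only the active instance count (the field names and thresholds here are invented for illustration, not the allocator's actual data model):

```python
def pick_shard(shards, max_instances=100, max_disk_fraction=0.8):
    """Pick the least-loaded shard that isn't "full" on any signal.

    Each shard is a dict with hypothetical keys: "name",
    "active_instances", and "disk_fraction" (fraction of disk in use).
    Returns None if every shard is full on some signal.
    """
    candidates = [
        s for s in shards
        if s["active_instances"] < max_instances
        and s["disk_fraction"] < max_disk_fraction
    ]
    if not candidates:
        return None

    def load(shard):
        # Normalize each signal against its limit, then treat the worst
        # signal as the shard's effective load.
        return max(
            shard["active_instances"] / max_instances,
            shard["disk_fraction"] / max_disk_fraction,
        )

    return min(candidates, key=load)
```

The point is just that once signals like disk fullness flow into a central store, "full" can mean "full on any axis" rather than "has many instances".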
I'd ideally like to build something like Reticle integrated with Almanac and Facts: allow devices and services to have information pushed or pulled, then chart it and expose it to other consumers. This would give us some monitoring support as part of the core. I'm not sure anyone but us would use it, but integrated monitoring is so operationally useful that I think this should probably be first-party even if it sees no other use.
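To make the push/pull idea concrete, here's a toy sketch of the central store half (class and method names are invented; a real version would persist samples via Facts and attach them to Almanac devices/services):

```python
import time
from collections import defaultdict


class FactStore:
    """Toy central store: devices push (fact, value) samples, and
    consumers (charts, the allocator) read them back out."""

    def __init__(self):
        # (device, fact) -> list of (timestamp, value) samples
        self._samples = defaultdict(list)

    def push(self, device, fact, value, ts=None):
        """A device pushes one sample, e.g. push("db001", "disk.used", 0.95)."""
        self._samples[(device, fact)].append((ts or time.time(), value))

    def latest(self, device, fact):
        """Most recent value of a fact for one device, or None."""
        samples = self._samples.get((device, fact))
        return samples[-1][1] if samples else None

    def over_threshold(self, fact, threshold):
        """All devices whose latest value exceeds a threshold -- the
        'which disks are near capacity, at a glance' query."""
        return sorted(
            device
            for (device, f), samples in self._samples.items()
            if f == fact and samples[-1][1] > threshold
        )
```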
Disk Operations / Shard Migrations: Increasing the size of disks is manual and requires significant, uncommunicated downtime. This will be more of a problem as we start packing more instances per host.
Moving instances between shards is manual and requires significant, uncommunicated downtime.
At a minimum, these operations should become automated and communicated. Ideally, they should also become faster and less disruptive, although they aren't terribly common and it's not the end of the world if they don't get much better than they currently are. By design, no individual datastore requires an enormously long time to copy.
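The automated-and-communicated version of a shard migration might look something like the following sketch. The step names and the `notify`/`run_step` hooks are entirely hypothetical; the shape is just "announce the downtime window, run a fixed sequence of steps, announce completion":

```python
def migrate_instance(instance, source, target, notify, run_step):
    """Sketch of an automated, communicated shard migration.

    `notify(message)` posts a status update (e.g. to a status page) and
    `run_step(name)` performs one migration step; both are hypothetical
    hooks supplied by the caller. Returns the ordered steps that ran.
    """
    steps = [
        "suspend-writes",     # begin the downtime window
        "dump-datastores",    # export databases and repositories
        "copy-to-target",     # bounded by design: no single huge datastore
        "restore-on-target",
        "verify-checksums",   # confirm the copy before cutting over
        "update-routing",     # point the instance at the new shard
        "resume-writes",      # end the downtime window
    ]
    notify(f"{instance}: maintenance starting ({source} -> {target})")
    completed = []
    for step in steps:
        run_step(step)
        completed.append(step)
    notify(f"{instance}: maintenance complete")
    return completed
```

Even without making the copy itself faster, driving the sequence from one tool means the downtime is announced, logged, and repeatable instead of ad-hoc.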