Reasonable steps for inching toward full repository clustering today (May/June 2016).
First, configure a one-node cluster: one service, one device (the same box you currently host repositories on). The setup instructions should be fairly helpful about this. Clusterize one repository, test it. If things look good, clusterize everything else. (Note that as long as the service exists, new repositories will allocate on it.)
(If things don't look good, declusterize everything with bin/repository clusterize --remove-service, then use bin/remove destroy on the Almanac service to get back to normal.)
If your repositories are observed (hosted elsewhere) rather than hosted (hosted by Phabricator) and T4292 hasn't been closed yet, stop here.
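As a rough sketch, the first step and its rollback might be scripted like this. The repository name (R1) and service name (repos-one.example.com) are placeholders rather than names from any real install, and the argument order is an assumption, so check bin/repository help clusterize before relying on it. By default the helper only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only. REPO and SERVICE are placeholder names (assumptions).
REPO="${REPO:-R1}"
SERVICE="${SERVICE:-repos-one.example.com}"
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode, print the command; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Put a single repository onto the one-node service.
clusterize_one() {
  run bin/repository clusterize "$REPO" --service "$SERVICE"
}

# Rollback: pull the repository back out of the cluster.
# (Destroying the Almanac service with bin/remove destroy is the final
# rollback step; it is not scripted here.)
declusterize_one() {
  run bin/repository clusterize "$REPO" --remove-service
}
```

Calling clusterize_one with the defaults prints the command it would run; set DRY_RUN=0 from the Phabricator root directory to actually execute it.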
Create a second cluster service with two nodes: one is the same box as now, the other is a replica. On the service definition, set closed to true to prevent new repositories from allocating on it.
Move one repository by running bin/repository clusterize --remove-service to pull it out of the one-node service, then bin/repository clusterize --service <whatever> to put it into the two-node service. Kick the tires; if things look good, move more repositories.
(If things don't look good, just clusterize them back to the old service and use bin/repository thaw --demote second.node.company.net if necessary to unfreeze anything that got frozen, although I think this shouldn't be possible.)
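The move and its rollback could be wrapped up roughly as follows. All names here (R1, repos-one.example.com, repos-two.example.com) are placeholders, and the helper functions are hypothetical conveniences, not part of Phabricator. The thaw command is reproduced in the same form as above; it may need further arguments (for example, which repository to thaw), so consult bin/repository help thaw first.

```shell
#!/bin/sh
# Sketch only. All names below are placeholders (assumptions).
REPO="${REPO:-R1}"
OLD_SERVICE="${OLD_SERVICE:-repos-one.example.com}"
NEW_SERVICE="${NEW_SERVICE:-repos-two.example.com}"
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode, print the command; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Pull the repository out of its current service, then put it
# into the service given as $1.
move_repository() {
  run bin/repository clusterize "$REPO" --remove-service &&
    run bin/repository clusterize "$REPO" --service "$1"
}

# Unfreeze by demoting the device given as $1 (same form as in the
# text above; extra arguments may be required).
thaw_demote() {
  run bin/repository thaw --demote "$1"
}
```

With the defaults, move_repository "$NEW_SERVICE" prints the two commands for the forward move, and move_repository "$OLD_SERVICE" prints the rollback. Keeping the dry run as the default makes it easy to review the exact commands before touching a live repository.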
Once you're ready to start allocating new repositories on the two-node service, open it up again and set the one-node service to closed.
Once all repositories are on the two-node service, you're all set. You can expand the service normally if you want to add more nodes.
Throughout this process, the major limitations today are:
- Manage Repository → Storage is accurate (and likely to be helpful), but not editable. You can "edit" it with the bin/repository clusterize dance above.
- New repositories allocate on a random open Almanac Service, and you can't change this behavior today.
- If any service is configured, failing to find a service to allocate on is an error, and you can't change this behavior. You have to destroy the service to get back to non-cluster mode.
- Hosted-elsewhere repositories have a versioning/clock issue on multiple nodes until T4292 finishes.
But these helpful properties allow this transition to be relatively reasonable:
- It's fine for a single device to host both non-cluster and cluster repositories.
- It's fine for a single device to be in multiple services.
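The allocation limitations listed earlier can be illustrated with a toy model. This is a hypothetical sketch of the rules as described here, not Phabricator's actual implementation: each argument is a made-up "name:open" or "name:closed" pair standing in for an Almanac service.

```shell
#!/bin/sh
# Toy model of where a new repository lands (illustration only).
# Arguments: zero or more "name:open" / "name:closed" pairs.
allocate() {
  # No services configured at all: non-cluster mode.
  if [ "$#" -eq 0 ]; then
    echo "local"
    return 0
  fi

  # Collect the names of the open services.
  open=""
  count=0
  for svc in "$@"; do
    case "$svc" in
      *:open)
        open="$open ${svc%:open}"
        count=$((count + 1))
        ;;
    esac
  done

  # Services exist but none is open: this is an error, not a fallback.
  if [ "$count" -eq 0 ]; then
    echo "error: no open service to allocate on" >&2
    return 1
  fi

  # Pick one of the open services at random (1-based index).
  pick=$(awk -v n="$count" 'BEGIN { srand(); print int(rand() * n) + 1 }')
  printf '%s\n' $open | sed -n "${pick}p"
}
```

For example, allocate one-node:closed two-node:open always picks two-node, while allocate one-node:closed two-node:closed fails. Under this model, when switching allocation from the one-node service to the two-node service, it seems safest to open the two-node service before closing the one-node one, so there is never a moment with no open service.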