diff --git a/src/docs/user/cluster/cluster_repositories.diviner b/src/docs/user/cluster/cluster_repositories.diviner index c5179666a7..eb9a4f4ede 100644 --- a/src/docs/user/cluster/cluster_repositories.diviner +++ b/src/docs/user/cluster/cluster_repositories.diviner @@ -1,112 +1,198 @@ @title Cluster: Repositories @group intro Configuring Phabricator to use multiple repository hosts. Overview ======== WARNING: This feature is a very early prototype; the features this document describes are mostly speculative fantasy. If you use Git or Mercurial, you can deploy Phabricator with multiple repository hosts, configured so that each host is readable and writable. The advantages of doing this are: - you can completely survive the loss of repository hosts; - reads and writes can scale across multiple machines; and - read and write performance across multiple geographic regions may improve. This configuration is complex, and many installs do not need to pursue it. -This configuration is not currently supported with Subversion. +This configuration is not currently supported with Subversion or Mercurial. Repository Hosts ================ Repository hosts must run a complete, fully configured copy of Phabricator, -including a webserver. If you make repositories available over SSH, they must -also run a properly configured `sshd`. +including a webserver. They must also run a properly configured `sshd`. Generally, these hosts will run the same set of services and configuration that web hosts run. If you prefer, you can overlay these services and put web and -repository services on the same hosts. +repository services on the same hosts. See @{article:Clustering Introduction} +for some guidance on overlaying services. When a user requests information about a repository that can only be satisfied by examining a repository working copy, the webserver receiving the request will make an HTTP service call to a repository server which hosts the repository to retrieve the data it needs. It will use the result of this query to respond to the user. How Reads and Writes Work ========================= Phabricator repository replicas are multi-master: every node is readable and writable, and a cluster of nodes can (almost always) survive the loss of any arbitrary subset of nodes so long as at least one node is still alive. Phabricator maintains an internal version for each repository, and increments it when the repository is mutated. Before responding to a read, replicas make sure their version of the repository is up to date (no node in the cluster has a newer version of the repository). If it isn't, they block the read until they can complete a fetch. Before responding to a write, replicas obtain a global lock, perform the same version check and fetch if necessary, then allow the write to continue. +Additionally, repositories passively check other nodes for updates and +replicate changes in the background. After you push a change to a repositroy, +it will usually spread passively to all other repository nodes within a few +minutes. + +Even if passive replication is slow, the active replication makes acknowledged +changes sequential to all observers: after a write is acknowledged, all +subsequent reads are guaranteed to see it. The system does not permit stale +reads, and you do not need to wait for a replication delay to see a consistent +view of the repository no matter which node you ask. + HTTP vs HTTPS ============= Intracluster requests (from the daemons to repository servers, or from webservers to repository servers) are permitted to use HTTP, even if you have set `security.require-https` in your configuration. It is common to terminate SSL at a load balancer and use plain HTTP beyond that, and the `security.require-https` feature is primarily focused on making client browser behavior more convenient for users, so it does not apply to intracluster traffic. Using HTTP within the cluster leaves you vulnerable to attackers who can observe traffic within a datacenter, or observe traffic between datacenters. This is normally very difficult, but within reach for state-level adversaries like the NSA. If you are concerned about these attackers, you can terminate HTTPS on repository hosts and bind to them with the "https" protocol. Just be aware that the `security.require-https` setting won't prevent you from making configuration mistakes, as it doesn't cover intracluster traffic. Other mitigations are possible, but securing a network against the NSA and similar agents of other rogue nations is beyond the scope of this document. +Monitoring Replication +====================== + +You can review the current status of a repository on cluster nodes in +{nav Diffusion > (Repository) > Manage Repository > Cluster Configuration}. + +This screen shows all the configured devices which are hosting the repository +and the available version. + +**Version**: When a repository is mutated by a push, Phabricator increases +an internal version number for the repository. This column shows which version +is on disk on the corresponding node. + +After a change is pushed, the node which received the change will have a larger +version number than the other nodes. The change should be passively replicated +to the remaining nodes after a brief period of time, although this can take +a while if the change was large or the network connection between nodes is +slow or unreliable. + +You can click the version number to see the corresponding push logs for that +change. The logs contain details about what was changed, and can help you +identify if replication is slow because a change is large or for some other +reason. + +**Writing**: This shows that the node is currently holding a write lock. This +normally means that it is actively receiving a push, but can also mean that +there was a write interruption. See "Write Interruptions" below for details. + + +Write Interruptions +=================== + +A repository cluster can be put into an inconsistent state by an interruption +in a brief window immediately after a write. + +Phabricator can not commit changes to a working copy (stored on disk) and to +the global state (stored in a database) atomically, so there is a narrow window +between committing these two different states when some tragedy (like a +lightning strike) can befall a server, leaving the global and local views of +the repository state divergent. + +In these cases, Phabricator fails into a "frozen" state where further writes +are not permitted until the failure is investigated and resolved. + +TODO: Complete the support tooling and provide recovery instructions. + + +Loss of Leaders +=============== + +A more straightforward failure condition is the loss of all servers in a +cluster which have the most up-to-date copy of a repository. This looks like +this: + + - There is a cluster setup with two nodes, X and Y. + - A new change is pushed to server X. + - Before the change can propagate to server Y, lightning strikes server X + and destroys it. + +Here, all of the "leader" nodes with the most up-to-date copy of the repository +have been lost. Phabricator will refuse to serve this repository because it +can not serve it consistently, and can not accept writes without data loss. + +The most straightforward way to resolve this issue is to restore any leader to +service. The change will be able to replicate to other nodes once a leader +comes back online. + +If you are unable to restore a leader or unsure that you can restore one +quickly, you can use the monitoring console to review which changes are +present on the leaders but not present on the followers by examining the +push logs. + +TODO: Complete the support tooling and provide recovery instructions. + + Backups ====== Even if you configure clustering, you should still consider retaining separate backup snapshots. Replicas protect you from data loss if you lose a host, but they do not let you rewind time to recover from data mutation mistakes. If something issues a `--force` push that destroys branch heads, the mutation will propagate to the replicas. You may be able to manually restore the branches by using tools like the Phabricator push log or the Git reflog so it is less important to retain repository snapshots than database snapshots, but it is still possible for data to be lost permanently, especially if you don't notice the problem for some time. Retaining separate backup snapshots will improve your ability to recover more data more easily in a wider range of disaster situations. Next Steps ========== Continue by: - returning to @{article:Clustering Introduction}.