diff --git a/src/docs/user/cluster/cluster_repositories.diviner b/src/docs/user/cluster/cluster_repositories.diviner
--- a/src/docs/user/cluster/cluster_repositories.diviner
+++ b/src/docs/user/cluster/cluster_repositories.diviner
@@ -19,19 +19,19 @@

 This configuration is complex, and many installs do not need to pursue it.

-This configuration is not currently supported with Subversion.
+This configuration is not currently supported with Subversion or Mercurial.


 Repository Hosts
 ================

 Repository hosts must run a complete, fully configured copy of Phabricator,
-including a webserver. If you make repositories available over SSH, they must
-also run a properly configured `sshd`.
+including a webserver. They must also run a properly configured `sshd`.

 Generally, these hosts will run the same set of services and configuration that
 web hosts run. If you prefer, you can overlay these services and put web and
-repository services on the same hosts.
+repository services on the same hosts. See @{article:Clustering Introduction}
+for some guidance on overlaying services.

 When a user requests information about a repository that can only be satisfied
 by examining a repository working copy, the webserver receiving the request
@@ -57,6 +57,17 @@
 Before responding to a write, replicas obtain a global lock, perform the same
 version check and fetch if necessary, then allow the write to continue.

+Additionally, repositories passively check other nodes for updates and
+replicate changes in the background. After you push a change to a repository,
+it will usually spread passively to all other repository nodes within a few
+minutes.
+
+Even if passive replication is slow, the active replication makes acknowledged
+changes sequential to all observers: after a write is acknowledged, all
+subsequent reads are guaranteed to see it. The system does not permit stale
+reads, and you do not need to wait for a replication delay to see a consistent
+view of the repository no matter which node you ask.
+

 HTTP vs HTTPS
 =============
@@ -84,6 +95,81 @@
 similar agents of other rogue nations is beyond the scope of this document.

+
+Monitoring Replication
+======================
+
+You can review the current status of a repository on cluster nodes in
+{nav Diffusion > (Repository) > Manage Repository > Cluster Configuration}.
+
+This screen shows all the configured devices which are hosting the repository
+and the available version.
+
+**Version**: When a repository is mutated by a push, Phabricator increases
+an internal version number for the repository. This column shows which version
+is on disk on the corresponding node.
+
+After a change is pushed, the node which received the change will have a larger
+version number than the other nodes. The change should be passively replicated
+to the remaining nodes after a brief period of time, although this can take
+a while if the change was large or the network connection between nodes is
+slow or unreliable.
+
+You can click the version number to see the corresponding push logs for that
+change. The logs contain details about what was changed, and can help you
+identify if replication is slow because a change is large or for some other
+reason.
+
+**Writing**: This shows that the node is currently holding a write lock. This
+normally means that it is actively receiving a push, but can also mean that
+there was a write interruption. See "Write Interruptions" below for details.
+
+
+Write Interruptions
+===================
+
+A repository cluster can be put into an inconsistent state by an interruption
+in a brief window immediately after a write.
+
+Phabricator can not commit changes to a working copy (stored on disk) and to
+the global state (stored in a database) atomically, so there is a narrow window
+between committing these two different states when some tragedy (like a
+lightning strike) can befall a server, leaving the global and local views of
+the repository state divergent.
+
+In these cases, Phabricator fails into a "frozen" state where further writes
+are not permitted until the failure is investigated and resolved.
+
+TODO: Complete the support tooling and provide recovery instructions.
+
+
+Loss of Leaders
+===============
+
+A more straightforward failure condition is the loss of all servers in a
+cluster which have the most up-to-date copy of a repository. This looks like
+this:
+
+  - There is a cluster setup with two nodes, X and Y.
+  - A new change is pushed to server X.
+  - Before the change can propagate to server Y, lightning strikes server X
+    and destroys it.
+
+Here, all of the "leader" nodes with the most up-to-date copy of the repository
+have been lost. Phabricator will refuse to serve this repository because it
+can not serve it consistently, and can not accept writes without data loss.
+
+The most straightforward way to resolve this issue is to restore any leader to
+service. The change will be able to replicate to other nodes once a leader
+comes back online.
+
+If you are unable to restore a leader or unsure that you can restore one
+quickly, you can use the monitoring console to review which changes are
+present on the leaders but not present on the followers by examining the
+push logs.
+
+TODO: Complete the support tooling and provide recovery instructions.
+
+
 Backups
 ======