Differential D15766 Diff 37999 src/docs/user/cluster/cluster_repositories.diviner

	(117 lines not shown)
	change. The logs contain details about what was changed, and can help you			change. The logs contain details about what was changed, and can help you
	identify if replication is slow because a change is large or for some other			identify if replication is slow because a change is large or for some other
	reason.			reason.

	Writing: This shows that the node is currently holding a write lock. This			Writing: This shows that the node is currently holding a write lock. This
	normally means that it is actively receiving a push, but can also mean that			normally means that it is actively receiving a push, but can also mean that
	there was a write interruption. See "Write Interruptions" below for details.			there was a write interruption. See "Write Interruptions" below for details.

				Last Writer: This column identifies the user who most recently pushed a
				change to this device. If the write lock is currently held, this user is
				the user whose change is holding the lock.

				Last Write At: When the most recent write started. If the write lock is
				currently held, this shows when the lock was acquired.


	Write Interruptions			Write Interruptions
	===================			===================

	A repository cluster can be put into an inconsistent state by an interruption			A repository cluster can be put into an inconsistent state by an interruption
	in a brief window immediately after a write.			in a brief window during and immediately after a write.

	Phabricator can not commit changes to a working copy (stored on disk) and to			Phabricator can not commit changes to a working copy (stored on disk) and to
	the global state (stored in a database) atomically, so there is a narrow window			the global state (stored in a database) atomically, so there is a narrow window
	between committing these two different states when some tragedy (like a			between committing these two different states when some tragedy (like a
	lightning strike) can befall a server, leaving the global and local views of			lightning strike) can befall a server, leaving the global and local views of
	the repository state divergent.			the repository state possibly divergent.

	In these cases, Phabricator fails into a "frozen" state where further writes			In these cases, Phabricator fails into a frozen state where further writes
	are not permitted until the failure is investigated and resolved.			are not permitted until the failure is investigated and resolved.

	TODO: Complete the support tooling and provide recovery instructions.			You can use the monitoring console to review the state of a frozen repository
				with a held write lock. The Writing column will show which node is holding
				the lock, and whoever is named in the Last Writer column may be able to
				help you figure out what happened by providing more information about what they
				were doing and what they observed.

				Because the push was not acknowledged, it is normally safe to demote the node:
				the user should have received an error anyway, and should not expect their push
				to have worked. However, data is technically at risk and you may want to
				investigate further and try to understand the issue in more detail before
				continuing.

				There is no way to explicitly keep the write, but if it was committed to disk
				you can recover it manually from the working copy on the device and then push
				it again.

				If you demote the node, the in-process write will be thrown away, even if it
				was complete on disk. To demote the node and release the write lock, run this
				command:

				```
				phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
				```

				{icon exclamation-triangle, color="yellow"} Any committed but unacknowledged
				data on the device will be lost.


	Loss of Leaders			Loss of Leaders
	===============			===============

	A more straightforward failure condition is the loss of all servers in a			A more straightforward failure condition is the loss of all servers in a
	cluster which have the most up-to-date copy of a repository. This looks like			cluster which have the most up-to-date copy of a repository. This looks like
	this:			this:
	(11 lines not shown)
	service. The change will be able to replicate to other nodes once a leader			service. The change will be able to replicate to other nodes once a leader
	comes back online.			comes back online.

	If you are unable to restore a leader or unsure that you can restore one			If you are unable to restore a leader or unsure that you can restore one
	quickly, you can use the monitoring console to review which changes are			quickly, you can use the monitoring console to review which changes are
	present on the leaders but not present on the followers by examining the			present on the leaders but not present on the followers by examining the
	push logs.			push logs.

	TODO: Complete the support tooling and provide recovery instructions.			If you are comfortable discarding these changes, you can instruct Phabricator
				that it can forget about the leaders in two ways: disable the service bindings
				to all of the leader nodes so they are no longer part of the cluster, or
				use `bin/repository thaw` to `--demote` the leaders explicitly.

				If you do this, you will lose data. Either action will discard any changes
				on the affected leaders which have not replicated to other nodes in the cluster.

				To demote a device, run this command:

				```
				phabricator/ $ ./bin/repository thaw rXYZ --demote repo002.corp.net
				```

				{icon exclamation-triangle, color="red"} Any data which is only present on
				this device will be lost.


				Ambiguous Leaders
				=================

				Repository clusters can also freeze if the leader nodes are ambiguous. This
				can happen if you replace an entire cluster with new devices suddenly, or
				make a mistake with the `--demote` flag.

				When Phabricator can not tell which node in a cluster is a leader, it freezes
				the cluster because it is possible that some nodes have less data and others
				have more, and if it chooses a leader arbitrarily it may destroy some data
				which you would prefer to retain.

				To resolve this, you need to tell Phabricator which node has the most
				up-to-date data and promote that node to become a leader. If you do this,
				you may lose data if you promote the wrong node, and some other node
				really had more up-to-date data. If you want to double check, you can examine
				the working copies on disk before promoting, by connecting to the machines and
				using commands like `git log` to inspect state.
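
One way to make that comparison, as a rough sketch: the hypothetical helper `repo_summary` below (not part of Phabricator) prints the commit count and the newest commit for a working copy, using only standard `git` commands. The path in the usage comment is an assumed placeholder. A higher commit count or a newer commit is only a hint about which node is ahead, not proof.

```shell
#!/bin/sh
# Hypothetical helper (not a Phabricator tool): summarize one working copy
# so that summaries from several nodes can be compared by hand.
repo_summary() {
  repo_path="$1"
  # Number of commits reachable from any ref in this working copy.
  count=$(git -C "$repo_path" rev-list --all --count)
  # First commit in git's default (newest-first) ordering across all refs:
  # hash, committer date (ISO 8601), subject.
  newest=$(git -C "$repo_path" log --all -n 1 --format='%H %cI %s')
  printf '%s commits; newest: %s\n' "$count" "$newest"
}

# Example (assumed path; substitute the real working copy location):
#   repo_summary /path/to/working/copy
```

Run it on each candidate device and compare the lines. If the summaries differ, look more closely with `git log` before promoting anything.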

				Once you have identified a node which has data you're happy with, use
				`bin/repository thaw` to `--promote` the device:

				```
				phabricator/ $ ./bin/repository thaw rXYZ --promote repo002.corp.net
				```

				{icon exclamation-triangle, color="red"} Any data which is only present on
				other devices will be lost.


	Backups			Backups
	======			======

	Even if you configure clustering, you should still consider retaining separate			Even if you configure clustering, you should still consider retaining separate
	backup snapshots. Replicas protect you from data loss if you lose a host, but			backup snapshots. Replicas protect you from data loss if you lose a host, but
	they do not let you rewind time to recover from data mutation mistakes.			they do not let you rewind time to recover from data mutation mistakes.
	(20 lines not shown)