Clusters · Infrastructure
Active · Public

Details

Description

Running Phabricator services on multiple hosts.

Recent Activity

Jun 1 2021

epriestley added a revision to T13614: Provide a write-free, non-locking maintenance window for repositories: D21671: Provide an ad-hoc maintenance lock for clustered repositories.
Jun 1 2021, 3:17 PM · Clusters, Diffusion
epriestley removed a watcher for Clusters: Losy.
Jun 1 2021, 1:59 PM
epriestley added a comment to T13614: Provide a write-free, non-locking maintenance window for repositories.

Since observed repositories version differently today, this strategy won't work -- but I can't come up with any valid reason to ever put a repository into a "write maintenance" mode anyway. I do imagine making observed repositories "replay" fetches into the push log (as though they were pushes) in the future, but that still won't make "write maintenance" on an observed repository meaningful, so it seems fine to just prevent putting non-hosted repositories into this mode.

Jun 1 2021, 1:58 PM · Clusters, Diffusion
epriestley added a revision to T13614: Provide a write-free, non-locking maintenance window for repositories: D21670: Allow maintenance scripts to write synthetic events to the push log that act as repository updates.
Jun 1 2021, 1:50 PM · Clusters, Diffusion
epriestley added a revision to T13614: Provide a write-free, non-locking maintenance window for repositories: D21669: Improve display behavior for write locks held by omnipotent users.
Jun 1 2021, 1:13 PM · Clusters, Diffusion
epriestley added a comment to T13614: Provide a write-free, non-locking maintenance window for repositories.

A minor issue on the way to this is that calling synchronizeWorkingCopyBeforeWrite() with an omnipotent viewer will write to the WorkingCopyVersion table with a null userPHID, which shows as "Unknown Object" in the UI.

Jun 1 2021, 1:08 PM · Clusters, Diffusion
epriestley added a comment to T13614: Provide a write-free, non-locking maintenance window for repositories.

A useful maintenance operation for staging area repositories is to remove out-of-date staging refs: old diffs which have already landed. This is of some particular importance for large installs, since Git has a significant per-ref overhead for many operations until protocol v2: by the time a repository has ~50K refs, interacting with it in basically any way has become slow and cumbersome.

Jun 1 2021, 12:41 PM · Clusters, Diffusion

Mar 16 2021

epriestley moved T13287: Build general healthcheck infrastructure for monitoring services from Backlog to Health / Statistics on the Almanac board.
Mar 16 2021, 5:44 PM · Clusters, Almanac

Mar 11 2021

epriestley moved T12965: When no "master" database is configured, the ElasticSearch setup check can fatal from Backlog to External Search on the Search board.
Mar 11 2021, 5:49 PM · Database, Clusters, Search

Feb 25 2021

epriestley closed T13611: In clusters, "writable" property on bindings may not actually prevent writes as Resolved.
  • This was originally implemented in D19357.
  • This was broken by a refactoring change in D20775.
  • I made the problem observable in D21575 and fixed it in D21576.
Feb 25 2021, 8:31 PM · Clusters, Diffusion
epriestley added a revision to T13611: In clusters, "writable" property on bindings may not actually prevent writes: D21576: Correct behavior of "writable" Almanac service binding for repository services.
Feb 25 2021, 8:15 PM · Clusters, Diffusion
epriestley added a revision to T13611: In clusters, "writable" property on bindings may not actually prevent writes: D21575: Add an internal service ref panel to repository "Storage" information.
Feb 25 2021, 8:06 PM · Clusters, Diffusion

Feb 24 2021

epriestley updated the task description for T13614: Provide a write-free, non-locking maintenance window for repositories.
Feb 24 2021, 10:10 PM · Clusters, Diffusion

Feb 19 2021

epriestley triaged T13614: Provide a write-free, non-locking maintenance window for repositories as Normal priority.
Feb 19 2021, 4:27 PM · Clusters, Diffusion

Feb 18 2021

epriestley triaged T13611: In clusters, "writable" property on bindings may not actually prevent writes as Low priority.
Feb 18 2021, 11:54 PM · Clusters, Diffusion

Sep 3 2019

epriestley added a comment to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover.

Also remaining is to extend this behavior to the HTTP pathway (and to Mercurial/SVN, eventually).

Sep 3 2019, 7:35 PM · Clusters, Diffusion
epriestley added a revision to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover: D20778: Generalize repository proxy retry logic to writes.
Sep 3 2019, 6:37 PM · Clusters, Diffusion
epriestley added a comment to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover.
  • if we have already retried 3 times, do not retry;
Sep 3 2019, 6:07 PM · Clusters, Diffusion
epriestley added a revision to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover: D20777: Instead of retrying safe reads 3 times, retry each eligible service once.
Sep 3 2019, 5:41 PM · Clusters, Diffusion
epriestley added a revision to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover: D20776: On Git cluster read failure, retry safe requests.
Sep 3 2019, 4:50 PM · Clusters, Diffusion
epriestley added a comment to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover.

we'll reduce silly client-visible behavior where you request /tourtle.git instead of /turtle.git and the server seems confused...

Sep 3 2019, 4:32 PM · Clusters, Diffusion
epriestley added a revision to T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover: D20775: Allow repository service lookups to return an ordered list of service refs.
Sep 3 2019, 3:58 PM · Clusters, Diffusion
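
The retry behavior named in D20775 through D20777 above (look up an ordered list of service refs, then retry each eligible service once instead of retrying three times) can be sketched roughly as follows. This is a minimal illustration in Python, not Phabricator's actual PHP implementation; `ReadFailed`, `read_with_failover`, and the `attempt` callback are hypothetical names invented for the example.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class ReadFailed(Exception):
    """Hypothetical error raised when a read against one node fails."""


def read_with_failover(refs: Iterable[str], attempt: Callable[[str], T]) -> T:
    """Try each service ref at most once, in order; return the first success."""
    last_error: Optional[Exception] = None
    for ref in refs:
        try:
            return attempt(ref)
        except ReadFailed as error:
            # This node failed: move on to the next ref rather than
            # retrying the same node again.
            last_error = error
    if last_error is None:
        # The ordered list was empty, so nothing was attempted.
        raise ReadFailed("no eligible service refs")
    raise last_error
```

Under this sketch, a request only fails once every eligible ref has been tried, and a single unhealthy node costs at most one failed attempt per request.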

Aug 29 2019

epriestley added a comment to T10127: Migrating repository between storage hosts in a cluster.

Not necessarily applicable in the general case, but see also T13393.

Aug 29 2019, 3:09 PM · Clusters, Feature Request

Jul 12 2019

epriestley closed T10127: Migrating repository between storage hosts in a cluster as Resolved.

I assume this is being done already in the Phacility cluster on some level when repositories get really large, but I'm not particularly sure how to perform this migration.

Jul 12 2019, 5:11 PM · Clusters, Feature Request

May 10 2019

epriestley triaged T13287: Build general healthcheck infrastructure for monitoring services as Low priority.
May 10 2019, 5:45 PM · Clusters, Almanac
epriestley added a subtask for T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover: T13287: Build general healthcheck infrastructure for monitoring services.
May 10 2019, 5:45 PM · Clusters, Diffusion
epriestley added parent tasks for T13287: Build general healthcheck infrastructure for monitoring services: T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover, T13285: Service failures in JIRA can cascade into service failures in Phabricator.
May 10 2019, 5:45 PM · Clusters, Almanac
epriestley created T13287: Build general healthcheck infrastructure for monitoring services.
May 10 2019, 5:45 PM · Clusters, Almanac
epriestley triaged T13286: When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover as Normal priority.
May 10 2019, 5:24 PM · Clusters, Diffusion

Apr 15 2019

epriestley moved T13211: Improve intracluster synchronization routing from Backlog to Clusters on the Diffusion board.
Apr 15 2019, 2:45 PM · Clusters, Diffusion

Feb 1 2019

epriestley closed T13192: Inactive repositories can cause "Repository Servers" to always report "Partial Sync?" as Resolved by committing rPf3e154eb02c7: Allow "inactive" repositories to be read over SSH for cluster sync.
Feb 1 2019, 6:12 AM · Clusters, Diffusion

Jan 31 2019

epriestley added a comment to T13192: Inactive repositories can cause "Repository Servers" to always report "Partial Sync?".

See PHI1015 for a slightly meatier explanation of this issue.

Jan 31 2019, 7:47 PM · Clusters, Diffusion
epriestley added a revision to T13192: Inactive repositories can cause "Repository Servers" to always report "Partial Sync?": D20077: Allow "inactive" repositories to be read over SSH for cluster sync.
Jan 31 2019, 7:47 PM · Clusters, Diffusion

Dec 13 2018

epriestley added a comment to T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection.

Yeah, that's T10769.

Dec 13 2018, 11:41 PM · Clusters, Infrastructure
joshuaspence added a comment to T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection.

I am seeing a similar issue on our install:

[2018-12-13 12:20:12] EXCEPTION: (PhabricatorClusterImproperWriteException) Unable to establish a write-mode connection (to application database "phabricator_repository") because Phabricator is in read-only mode. Whatever you are trying to do does not function correctly in read-only mode. at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:119] arcanist(head=stable, ref.master=d9a4293ae734, ref.stable=45a8d22c74a6), phabricator(head=stable, ref.master=2951694c2737, ref.stable=237a2a190984), phlab(head=master, ref.master=564c60d09ff4), phutil(head=stable, ref.master=dd136d1c3712, ref.stable=414a4c6abb1b)
  #0 PhabricatorLiskDAO::raiseImproperWrite(string) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:60]
  #1 PhabricatorLiskDAO::establishLiveConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1011]
  #2 LiskDAO::establishConnection(string) called at [<phabricator>/src/applications/repository/stora... (619 more bytes) ... at [<phutil>/src/future/exec/ExecFuture.php:380]
[13-Dec-2018 12:20:12 Etc/UTC] arcanist(head=stable, ref.master=d9a4293ae734, ref.stable=45a8d22c74a6), phabricator(head=stable, ref.master=2951694c2737, ref.stable=237a2a190984), phlab(head=master, ref.master=564c60d09ff4), phutil(head=stable, ref.master=dd136d1c3712, ref.stable=414a4c6abb1b)
[13-Dec-2018 12:20:12 Etc/UTC]   #0 <#3> ExecFuture::resolvex() called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:446]
[13-Dec-2018 12:20:12 Etc/UTC]   #1 phlog(PhutilProxyException) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:453]
[13-Dec-2018 12:20:12 Etc/UTC]   #2 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(PhabricatorRepository, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:222]
[13-Dec-2018 12:20:12 Etc/UTC]   #3 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:219]
[13-Dec-2018 12:20:12 Etc/UTC]   #4 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:131]
Dec 13 2018, 11:40 PM · Clusters, Infrastructure
joshuaspence added a comment to T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection.

I am seeing a similar issue on our install:

Dec 13 2018, 11:38 PM · Clusters, Infrastructure

Nov 21 2018

epriestley updated the task description for T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection.
Nov 21 2018, 4:52 PM · Clusters, Infrastructure
epriestley added a parent task for T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection: T11908: Support an "overlay" database connection mode where multiple applications share a single connection.
Nov 21 2018, 4:18 PM · Clusters, Infrastructure
epriestley triaged T13219: When returning a writable connection as a "r" connection, label it so it can be reused as a "w" connection as Low priority.
Nov 21 2018, 4:18 PM · Clusters, Infrastructure

Oct 8 2018

epriestley triaged T13211: Improve intracluster synchronization routing as Normal priority.
Oct 8 2018, 7:29 PM · Clusters, Diffusion

Oct 5 2018

epriestley added a revision to T10884: Sort repository, database and notification services better (by network distance): D19735: Explicitly shuffle nodes before selecting one for cluster sync.
Oct 5 2018, 9:01 PM · Clusters

Sep 6 2018

epriestley added a comment to T10884: Sort repository, database and notification services better (by network distance).

See PHI860 and T13111. In the future, repository nodes may automatically gc/prune/repack. If they do, it may make sense to sort them to the bottom of the list so traffic is sent to them only if no other nodes are available, in order to minimize the impact that gc/prune/repack have on other activity.

Sep 6 2018, 6:43 PM · Clusters
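
The ordering idea in the comment above (send traffic to nodes that are busy with gc/prune/repack only when no other node is available) could be combined with the explicit shuffle from D19735 roughly as follows. This is a hedged sketch in Python rather than Phabricator code; `order_nodes` and the `busy` set are hypothetical, and the comment describes this only as something that may make sense in the future.

```python
import random
from typing import List, Optional, Set


def order_nodes(
    nodes: List[str],
    busy: Set[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Shuffle nodes, then move nodes doing maintenance to the bottom."""
    rng = rng or random.Random()
    shuffled = list(nodes)
    # Shuffle first so load spreads evenly across equivalent nodes.
    rng.shuffle(shuffled)
    # sorted() is stable: False (idle) sorts before True (busy), and the
    # shuffled order is preserved within each group.
    return sorted(shuffled, key=lambda node: node in busy)
```

A caller would then try nodes in the returned order, so a busy node is reached only after every idle node has been tried.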

Aug 27 2018

epriestley triaged T13192: Inactive repositories can cause "Repository Servers" to always report "Partial Sync?" as Low priority.
Aug 27 2018, 5:28 PM · Clusters, Diffusion

Jun 5 2018

joshuaspence added a member for Clusters: joshuaspence.
Jun 5 2018, 10:45 PM

Apr 12 2018

epriestley closed T10883: Allow repository cluster nodes to be read-only as Resolved by committing rP6556536d0615: Allow repository cluster bindings to be marked as not "writable", making them….
Apr 12 2018, 11:10 PM · Restricted Project, Diffusion, Clusters
epriestley added a revision to T10883: Allow repository cluster nodes to be read-only: D19357: Allow repository cluster bindings to be marked as not "writable", making them read-only.
Apr 12 2018, 9:09 PM · Restricted Project, Diffusion, Clusters
epriestley added a revision to T10883: Allow repository cluster nodes to be read-only: D19356: Give getAlmanacServiceURI() an "options" parameter to prepare for read-only devices.
Apr 12 2018, 9:00 PM · Restricted Project, Diffusion, Clusters
epriestley added a revision to T10883: Allow repository cluster nodes to be read-only: D19355: Turn the "closed" property on cluster repositories into a nice boolean.
Apr 12 2018, 8:38 PM · Restricted Project, Diffusion, Clusters

Feb 22 2018

epriestley renamed T13089: A full disk on a read replica database host can cause far-reaching request slowness? from A full disk on a read replica can cause far-reaching request slowness? to A full disk on a read replica database host can cause far-reaching request slowness?.
Feb 22 2018, 5:48 PM · Clusters, Infrastructure
epriestley renamed T13089: A full disk on a read replica database host can cause far-reaching request slowness? from A full disk on a read replica can kill everything to A full disk on a read replica can cause far-reaching request slowness?.
Feb 22 2018, 5:48 PM · Clusters, Infrastructure