Diffusion Phabricator ebff07d01983

Automatically sever databases after prolonged unreachability
ebff07d01983
Tags

None

Referenced Files

	F1213397: Screen Shot 2016-04-10 at 2.50.41 PM.png
	Apr 11 2016, 3:43 PM

Description

Automatically sever databases after prolonged unreachability

Summary:
Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We do still send a little bit of traffic, and if the master recovers we'll return to normal mode after seeing several connections in a row succeed.
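
As a rough illustration of the difference between the two modes, the sketch below routes a read either through the master (falling back on failure) or straight to a replica when the master is severed. It is not the code from this commit: `open_read_connection`, `Unreachable`, and the two connect callables are hypothetical names, and the occasional health probes that still reach the master in severed mode are left out.

```python
from typing import Callable


class Unreachable(Exception):
    """Raised by a (hypothetical) connect callable when a host is down."""


def open_read_connection(
    connect_master: Callable[[], str],
    connect_replica: Callable[[], str],
    severed: bool,
) -> str:
    """Return a connection for a read, preferring the master unless severed."""
    if not severed:
        try:
            return connect_master()
        except Unreachable:
            # Normal mode: we pay for the failed master attempt on every
            # request, then fall back to the replica.
            return connect_replica()
    # Severed mode: ordinary traffic skips the master entirely, which is
    # faster for users and puts less load on the struggling database.
    return connect_replica()
```

With a master that always fails, the normal path makes one master attempt per request before falling back, while the severed path makes none.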

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

We do at most one health check every 3 seconds.
If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.
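
The three rules above can be sketched as a small health record. This is only an illustration under stated assumptions, not the contents of `PhabricatorDatabaseHealthRecord.php`: the class and method names are made up, the clock is injectable so the logic can be exercised without waiting, and the timeout and retry values are placeholders (the commit message does not give them).

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional

CHECK_INTERVAL_SECONDS = 3  # at most one health check every 3 seconds
REQUIRED_STREAK = 5         # 5 identical results in a row flip the state


class HealthRecord:
    """Hypothetical tracker for recent health checks against one database."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_check: Optional[float] = None
        self._recent: Deque[bool] = deque(maxlen=REQUIRED_STREAK)
        self.healthy = True

    def should_check(self) -> bool:
        """True if enough time has passed to record another check."""
        if self._last_check is None:
            return True
        return self._clock() - self._last_check >= CHECK_INTERVAL_SECONDS

    def record(self, ok: bool) -> None:
        """Record a check result, ignoring results that arrive too soon."""
        if not self.should_check():
            return
        self._last_check = self._clock()
        self._recent.append(ok)
        if len(self._recent) == REQUIRED_STREAK:
            if not any(self._recent):
                self.healthy = False  # 5 failures in a row: sever
            elif all(self._recent):
                self.healthy = True   # 5 successes in a row: un-sever
            # A mixed window leaves the current state unchanged.

    def connection_options(self) -> Dict[str, int]:
        """Placeholder values: shorter timeout, fewer retries when unhealthy."""
        if self.healthy:
            return {"timeout_seconds": 10, "retries": 3}
        return {"timeout_seconds": 2, "retries": 1}
```

Because a flip needs a full run of identical results, a single stray success or failure does not change the state.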

Test Plan:

Configured a bad master.
Browsed around for a bit, initially saw "unreachable master" errors.
After about 15 seconds, saw "major interruption" errors instead.
Fixed the config for master.
Browsed around for a while longer.
After about 15 seconds, things recovered.
Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good (see F1213397).

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15677

Details

Provenance

epriestley	Authored on Apr 10 2016, 9:18 PM
epriestley	Pushed on Apr 11 2016, 3:43 PM

Differential Revision

D15677: Automatically sever databases after prolonged unreachability

Parents

rP5cf09f567a98: Fix an issue with date parsing when viewer timezone differs from server timezone

Branches

Unknown

Tags

Unknown

Tasks

T4571: Allow Phabricator to run in Read-Only Mode

Build Status

Buildable 11652
Build 14577: Run Core Tests

Event Timeline

epriestley committed rPebff07d01983: Automatically sever databases after prolonged unreachability (authored by epriestley). Apr 11 2016, 3:43 PM

epriestley added a task: T4571: Allow Phabricator to run in Read-Only Mode.

Harbormaster failed to build B11652: rPebff07d01983: Automatically sever databases after prolonged unreachability! Apr 11 2016, 3:44 PM

Harbormaster completed building B11652: rPebff07d01983: Automatically sever databases after prolonged unreachability.

oh, harbormaster

That stuff is me pushing the host before the tests can finish, so they fatal when trying to report.

I should probably make that a little more forgiving. I also need to regenerate a new quickstart.sql at some point so the tests run faster.

Changes (6)

Path

src/__phutil_library_map__.php
src/applications/cache/PhabricatorCaches.php
src/applications/config/controller/PhabricatorConfigClusterDatabasesController.php
src/infrastructure/cluster/PhabricatorDatabaseHealthRecord.php
src/infrastructure/cluster/PhabricatorDatabaseRef.php
src/infrastructure/env/PhabricatorEnv.php