Automatically sever databases after prolonged unreachability
Closed, Public

Authored by epriestley on Apr 10 2016, 9:51 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We still send a small amount of traffic to the master, and if it recovers we return to normal mode after seeing several connections in a row succeed.

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

  • We do at most one health check every 3 seconds.
  • If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
  • If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.
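
The rules above can be sketched as a small state machine. This is an illustration only, not the code in this diff: the `HealthRecord` class, its method names, and the timeout and retry values are hypothetical. Only the 3-second interval and the 5-check threshold come from the summary.

```python
from collections import deque
from dataclasses import dataclass, field

CHECK_INTERVAL = 3.0  # seconds between health checks (from the summary)
REQUIRED_CHECKS = 5   # identical results in a row needed to switch modes


@dataclass
class HealthRecord:
    """Hypothetical tracker for one database's recent health checks."""

    is_healthy: bool = True
    last_check: float = float("-inf")
    recent: deque = field(default_factory=lambda: deque(maxlen=REQUIRED_CHECKS))

    def should_check(self, now: float) -> bool:
        # Rate limit: at most one health check every CHECK_INTERVAL seconds.
        return now - self.last_check >= CHECK_INTERVAL

    def record(self, now: float, reachable: bool) -> None:
        # Connection attempts that arrive too soon do not count as checks.
        if not self.should_check(now):
            return
        self.last_check = now
        self.recent.append(reachable)
        if len(self.recent) < REQUIRED_CHECKS:
            return
        if not any(self.recent):
            self.is_healthy = False  # five failures in a row: sever
        elif all(self.recent):
            self.is_healthy = True   # five successes in a row: un-sever

    def connection_policy(self) -> dict:
        # Assumed values: the summary only says timeouts and retries are
        # reduced while the database is marked unhealthy.
        if self.is_healthy:
            return {"timeout": 10.0, "retries": 3}
        return {"timeout": 2.0, "retries": 1}
```

With checks spaced 3 seconds apart, the fifth consecutive failure lands roughly 12 to 15 seconds after the first, which matches the "about 15 seconds" figure. A mixed window of results leaves the current mode unchanged, so a single lucky connection does not un-sever the database.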
Test Plan
  • Configured a bad master.
  • Browsed around for a bit, initially saw "unreachable master" errors.
  • After about 15 seconds, saw "major interruption" errors instead.
  • Fixed the config for master.
  • Browsed around for a while longer.
  • After about 15 seconds, things recovered.
  • Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good:

Screen Shot 2016-04-10 at 2.50.41 PM.png (136×658 px, 25 KB)

Diff Detail

Repository
rP Phabricator
Branch
readonly12
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 11647
Build 14570: Run Core Tests
Build 14569: arc lint + arc unit

Event Timeline

epriestley retitled this revision to Automatically sever databases after prolonged unreachability.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
This revision is now accepted and ready to land. Apr 10 2016, 10:36 PM
This revision was automatically updated to reflect the committed changes.