Currently, Phabricator can be configured in a database cluster mode with a master database and zero or more read replicas.
While the master is reachable, we send no (normal) traffic to the read replicas. However, we could safely serve some traffic (like read-only logged-out traffic) from replicas. Doing so would reduce load on the master. The largest beneficiaries of this are likely to be active, open-source installs with a large public presence. Private installs may also benefit, but T11044 is probably more fruitful in general.
One additional class of beneficiary is installs with physically distant locations (e.g., a San Francisco office and a Mumbai office). Being able to send traffic originating in the Mumbai office to a local database server in Mumbai could improve performance substantially for users at that location.
The major technical problem that needs to be solved before we can support this is the "read-after-write" problem. It looks like this:
- You submit a comment on a task. Your comment is written to the master. This happens at T+0.
- We redirect back to the task page, /T123.
- The server processes the request for this page at T+1. This page load is read-only, so it is served by the replica.
- But! The replica has a replication delay! Your data isn't there yet. Your comment doesn't appear on the page.
- You reload at T+2, T+3, and T+4, then give up. Your comment is gone! You write a long passive-aggressive tweet about how Phabricator destroyed your data.
- At T+5, replication finishes and the comment would appear if you reloaded again.
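The timeline above can be reproduced with a toy model. This is only an illustrative sketch: the `Replica` class, the fixed 5-tick delay, and the tick-based clock are assumptions made for the example, not how Phabricator or MySQL replication is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica that applies each write `delay` ticks after the master."""
    delay: int
    pending: list = field(default_factory=list)  # (visible_at, value) pairs
    rows: list = field(default_factory=list)     # values already replicated

    def replicate(self, value, written_at):
        # The master accepted the write at `written_at`; the replica will
        # only see it once the replication delay has elapsed.
        self.pending.append((written_at + self.delay, value))

    def read(self, now):
        # Apply every write whose replication has finished by `now`.
        ready = [value for (visible_at, value) in self.pending if visible_at <= now]
        self.pending = [(t, v) for (t, v) in self.pending if t > now]
        self.rows.extend(ready)
        return list(self.rows)


replica = Replica(delay=5)
replica.replicate("my comment", written_at=0)  # comment written at T+0

for t in range(1, 6):
    visible = "my comment" in replica.read(now=t)
    print(f"T+{t}: comment visible on replica = {visible}")
```

Reads at T+1 through T+4 miss the comment, and only the read at T+5 sees it.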
See also some discussion starting at T1969#19465.
The easiest way to get started with this is probably to enable it for logged-out traffic only. This will let us navigate some special cases (like Conduit) relatively safely while avoiding the bulk of the read-after-write problem.
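A first-phase routing rule might look roughly like the sketch below. The function name, its parameters, and the choice to keep Conduit on the master are all assumptions made for illustration; none of this is existing Phabricator code.

```python
def choose_database(is_logged_in: bool, method: str, is_conduit: bool = False) -> str:
    """Return "replica" only for read-only, logged-out web traffic.

    Hypothetical helper: everything not clearly safe goes to the master.
    """
    if is_logged_in:
        # Logged-in users can write, so they can hit read-after-write.
        return "master"
    if is_conduit:
        # Assumption: special cases such as Conduit stay on the master
        # until they have been examined individually.
        return "master"
    if method.upper() not in ("GET", "HEAD"):
        # Anything that is not a plain read is treated as a possible write.
        return "master"
    return "replica"


print(choose_database(is_logged_in=False, method="GET"))  # logged-out page view
print(choose_database(is_logged_in=True, method="GET"))   # logged-in page view
```

Logged-out viewers cannot have just submitted a comment, which is why this slice of traffic mostly sidesteps the read-after-write problem.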
Afterwards, there are two major strategies we could pursue:
- block on the server until replication completes after a write;
- identify clients that recently wrote, and send their reads to the master for a while.
The second strategy is probably somewhat better for users, but also more complicated.
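Both strategies can be sketched in a few lines. Everything here is hypothetical: the 30-second window, the `StickyRouter` and `wait_for_replica` names, and the idea of comparing numeric replication positions are assumptions chosen to keep the example small, and a real implementation would need to decide how clients are identified and how replica progress is measured.

```python
import time

STICKY_SECONDS = 30.0  # hypothetical "recently wrote" window


class StickyRouter:
    """Second strategy: send recent writers' reads to the master for a while."""

    def __init__(self, window=STICKY_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.last_write = {}  # client id -> time of most recent write

    def record_write(self, client_id):
        self.last_write[client_id] = self.clock()

    def route_read(self, client_id):
        wrote_at = self.last_write.get(client_id)
        if wrote_at is not None and self.clock() - wrote_at < self.window:
            return "master"
        return "replica"


def wait_for_replica(get_replica_position, target_position, timeout=5.0,
                     poll=0.05, clock=time.monotonic, sleep=time.sleep):
    """First strategy: block until the replica has caught up to a write.

    `get_replica_position` is a caller-supplied function returning how far
    the replica has replicated. Returns False if the timeout expires first.
    """
    deadline = clock() + timeout
    while get_replica_position() < target_position:
        if clock() >= deadline:
            return False
        sleep(poll)
    return True


router = StickyRouter()
router.record_write("alice")
print(router.route_read("alice"))  # recently wrote
print(router.route_read("bob"))    # never wrote
```

The blocking version makes the writer pay the replication delay on every write, while the sticky version hides the delay at the cost of tracking per-client state.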
We may also need to identify pages which will write, so that any reads they perform before writing can go to the master. POST is a rough approximation of this. Presence of a valid CSRF token is another rough approximation. We're safe as long as the set we identify is a superset of the pages that actually write.
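One conservative way to combine the two approximations is to treat a request as a possible writer if either signal is present. The function below is a hypothetical sketch (the name and parameters are invented for the example). Combining the signals with "or" can only grow the identified set, which moves in the safe direction: a false positive costs some extra master load, while a missed writer is the unsafe case.

```python
def may_write(method: str, has_valid_csrf_token: bool) -> bool:
    """Guess whether a request might write; deliberately over-inclusive."""
    return method.upper() == "POST" or has_valid_csrf_token


def route(method: str, has_valid_csrf_token: bool) -> str:
    # Possible writers do all of their reads on the master, so they never
    # base a write on stale replica data.
    return "master" if may_write(method, has_valid_csrf_token) else "replica"


print(route("GET", has_valid_csrf_token=False))
print(route("POST", has_valid_csrf_token=False))
```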
We can also detect, at the moment we're about to write, that the request has already performed bad reads (reads served by a replica). Detecting this keeps the outcome from being unsafe, but we'd have to fatal at that point.
Likewise, we can detect when we went to the master and then didn't actually attempt a write to get a sense of how many false positives we're detecting.
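Both checks amount to a small amount of per-request bookkeeping. The sketch below is hypothetical: `RequestAudit`, `StaleReadBeforeWrite`, and the hook names are invented for illustration, and it assumes the database layer can call these hooks on every read and write.

```python
class StaleReadBeforeWrite(RuntimeError):
    """Raised when a request tries to write after reading from a replica."""


class RequestAudit:
    """Tracks, for one request, where reads went and whether a write occurred."""

    def __init__(self, routed_to: str):
        self.routed_to = routed_to  # "master" or "replica"
        self.read_from_replica = False
        self.wrote = False

    def on_read(self):
        if self.routed_to == "replica":
            self.read_from_replica = True

    def on_write(self):
        if self.read_from_replica:
            # Earlier reads may have been stale, so abort rather than write
            # based on them.
            raise StaleReadBeforeWrite("write attempted after replica reads")
        self.wrote = True

    def is_false_positive(self) -> bool:
        # The request was sent to the master as a possible writer, but it
        # never attempted a write.
        return self.routed_to == "master" and not self.wrote


audit = RequestAudit(routed_to="master")
audit.on_read()
print(audit.is_false_positive())
```

Counting requests for which `is_false_positive()` holds at the end of the request would give a rough measure of how over-inclusive the write detection is.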