Allow Phabricator to run in Read-Only Mode
OpenPublic

Description

This came up in IRC, but there's some interest in read-only Phabricator. Some use cases:

  • Lower-effort alternative to real HA which doesn't require intervention to switch or manage database masters. This was the driving use case here. Basically, trade writes away to get lower administrative costs.
  • Allow Phabricator to stay up in read-only mode while updating, performing maintenance, etc, without risking data loss. This has come up as a nice-to-have in the past.

I think we can do this relatively easily. Roughly:

  • Implement a read-only config flag. In this mode, we throw when establishing a write connection.
  • No-op all the nonessential writes we perform when read-only is set (user access logging and cache fills come to mind).
  • ApplicationSearch might be a little tricky, but should be able to switch to a pure GET mode.
  • Eventually, remove write options from the UI, too, to make it clearer to users that they can't do writes.

The only tricky part that's critical is that we need an alternate, writable session store. Some ideas on this:

  • The most obvious thing is to write things to SQLite or disk, but this seems fairly complicated, won't generalize, and won't scale across multiple readers.
  • We could write to a second MySQL, or a specific table in MySQL which is only present for readers. This would work, but require more work and configuration from everyone.
  • We could issue sessions without requiring storage:
    • Define a new readonly session type which is only valid for readonly installs.
    • The session cookie is <time, hash(time + user secret + application secret)> and good for a short period of time. Where we currently extend the session, we rotate the cookie instead.
    • This isn't as nice as a writable session store from a security perspective, but is much simpler than alternatives.

After thinking about this, I lean heavily toward just defining a read-only session type for dealing with auth, since it's like 10x less dedicated code than anything else, and can be made approximately as secure.

epriestley created this task.Via WebMar 7 2014, 5:44 PM
epriestley claimed this task.
epriestley added subscribers: epriestley, jbrown.
epriestley edited this Maniphest Task.Via LegacyMar 7 2014, 5:44 PM
jbrown added a subscriber: frgtn.Via WebMar 7 2014, 5:48 PM
joshuaspence added a subscriber: joshuaspence.Via WebJun 13 2014, 6:03 PM
bartus added a subscriber: bartus.Via WebJun 15 2014, 6:08 PM
jevripio added a subscriber: jevripio.Via WebJun 17 2014, 9:22 AM
hach-que added a subscriber: hach-que.Via WebAug 2 2014, 1:45 AM
chasemp awarded a token.Via WebAug 12 2014, 6:05 PM
chasemp added a subscriber: chasemp.
epriestley added a project: Phacility.Via WebNov 22 2014, 11:09 PM
epriestley added a subscriber: btrahan.

+@btrahan, per offline conversation.

We don't need this for v0, because we'll stop the world (prevent reads and writes to all instances) for schema changes. This shouldn't be hugely worse than a more subtle approach until we're at greater scale.

We probably don't need it for v1, since the v1 plan is probably double writes: push new code to all webservers, then migrate instances using full R/W stops one at a time.

Somewhere around v2 we could replace "stop all reads and writes" with some sort of dance like "put instance in readonly mode; take master out of the pool; stop replication; upgrade master; put master back online; take replica offline; resume writes; resume replication; return replica to pool once it catches up".

When moving logical instance DBs between physical DB servers, having readonly also lets us improve from "stop the instance, dump data, load data, start instance" to "put instance in readonly, dump data, load data, resume writes", and eventually to "dump data, load data, being replication, once replication catches up swap which node is master, resume writes".

Finally, this gives us a better disaster recovery story, where we might be able to restore a damaged instance as readonly (e.g., from the last backup) while recovering a better dataset (e.g., replaying replication logs).

I'm not sure if we'll feel more pressure to improve migrations or improve load spreading or improve disaster recovery (hopefully not) first. There may also be a variety of ways we can cheat on migrations (for example, migrations which only add new tables can just be pushed to the DB tier first without any downtime), so I'm also not sure how far we can get without building this.

However, since we're very likely to need it eventually and it will impact a lot of stuff (there's probably a fair amount of effort in making applications understand readonly mode so we get something serviceable out of it) it may be worthwhile to begin building sooner rather than later so we don't have any surprises if the need for it becomes more urgent later on.

epriestley moved this task to Do After Launch on the Phacility workboard.Via WebNov 22 2014, 11:09 PM
sokcevic added a subscriber: sokcevic.Via WebDec 1 2014, 5:54 PM
epriestley added a comment.Via WebDec 14 2014, 12:00 PM

Cluster tokens (T5955, D10990) will also need a readonly alternative once this starts getting built out.

epriestley moved this task to Do Eventually on the Phacility workboard.Via WebFeb 20 2015, 3:57 PM
epriestley added a comment.Via WebFri, May 15, 12:54 PM

In T8209, we hit an issue where a migration performed a cache fill as a side effect of loading handles. This created a problem because the cache fill happened from a migration that runs before the cache was introduced.

This sort of interaction is likely to become more common in the future now that we're dipping our toes into some level of readthrough caching (see T7707).

We could reduce the likelihood of surprises during migrations by introducing a "migration" level between "read/write" and "read only". This level would disable side-effect writes (cache fills, auxiliary logging, multimeter, xhprof profiling) but not complain about explicit writes. This is a little bit magical, but I think it's generally OK: it doesn't really add appreciable complexity over having "read/write" vs "read only" modes (code still needs to handle both cases), and it should meaningfully reduce our exposure to incidental writes during migrations. I think we'd increase stability overall by making this tradeoff between complexities.

Add Comment