
Phacility (Blockers)
Closed, ResolvedPublic

Description

Umbrella task for Phacility blockers.

Related Objects

14 restricted subtasks: 12 Resolved, 1 Invalid, 1 Spite (mostly assigned to epriestley).

Event Timeline

epriestley added a project: Phacility.
epriestley added subscribers: epriestley, btrahan, chad.

I think we can do this without putting any state on the web tier (see discussion in https://secure.phabricator.com/T2775#comment-6). Basically, we'd just have a pool of web machines (like Facebook does), not dedicated machines per-install. Critically, this means we don't need on-demand provisioning, and thus don't need Drydock (we just need a much simpler working-copy service).

I'm going to proceed forward under the assumption that we'll implement this simpler architecture.

Okay, so here's how this is shaping up overall:

User / Provisioning Flow

  • User signs up for an account on phacility.com, which takes them into Phortune (T2787).
  • They purchase an install, add account/billing info, etc.
  • Phortune updates Phlux (T2793 / T2792), writing the host information into it.
  • (Some handwaving here because there are a few more setup steps that need to happen, like creating the database, creating the user's account, provisioning EBS, and attaching it to a machine in the daemon tier, but this is all simple and can be a thin layer on top of the task queue with a prettier UI. I'll make this more concrete; a rough sketch of these steps follows this list. EBS doesn't need to block the install coming online, so this should only take a few seconds.)
  • User is redirected to their install and logged in. They can create user accounts, configure it, etc: everything works.
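As promised in the handwaving item above, here is a rough sketch of how those setup steps could sit on top of the task queue. Everything here is hypothetical illustration (the function names, the in-memory queue, the dict standing in for Phlux); it is not existing Phacility code.

```
"""Hypothetical sketch of the provisioning flow as task-queue steps."""
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

PHLUX = {}  # stand-in for the Phlux key/value store


@dataclass
class Install:
    name: str          # e.g. "acme", served at acme.phacility.com
    admin_email: str


@dataclass
class TaskQueue:
    tasks: List[Tuple[Callable, tuple]] = field(default_factory=list)

    def push(self, fn: Callable, *args) -> None:
        self.tasks.append((fn, args))

    def run(self) -> None:
        for fn, args in self.tasks:
            fn(*args)


def create_database(install: Install) -> None:
    print(f"CREATE DATABASE for `{install.name}` ...")   # per-install schema


def create_admin_account(install: Install) -> None:
    print(f"create admin {install.admin_email} on {install.name}")


def write_phlux_config(install: Install) -> None:
    # Map the install's hostname to its shard; the web tier reads this back
    # on every request (see "Request Flow" below). Hostnames are made up.
    PHLUX[f"install.{install.name}"] = {
        "db.host": "db001.phacility.net",
        "daemon.host": "daemon001.phacility.net",
        "install.version": 239,
    }


def provision_ebs_and_attach(install: Install) -> None:
    # Does not block the install coming online; runs asynchronously.
    print(f"provision EBS volume for {install.name}, attach to daemon tier")


def provision(install: Install, queue: TaskQueue) -> None:
    queue.push(create_database, install)
    queue.push(create_admin_account, install)
    queue.push(write_phlux_config, install)
    queue.push(provision_ebs_and_attach, install)


if __name__ == "__main__":
    q = TaskQueue()
    provision(Install("acme", "admin@example.com"), q)
    q.run()
```

The only part the user has to wait on is the first three steps; the EBS attachment can trail behind because repositories aren't needed to log in and configure the install.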

Request Flow

  • Wildcard DNS is pointed at some Layer 4 LB.
  • That LB (or some HAProxy in a pool behind it) terminates SSL and balances the request into a homogeneous web tier.
  • Phabricator starts up and hits Phlux on the Phacility master to figure out which install is on the request's Host (basically, it examines the host header and makes a Phlux call to get configuration).
  • It dumps this config (basically, just DB credentials and a couple of locks) into the environment stack.
  • Phabricator continues normally, just with slightly altered config unique to the install (the lookup is sketched below).
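A minimal sketch of that per-request lookup, assuming a Phlux-style key/value read keyed on the Host header. The phlux_get() helper and the non-mysql.* keys are made up for illustration; the real call would be a Conduit/Phlux request against the Phacility master.

```
"""Hypothetical sketch of per-request config resolution on the web tier."""


def resolve_install_config(host_header: str, phlux_get) -> dict:
    # "acme.phacility.com" -> install name "acme"
    install = host_header.split(".", 1)[0]

    config = phlux_get(f"install.{install}")
    if config is None:
        raise LookupError(f"no install configured for host {host_header!r}")

    # During a push (see "Upgrade Flow"), push.version runs ahead of the
    # install's own version until its schemata are upgraded; show the
    # maintenance page in that window.
    if phlux_get("push.version") > config["install.version"]:
        raise RuntimeError(f"{install} is down for maintenance")

    # Only this small overlay goes onto the environment stack; everything
    # else about the web tier is identical across installs.
    return {
        "mysql.host": config["db.host"],
        "mysql.user": config.get("db.user", install),
        "repository.daemon-host": config["daemon.host"],  # hypothetical key
    }
```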

Tiers

  • LBs: Initially, a single layer 4 LB. This can be either ELB or HAProxy. ELB won't let us balance SSH, but may be a lot cheaper / higher throughput than running HAProxy on instances. This probably needs to be HAProxy on an instance before too long, though, so maybe that's version 0. At some point, we get redundancy into this tier so it's not a single point of failure.
  • Web: The web tier is a larger, homogeneous tier which receives both HTTP and SSH traffic. It uses Phlux to read configuration.
  • Daemon/Repo: The daemon tier runs daemons and mounts EBS blocks for them to check out repositories on. It also runs Phabricator, but does not serve web requests: it only serves VCS and Conduit over SSH. The web tier connects to it to execute Conduit commands against repositories (T2783), and routes SSH to it (see the routing sketch after this list).
  • Database: The database tier has databases. I am a genius.
  • Phacility: We should probably keep Phacility itself separate, I guess, and give it its own DB/Web machines. I think it has enough differences from the general install pool that it's worth treating specially. The new secure.phabricator.com can drop into the pool homogeneously, though (I guess as phabricator.phacility.com).
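To make the split between the web and daemon/repo tiers concrete, here is a tiny, hypothetical routing sketch; the hostnames and config keys are assumptions, not real code.

```
"""Hypothetical sketch of how the web tier picks a backend per the tier
layout above."""


def pick_backend(install_config: dict, protocol: str) -> str:
    # HTTP is answered directly by any machine in the homogeneous web tier;
    # VCS and Conduit over SSH are forwarded to the daemon/repo machine that
    # has this install's EBS-backed working copies mounted.
    if protocol == "http":
        return "localhost"
    if protocol in ("ssh", "conduit"):
        return install_config["daemon.host"]   # e.g. "daemon001.phacility.net"
    raise ValueError(f"unknown protocol: {protocol!r}")
```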

Upgrade Flow

  • We bump a variable in Phlux which brings every install down with a "maintenance" message (e.g., push.version from 239 to 240).
  • We stop the daemons, and push the web and daemon tiers with new code.
  • We upgrade all the schemata. When an install finishes upgrading, we bump an install-specific key in Phlux (e.g., install.version from 239 to 240), which brings it back online.
  • This sucks because we can't upgrade installs one at a time, but these steps can happen mostly in parallel and should probably mean very little downtime until we get very large. There are a variety of strategies we can use to attack this without enormous increases in complexity. The other cost is that if we push something bad, everyone is screwed at once. I think we can absorb these costs in the short-to-mid term, and the benefit of not needing Drydock or any real provisioning is enormous (a sketch of the push sequence follows this list). On the costs in particular:
    • Downtime is not optimal, but should be small.
    • We can be more careful about writing schema patches, now that they have a more concrete cost (e.g., allow noncritical migrations to happen via task queue).
    • The web tier could queue these requests and poll for the version bump, so users would have no apparent downtime for upgrades of less than, say, 15 seconds. This might be a bad idea for other reasons, though.
    • Open source installs effectively provide a cutting edge / beta tier, to at least some degree, so it isn't like we're completely blind when we push.
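Here is the push sequence from the list above as a hypothetical orchestration sketch. The phlux_get()/phlux_set() helpers, the host lists, and the /core/phabricator path are assumptions; bin/phd and bin/storage upgrade are the stock Phabricator tools, and per-install targeting details are hand-waved.

```
"""Hypothetical sketch of the upgrade/push sequence described above."""
import subprocess


def push_new_version(installs, web_hosts, daemon_hosts,
                     phlux_get, phlux_set, new_version):
    # 1. Bring every install down with a maintenance message.
    phlux_set("push.version", new_version)

    # 2. Stop the daemons, then push new code to the web and daemon tiers.
    for host in daemon_hosts:
        subprocess.run(["ssh", host, "/core/phabricator/bin/phd", "stop"],
                       check=True)
    for host in web_hosts + daemon_hosts:
        subprocess.run(["ssh", host, "git", "-C", "/core/phabricator", "pull"],
                       check=True)

    # 3. Upgrade each install's schemata, and bring each install back online
    #    as soon as its own migration finishes.
    for install in installs:
        subprocess.run(["ssh", install["db_host"],
                        "/core/phabricator/bin/storage", "upgrade", "--force"],
                       check=True)
        config = phlux_get(f"install.{install['name']}")
        config["install.version"] = new_version
        phlux_set(f"install.{install['name']}", config)

    # 4. Restart the daemons on the new code.
    for host in daemon_hosts:
        subprocess.run(["ssh", host, "/core/phabricator/bin/phd", "start"],
                       check=True)
```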

> The other cost is that if we push something bad, everyone is screwed at once.

We just have to be less careless. D5371 was kind of embarrassing (in my defence, I was wondering whether I should put the braces the other way).

> mounts EBS blocks

I have the feeling that roughly every three months, all EBS volumes within a region fail in some way. It's annoying, and at some point parts of EBS blocks also end up permanently corrupted. Netflix relies entirely on instance store, which is plentiful on many instance types (plus a ramdisk for software).

Regarding the DB tier, I'm not sure whether you're planning on RDS or EC2, but you should really include something like automatic server restarts when something goes wrong. I'm not talking about DB failures here, but about "Too many connections" errors: they're annoying and take minutes to disappear. (I have observed this error on secure.phabricator.com, too, by the way.)
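For what it's worth, the kind of automatic recovery I mean could be as dumb as a watchdog that probes MySQL and restarts it when it hits "Too many connections" (MySQL error 1040). A rough sketch, assuming shell access to the DB host and a service-managed mysqld; the restart command and probe credentials are assumptions about the environment.

```
"""Minimal sketch of a "Too many connections" watchdog."""
import subprocess
import time


def mysql_is_healthy() -> bool:
    probe = subprocess.run(
        ["mysql", "--connect-timeout=5", "-e", "SELECT 1"],
        capture_output=True, text=True)
    if probe.returncode == 0:
        return True
    # Error 1040 is "Too many connections"; anything else is left for a human.
    return "Too many connections" not in probe.stderr


def watchdog(poll_seconds: int = 60) -> None:
    while True:
        if not mysql_is_healthy():
            # Restarting is blunt, but beats minutes of rejected requests.
            subprocess.run(["sudo", "service", "mysql", "restart"], check=False)
        time.sleep(poll_seconds)
```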

chad closed subtask Restricted Maniphest Task as Resolved. Jul 30 2014, 2:22 AM
epriestley closed subtask Restricted Maniphest Task as Resolved. Nov 5 2014, 11:30 PM
epriestley closed subtask Restricted Maniphest Task as Resolved. Nov 16 2014, 2:21 PM
epriestley closed subtask Restricted Maniphest Task as Resolved. Nov 16 2014, 8:22 PM
epriestley changed the visibility from "All Users" to "Public (No Login Required)". Nov 22 2014, 3:23 PM
epriestley claimed this task.

This task is obsoleted by the Phacility board.