
Reduce the impact of "admin" backups on instances
Open, Normal, Public


The database backups on admin run around 6:30 PM PST and currently cause sustained load for what looks like ~10-ish minutes. During this time, instance performance can be negatively impacted.

Reduce Backup Size

The most direct lever we can pull is to reduce the size of this backup so it runs for a shorter period of time. From bin/storage probe, the largest tables are:

    file_storageblob                                477.5 MB  3.3%
admin_conduit                                     1,050.5 MB  7.2%
    conduit_certificatetoken                          0.0 MB  0.0%
    conduit_token                                     0.3 MB  0.0%
    conduit_methodcalllog                         1,050.2 MB  7.2%
admin_leylines                                   12,355.2 MB  84.6%
    leylines_checkpoint                               0.0 MB  0.0%
    leylines_checkpointtransaction                    0.1 MB  0.0%
    leylines_dimension                               26.0 MB  0.2%
    leylines_event                               12,329.0 MB  84.4%
TOTAL                                            14,601.7 MB  100.0%
  • More than 90% of this data is basically junk we don't need to be backing up.
  • The Leylines table is just user tracking for ad clicks.
    • Short term: archive the data and truncate the table.
    • Mid term: I think we can filter and throw away most of this data (a lot of it is, e.g., crawlers hitting one page and not setting cookies).
    • Long term: Probably move "data warehousing" to separate hardware with more appropriate backup behavior.
  • The conduit_methodcalllog table uses a 7-day window but instances make many calls to pull service metadata.
    • Short term: This table has little value. We could reduce this window to 24 hours without really losing anything.
  • The file blob table is also kind of big. It looks like S3 is configured, but there may be some files we can get rid of or migrate (e.g., old data from before S3 got set up) and/or the MySQL maximum storage size might be set a little high.
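The short-term "archive and truncate" step for Leylines could look roughly like this. This is a sketch only: the archive path, credentials, and exact `mysqldump` flags are assumptions about the real setup, not the actual procedure.

```shell
# Build a dated archive filename so old dumps are easy to identify later.
archive_name() {
  printf 'leylines_event.%s.sql.gz' "$(date -u +%Y%m%d)"
}

# Hypothetical archive-then-truncate for the big event table. Assumes the
# bak volume is mounted at /bak and credentials come from the environment.
archive_and_truncate() {
  out="/bak/archives/$(archive_name)"
  # Dump only the large table, compress it, and truncate only on success.
  mysqldump --single-transaction admin_leylines leylines_event | gzip > "$out" &&
    mysql admin_leylines -e 'TRUNCATE TABLE leylines_event;'
}
```

Truncating only after the dump pipeline succeeds (`&&`) avoids throwing the data away if the archive step fails.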

Reduce Backup Impact

Backups are written to a separate bak volume so the writes shouldn't cause I/O contention, in theory. The read from the database will cause some general load but I don't think there's a ton we can do about that.

I'm guessing the gzip might actually be responsible for a big chunk of the load. We could try to verify that, then try:

  • Using nice to reduce the CPU priority of gzip.
  • Using faster but less-thorough compression (gzip --fast).
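Both ideas can be combined in the compression step of the pipeline. A minimal sketch, assuming the dump pipeline is roughly `mysqldump | gzip > file` (the dump command and output path here are illustrative, not the real configuration):

```shell
# Run gzip at the lowest CPU scheduling priority with the fastest
# compression level, so it competes less with request-serving processes.
compress_low_priority() {
  # nice -n 19 = lowest priority; --fast (level 1) trades compression
  # ratio for much less CPU time.
  nice -n 19 gzip --fast
}

# Illustrative usage:
#   mysqldump --single-transaction admin_leylines | compress_low_priority > /bak/admin.sql.gz
```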

Reduce Load Impact

We could reduce the impact that load on admin has on instances, so that a busy admin host has less effect on other services.

  • The method is currently much slower than it needs to be; T12297 makes a start on fixing it. I believe there are substantial improvements to be made here.
  • We could put fancier things in place here (timeouts, a separate cache layer) but I think these increase complexity and are overreactions to the immediate issue. When we eventually make this component more sophisticated I think we'll have a different set of problems we want to use to pick a solution -- mostly, the ability to push configuration changes quickly so we can, e.g., blacklist a client address with a single configuration change or quickly push instances into and out of read-only/maintenance mode during deployments.

Align Risk and Response

We can move this process from 6:30 PM PST (lower staff availability) to 11:00 AM PST (better staff availability) to improve our ability to respond to any future issues.

Long Term: Dump Snapshots from Replicas

The long-term fix here is to dump snapshots from replicas, not masters. There's currently no admin replica, but we could move toward a multi-node admin tier. This is not completely trivial, since db + repo + web are all overlaid on the same hosts, but it's likely not very difficult given that we've run a similar setup on secure for quite a while without any real issues.

These things would also need to become replica-aware:

  • The backup process needs to know that it should be acting on the replica, not the master.
  • The backup process should probably sanity-check the replica (replicating + from the right master + not too far behind the master) before running, so we don't get a green light from backups and later discover we actually made stale backups of the wrong thing.
  • The provisioning process needs to know that bak volumes only mount on replicas when a master + replica is configured.
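The sanity check in the second bullet could be sketched as a predicate over `SHOW SLAVE STATUS` output. The field names below come from MySQL's replica status output; the lag threshold and expected-master argument are assumptions:

```shell
# Hypothetical pre-backup replica check.
# $1 = output of `mysql -e 'SHOW SLAVE STATUS\G'`, $2 = expected master host.
replica_ok() {
  status="$1"; master="$2"
  # Both replication threads must be running...
  printf '%s\n' "$status" | grep -q 'Slave_IO_Running: Yes'  || return 1
  printf '%s\n' "$status" | grep -q 'Slave_SQL_Running: Yes' || return 1
  # ...replicating from the master we expect...
  printf '%s\n' "$status" | grep -q "Master_Host: $master"   || return 1
  # ...and not too far behind (threshold here is an arbitrary 5 minutes).
  lag=$(printf '%s\n' "$status" | awk '/Seconds_Behind_Master:/ {print $2}')
  [ -n "$lag" ] && [ "$lag" -le 300 ]
}
```

If any check fails, the backup process should refuse to run (and raise an alarm) rather than silently produce a stale or wrong snapshot.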

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
rP Phabricator

Event Timeline

epriestley added a revision: Restricted Differential Revision.Apr 26 2017, 7:20 PM

A big part of this appears to be that dumping the huge Leylines table locks it, so requests can't record their own Leylines data:

| 27164749 | web-xxx | localhost | admin_leylines | Query | 98 | Waiting for table level lock |
  INSERT INTO `leylines_event`
    (agentKey, sessionKey, type, xDimensionID, yDimensionID, zDimensionID, epoch)
    VALUES ('xxx', 'xxx', 'aura', 0, 0, 0, 1493234603),
           ('xxx', 'xxx', 'halo', 7, 0, 0, 1493234603) |

One "fix" for this would be to disable Leylines for Conduit (or, as a heavier hammer, disable it for all intracluster requests).
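Another angle on the lock itself: by default `mysqldump` holds `LOCK TABLES` while dumping, which is what blocks those INSERTs. Assuming `leylines_event` is on a transactional engine (InnoDB), dumping with `--single-transaction` reads from a consistent snapshot instead of locking. A sketch (the flag set here is an assumption about what the backup script should pass, not what it currently does):

```shell
# Build a dump command that avoids table-level locks on InnoDB tables.
# $1 = database name; prints the command line as a string.
dump_cmd() {
  # --single-transaction: consistent snapshot, no LOCK TABLES.
  # --quick: stream rows instead of buffering whole tables in memory.
  printf 'mysqldump --single-transaction --quick %s' "$1"
}

# Illustrative usage:
#   $(dump_cmd admin_leylines) | gzip > /bak/admin_leylines.sql.gz
```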

epriestley added a revision: Restricted Differential Revision.Apr 26 2017, 7:49 PM

I used bin/files migrate to move some old files (stored in MySQL before S3 was configured) over to S3. The file blob table is now down to 160MB from 477MB.

epriestley added a commit: Restricted Diffusion Commit.Apr 26 2017, 8:05 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.

Deploying Leylines + GC changes to admin now.

I think the Leylines change didn't actually work because we make the call via the LB, so the request appears to be coming from outside the cluster.

GC has caught up to Conduit and the conduit_methodcalllog table is down to 160MB from ~1GB.

Leylines is down to 26MB but needs a better fix for the internal traffic since it's accumulating at the same rate as before.

The new backup stuff should run in a couple of hours.

I'm not going to try to time-shift this yet since everyone is going on vacation forever, but I'll probably write a bin/trigger adjust command or something. I could imagine instances wanting to be able to move instance backups to off-peak for their users in the future, and if we set up a mechanism for that it'll be easier to add a UI button later.

The next actions I plan to take here are:

  • Fix bin/host download (T12651). This isn't exactly related but it came up during other ops work yesterday.
  • Fix the rCORGI dependency (T12652). This came up yesterday when I was deploying admin to update the GC settings.
  • Fix the adjustment to Leylines so that we really stop logging metrics for internal Conduit traffic, likely by letting the event handler have a little more information about the request.
  • Attempt to improve the performance of the call (per T12297).

Two other thoughts:

  • It might be nice to add an X-Phabricator-Origin: HTTP header to intracluster and pseudo-intracluster requests. When a web or repo host makes a request for instance details, identifying the requesting host is currently somewhat involved (X-Forwarded-For). This data isn't available in the access logs right now, although it should be once the application HTTP log deploys. It will also be legitimately unavailable for requests which hit the external load balancer once we put a NAT gateway in place (T11336) since the traffic would go out through the NAT gateway, over the internet, and hit the external interface on the LB.
  • Alternatively (or additionally) it would probably be nice to send this traffic through an internal load balancer, instead of through the external interfaces on alb001. That would preserve the origin node in X-Forwarded-For, could give us somewhat better options for monitoring and responding to issues, and generally keep cluster traffic in the cluster. I don't see a particularly strong reason why we definitely want to do this (and it does make overall configuration a bit more complicated) but it feels a little cleaner.
epriestley added a revision: Restricted Differential Revision.Apr 28 2017, 3:07 PM

Attempt to improve the performance of the call

From XHProf, these actually appear reasonable already. I'm seeing ~20ish milliseconds from XHProf, but ~100ish milliseconds from instances. It's possible that the 80ms discrepancy is from Leylines, which fires writes after XHProf finishes. I'm going to deploy D17804 first and see how things look after that.

epriestley added a commit: Restricted Diffusion Commit.Apr 28 2017, 3:33 PM
epriestley added a commit: Restricted Diffusion Commit.

One broad issue is that admin sees a generally high, sustained request rate for these calls (up to dozens per second). I think the uneven performance I've sometimes observed is just load-related, and that this load is sometimes high enough to max out the Apache workers and cause queue stalls.

These are presumably coming from the web and repo tiers to fetch instance status information ("does this instance exist?"), although I can't conclusively rule out a bug somewhere (either X-Phabricator-Origin or a preserved X-Forwarded-For would help a bit, per above).

I think this request rate is just expected, though. The cache has a 5 minute TTL, but dividing the number of active instances by the TTL lands us in the ballpark of the request rate.
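The back-of-envelope math here is just instances divided by TTL. The instance count below is a made-up illustration (the real number isn't stated here), but it shows how a 5-minute TTL lands in the "dozens per second" range:

```shell
# Steady-state cache-refill rate ~= active instances / cache TTL.
# 5000 is an assumed instance count for illustration only.
instances=5000
ttl=300   # cache TTL in seconds (5 minutes)
printf '%s\n' $((instances / ttl))   # ~16 refill requests per second
```

Raising the TTL divides this rate proportionally, which is why the "push out cache dirties, then raise the TTL" option below reduces load by an arbitrary factor.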

In almost all cases, these cache fetches are fruitless, because it is very rare for instances to change status (normally, once every several months). As with other infrastructure, the current cache design makes fairly good sense when the tier is revenue-generating (it's very simple, we can just add more admin hosts to scale it, and instances get updated promptly) but does not scale gracefully to a mostly-free tier.

Some approaches we could take:

  • Add more admin hardware (pro: something we should do eventually anyway; con: complex, doesn't really solve the problem, just sweeps it under the rug).
  • Push out cache dirties, then raise the TTL (pro: reduces load in a steady state by an arbitrary factor; con: complex, could still hit thundering herd problems, opens more windows for cache inconsistencies).
  • Change instances to bulk-fetch caches (pro: probably much better for repo hosts, limits thundering herd; con: probably much worse for web hosts, especially at scale, dirties get messier).
  • Put an intermediate lightweight cache layer in place [memcache, direct database cache] (pro: kind of improves things, might be better in some sense than the disk garbage we're doing now; con: doesn't solve the problem, adds a lot of complexity, no other current use cases for this cache, disk garbage is pretty simple/reliable).
  • Make the cache change-based instead of state-based. The rate of changes to this cache is very low, so hosts could reasonably fetch and write an initial cache state on deploy, then just subscribe to changes to the cache. This makes the cache itself more complicated, but might position us better in the long run, at least for repo hosts. I think web hosts probably need to stay on the state-based readthrough cache plan, though? So maybe this isn't really such a hot idea.

I'm inclined to:

  • At least plan (and possibly execute) expanding the admin tier, since this is something we should do anyway for redundancy, and there shouldn't be any major issues now that secure has been stable in a similar configuration for a long time.
  • Probably put an internal LB into production and send the internal admin traffic through that, to preserve X-Forwarded-For and keep the traffic entirely in the cluster.

I'm inclined to look at a bulk or change-based cache for repo hosts, since it's a good fit for them, although I'm not excited about having multiple different cache designs.

epriestley added a revision: Restricted Differential Revision.Nov 28 2017, 3:51 PM
epriestley added a commit: Restricted Diffusion Commit.Nov 28 2017, 7:14 PM