Page MenuHomePhabricator

Build smarter CLI caching for instance data in the Phacility cluster
Closed, ResolvedPublic

Description

Currently, there's no CLI caching for instance data, which means that it gets fetched every time on script startup. Particularly for the PullLocal daemon, which runs several Phabricator subprocesses, this creates a significant amount of load on the admin tier (>1M requests/day).

This isn't really causing any concrete problems today, although I've occasionally seen what appear to be load-related failures during deploys, which I think stem from the large additional burst of API queries that deploys cause.

We should put a disk cache in front of the API call so that, e.g., repository subprocesses hit the disk cache most of the time. This cache does not need to be particularly aggressive about invalidation as long as management commands (like RestartWorker) can invalidate it explicitly.

My plan is:

  • Put a disk-based, per-instance cache in tmp/cache/cluster/<instance>.json.
  • Add it to the PhacilityServices::getClusterCache() stack.
  • Provide a way for administrative commands (deploy, restart) to purge it (by nuking the file wholesale).
  • Purge the cache when upgrading (entire host) or restarting daemons (instance only).

This should drop API load by a large factor (99%?) with no visible effect on behavior.

Revisions and Commits

Event Timeline

epriestley created this object with edit policy "Custom Policy".

It's also vaguely desirable if the bulk processes (deploy/upgrade) can explicitly prefill the disk caches after mass-purging, since a bulk prefill is significantly more efficient than doing a readthrough fill moments later when the daemons restart. This only smooths out API burstiness during deploys, though.

epriestley added a commit: Restricted Diffusion Commit.Aug 30 2016, 7:02 PM
epriestley added a commit: Restricted Diffusion Commit.

Those changes add a fairly coarse disk-based cache here for use by the CLI. I expect it to shed ~95% of the load and make the admin tier stop glowing white-hot 24/7.

It's difficult to test this convincingly in development so it may hit some rough spots when it goes to production, but the actual implementation is fairly simple and my short-term goal is just to reduce load here; we're close to 1.1M service calls a day now in Multimeter.

The biggest weirdness with this implementation that the cache is partitioned per-system-user to avoid dealing with disk permissions. I'd rather be deploying some lightweight service or exposing the APC cache to the CLI (although this has security implications and startup/initialization-order implications) but these are complex and likely don't give us any huge wins over this approach.

It would also be nice if Multimeter could record device labels and/or calling contexts for API calls. Currently, all intracluster requests have <none> as a viewer and api.whatever.whatever as a context (same as the event label), so I'm sort of guessing about the nature of this load. Multimeter would paint a clearer picture if viewer was something like instancename/hostname and context was bin/repository update.

epriestley added a commit: Restricted Diffusion Commit.Aug 31 2016, 4:05 PM

I'm planning to deploy this tomorrow morning. It also looks like the change associated with D16262 (small cache improvement) got a little lost in the switch to bin/host release and never actually deployed, so that's going out too.

rSAAScfc7ad48 (query instances by devicePHID) is also going out, and I can do some significant followups in rCORE if that sticks.

My plan here is roughly:

  • (Maybe: re-measure the ManageCapability cost?)
  • Deploy admin.
  • Test the ManageCapability cache.
  • Test instances.queryinstances by devicePHID.
  • Deploy the db tier normally.
  • Deploy repo001 (without restart?)
  • Test caching with meta.
  • Bring up repo001 fully.
  • Deploy the rest of the repo tier.
  • Deploy the web tier; test normally.
  • Monitor admin to see if we actually realize a load reduction.

I've deployed all the auxiliary tiers, admin is up next.

Rough pre-deploy cost is 3.2s spent in 111 queries for the "All Instances" page, although this wall-time cost is largely load-related and may not really be comparable to the post-deploy performance. I'm reviewing D16262 and deploying admin now.

The instance management cache seems OK, new cost is 11 queries in 80ms. The wall-time isn't directly comparable but both numbers being smaller is good.

I've deployed repo001 and the disk cache appears to be holding. I've checked a bunch of operations and logs without catching anything that looks broken, so I'm going to roll the rest of repo, then db, then web, then look for load effects on admin.

Load wasn't dropping as much as I expected, but that seems to be because someone started making 20 requests / second to list badges available on admin in the last 24 hours. Shutting that down now.

rCOREddab1d4e shut down the user who was really excited about badges. Looks like they started about 15 hours ago.

Load wasn't dropping as much as I expected, but that seems to be because someone started making 20 requests / second to list badges available on admin in the last 24 hours. Shutting that down now.

lulz, really?

Yeah -- I assume they wanted to know as soon as we create a badge, and can't afford to wait more than 50ms!

I've been fiddling with host destroy while letting things settle. Here's load on admin001 over the last week:

load_graph.png (575×915 px, 73 KB)

On the left, you can see load steadily rising (this isn't quite as bad as it looks since the load is somewhat cyclical on the workweek, and the lower values on the left side of the graph are last Saturday/Sunday -- so the tier was pretty hot, but not actually going to catch on fire if this deploy didn't work right now, at least until later in the week).

Point "A" is when the user with a great interest in badges started issuing queries, pretty much pegging the host at 100%.

Point "B" is when I began deploying the instance authority cache and new disk cache. Whatever effect these had isn't visible since the badges queries just took up the slack.

Point "C" is when I politely asked the badges user to stop making so many requests every second.

Point "D" is the new utilization, with some noise related to me doing other deployment stuff. It looks like we're steady at around 5% when zoomed in, so this appears to have worked and reduced load by roughly 95%.

As always, I'm proud to be an industry leader in the emergent field of "Just-in-Time Operations".

I caught an issue with D16464 and lazy/partial parsing of the "Host:" header on the repo tier, deploying a fix for that now.

(rSERVICES1fcb5cdb758211f6389729956d7c77f90edc2243)

This appears to be deployed and stable. I'm going to leave this open for 24 hours so I can get a reading out of Multimeter tomorrow, but looks like everything worked as intended.

I didn't switch over to devicePHID-based querying from tools yet, so they're still running API calls in a pretty inefficient way. However, the calls are way faster no so there's less immediate benefit to improving this.

The 24-hour data in multimeter shows a drop from about 1.1M requests per day to about 41,000, so that's a slightly-better-than-95% reduction in request volume. The worst sample we caught in the last 24 hours for api.almanac.service.search was 228ms, and the median was closer to 70ms.

Here's the same chart from Saturday with more data:

Screen Shot 2016-09-05 at 8.20.41 AM.png (570×910 px, 89 KB)

I'm going to consider this resolved. There's an opportunity to upgrade to devicePHID-based querying at some point in the future but that's mostly just a small quality-of-life improvement for operations.