
repo002.phacility.net is heavily loaded
Closed, Resolved · Public

Description

repo002 is under unusually high load and at least one instance is seeing degraded performance on repository operations as a result.

This doesn't immediately seem to have a single cause or a quick fix: there isn't a test instance importing 200 copies of the linux repository or anything like that.

Some general stuff:

  • Repository retries (failed mirrors, failed fetches) have no real backoff, and these operations don't seem to be particularly cheap (maybe git does a lot of work locally before fetching?). There are a lot of misconfigured repositories across all active instances. We could make the PullLocal daemon backoff more quickly on failure. T7865 is related, although that may have been partially fixed by T11506.
  • Three larger instances have recently started large repository imports.
  • There's limited immediate visibility into which instances are creating load, since most of ps auxwww is not tagged with instance names. Active queue sizes can be identified like this:
$ host query --query 'SELECT COUNT(*) FROM <INSTANCE>_worker.worker_activetask' --instance-statuses up
  • Daemon queue length is a known interesting datapoint to collect in eventual monitoring (T7338); a rough sketch of collecting it follows this list.
  • host query should really imply --instance-statuses up by default or something like that.
  • One sort of crazy idea is that, in the short term, I might be able to bring up a spare repo host and just use it to work through the queues on the three hot instances faster. However, this might hurt things since repo002 is ultimately serving all the VCS requests for these queues.
  • It wouldn't really help here, but in the general case it might be nice to allocate test instances on dedicated shards. They tend to be extremely high-variance and shard resource usage would be more predictable without them. I have no evidence to suggest they're creating a meaningful amount of load in this case, but they complicate things at the least and sometimes create similar problems.
  • I could alternatively throttle down the highly active instances (or all instances on this shard) but that's not too great.
  • Generally, reducing the cost of import steps would help everything, although I'm not sure how much freedom we have to improve things easily.
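
As a rough illustration of the queue-depth datapoint T7338 wants, here's a minimal PHP sketch that reads the same <INSTANCE>_worker.worker_activetask tables as the host query command above. The instance list, credentials, and host are placeholders for illustration, not an existing tool.

<?php

// Hypothetical monitoring sketch (not an existing tool): report active
// worker queue depth per instance by counting rows in the same
// <instance>_worker.worker_activetask tables queried above.
// Instance names and connection details are placeholders.

$instances = array('exampleinstance1', 'exampleinstance2');
$pdo = new PDO('mysql:host=127.0.0.1', 'monitoring_user', 'secret');

foreach ($instances as $instance) {
  $table = sprintf('`%s_worker`.`worker_activetask`', $instance);
  $count = (int)$pdo->query('SELECT COUNT(*) FROM '.$table)->fetchColumn();
  printf("%s: %d active tasks\n", $instance, $count);
}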

Event Timeline

Also:

  • Maybe a good time to launch new shards? The last tier size increase was 10 months ago (late November 2015), when I fixed the allocation algorithm so it's no longer a huge wasteful pile of garbage that required us to constantly throw hardware at it.
  • Maybe the allocation algorithm should consider active users, not just active instances? I think the 002 shard has the most active users by a hefty margin, just by luck of the draw. But this will tend to be moot at scale.
  • Maybe start moving instances between shards? This seems hard to do without downtime today, and doing all the legwork to reduce downtime through clustering is daunting.

> I could alternatively throttle down the highly active instances (or all instances on this shard) but that's not too great.

I did this (reduced phd.taskmasters to 1 in PhacilitySiteSource locally, then restarted the shard) since it's the quickest approach I can take to try to resolve user-facing timeouts. Queues will process more slowly as a result, but the net effect on queue speed may be roughly unchanged since the heavy load was already slowing down the queues significantly. Waiting for things to come back up to see if that improved load on the host at all.
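
For context, the knob involved is Phabricator's standard phd.taskmasters option, which caps the size of the taskmaster daemon pool. The shape of the override is roughly this (a sketch only; the actual PhacilitySiteSource isn't reproduced here):

<?php

// Sketch only: the real change lives in PhacilitySiteSource, which isn't
// shown in this task. The effective override is just a lower value for the
// standard phd.taskmasters option on this shard's hosts.

$overrides = array(
  // Run a single taskmaster per daemon host so queue processing stops
  // competing with interactive VCS requests for resources.
  'phd.taskmasters' => 1,
);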

> Repository retries (failed mirrors, failed fetches) have no real backoff

I'm planning to attack this next, since it's the most straightforward fix and may be substantial. The current cost of, e.g., a fetch with invalid credentials on repo002 (under load) is ~6s, with approximately 4s spent on two git fetch operations (when the first one fails, we try to correct the origin URI and fetch again). Both are immediately rejected by the remote.

The approach is likely:

  • When deciding on the backoff after an error, look for an existing error message of the same type.
  • If we find one, examine it for a "repeat" number.
  • Multiply the backoff by the "repeat" number.
  • Increment the repeat number.
  • Also, compute this (and other pull frequencies) at startup to resolve T7865.

This should produce backoff values of 15s, 30s, 45s, 60s, etc. In the first 24 hours, we'll retry about 100 times total and end up with a backoff of about 25 minutes, which seems reasonable.
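
To sanity-check those numbers, here's a small PHP sketch of the schedule (assuming the 15-second base above; this is not the actual daemon code):

<?php

// Sketch of the linear backoff described above (15-second base assumed;
// not the actual daemon code): each repeated failure of the same type
// multiplies the base delay by the current repeat count.

$base = 15;     // seconds
$elapsed = 0;   // simulated time since the first failure
$retries = 0;
$backoff = 0;

while ($elapsed < 86400) { // first 24 hours
  $retries++;
  $backoff = $base * $retries;
  $elapsed += $backoff;
}

printf(
  "Retries in 24h: %d; final backoff: ~%d minutes\n",
  $retries,
  $backoff / 60);

// Prints 107 retries and a final backoff of ~26 minutes, in line with the
// "~100 retries, ~25 minute backoff" estimate above.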

It would also be nice to run git fetch <url> directly instead of doing git fetch origin and then trying to recover from a bad origin. I'm not sure whether this will get us into trouble.

D16574 retains update cooldowns (so we don't immediately update everything after a restart) and sets up for the following change.

D16575 adjusts backoff behavior for repositories with errors.

I've merged these to stable and am going to deploy them to db002 / repo002 shortly.

Those changes seem stable. Unclear how much they're really helping, but load is looking at least somewhat better. I'm going to dig into things a bit more and see if I can get a better sense of whether we're on the path to recovery or not.

One remaining issue with backoff/cooldown is that mirrors aren't included: a regularly-updated repository with broken mirrors still tries mirroring. This is somewhat more involved to fix, though (see T6131).

A contributing issue is that one instance appears to be making clone requests for approximately 30 repositories per minute to this shard. This is easy to rate limit, but probably not easy to rate limit in a generalized, not-a-cluster-hack way.
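
For what it's worth, the non-generalized version is simple; here's a minimal sketch of a per-instance sliding-window limit (hypothetical class, not existing Phabricator/Phacility code):

<?php

// Hypothetical sketch (not existing Phabricator/Phacility code): allow at
// most N clone/fetch requests per instance in any sliding 60-second window
// and reject the rest.

final class CloneRateLimiter {

  private $limit;
  private $requests = array(); // instance name => list of request timestamps

  public function __construct($requests_per_minute) {
    $this->limit = $requests_per_minute;
  }

  public function shouldAllow($instance) {
    $now = time();

    $history = isset($this->requests[$instance])
      ? $this->requests[$instance]
      : array();

    // Keep only requests from the last 60 seconds.
    $recent = array();
    foreach ($history as $timestamp) {
      if ($timestamp > ($now - 60)) {
        $recent[] = $timestamp;
      }
    }

    if (count($recent) >= $this->limit) {
      $this->requests[$instance] = $recent;
      return false;
    }

    $recent[] = $now;
    $this->requests[$instance] = $recent;
    return true;
  }

}

The harder, "generalized" half is presumably sharing this kind of state cleanly across the SSH/HTTP frontends and cluster hosts rather than bolting it onto one shard.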

From T11666, NEEDS_UPDATE messages are being written by the proxying intermediate host, which currently writes them incorrectly after D16575. This is wrong, but can't meaningfully be fixed without pushing the whole tier, which would fix the default column value anyway. Alternatively, I can add a default to the messageCount column with a narrower patch.

I've caught some possibly-fixable inefficiencies in bin/repository update too, but I'm not sure how fixable or effective they actually are.

We do appear to be on the road to recovery, but the queues are still draining verrrry slowly.

I don't know if this is related, but for the past hour or so, whenever I push I get an error even though the push appears to work (the commits show up in Diffusion):

$ git push
# Push received by "web.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "repo002.phacility.net"...
# Acquired read lock immediately.
# Device "repo002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "repo002.phacility.net".
Counting objects: 56, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (50/50), done.
Writing objects: 100% (56/56), 9.30 MiB | 1.84 MiB/s, done.
Total 56 (delta 32), reused 1 (delta 0)
# Released cluster write lock.
AphrontQueryException: #1364: Field 'messageCount' doesn't have a default value
To ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git
   015051e..67b218b  master -> master
error: failed to push some refs to 'ssh://hyper@vault.phacility.com/diffusion/HYPER/hyper.git'

@swisspol, that's what this refers to ("From T11666"):

> From T11666, NEEDS_UPDATE messages are being written by the proxying intermediate host, which currently writes them incorrectly after D16575. This is wrong, but can't meaningfully be fixed without pushing the whole tier, which would fix the default column value anyway. Alternatively, I can add a default to the messageCount column with a narrower patch.

I disabled the keys of the user running 30 clones/minute and notified the instance administrator. This appeared to create some headroom; I reverted the daemon pool size limits to the default and we're draining the queues at a somewhat more reasonable pace now.

I pushed a patch to add DEFAULT 0 to the messageCount column, which I think should fix the mismatched update from the proxy tier. Let me know if you're still seeing an issue.

I also identified an instance with a bad HTTP hook which had managed to queue up 8,000 tasks. These already have scaling backoffs and likely didn't contribute a meaningful amount of load, but we could perhaps find a more graceful approach to handling bad webhook URIs during T11330.

Looking at improving the raw performance of bin/repository update, one fixable issue is that we run git for-each-ref three times:

  • PhabricatorRepositoryDiscoveryEngine runs it, then immediately runs it again almost unconditionally to get an observed repository version, even if the version won't be used.
  • Later, PhabricatorRepositoryRefEngine runs it again.

The first one is easy to eliminate or reuse; the second one should be possible to reuse.

I'm not sure this is actually meaningful, but it was substantially slow under load (~1s+ for a 5K-ref repository).
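
One way to cut the duplication would be to resolve the ref list once per update and share it between consumers; a rough sketch (hypothetical helper, not the actual engine code):

<?php

// Hypothetical helper (not the actual engine code): run `git for-each-ref`
// once per update and let later consumers (discovery, version observation,
// ref updates) reuse the parsed result.

final class RepositoryRefCache {

  private $path;
  private $refs;

  public function __construct($repository_path) {
    $this->path = $repository_path;
  }

  public function getRefs() {
    if ($this->refs === null) {
      $command = sprintf(
        'git -C %s for-each-ref --format=%s',
        escapeshellarg($this->path),
        escapeshellarg('%(refname) %(objectname)'));

      $refs = array();
      $output = (string)shell_exec($command);
      foreach (array_filter(explode("\n", $output)) as $line) {
        list($name, $hash) = explode(' ', $line, 2);
        $refs[$name] = $hash;
      }

      $this->refs = $refs;
    }

    return $this->refs;
  }

}

In the real code this would presumably hang off the repository update workflow so both engines see the same snapshot of refs.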

I let the queues drain for a bit overnight. One of the large queues is down from ~16K items to ~12K items in the last 8 hours, but I don't want to just let this sit for 24 hours to "resolve" it. I'm going to look at moving instances across shards instead, which is a capability we need to automate eventually anyway.

I've moved the largest instance from repo002 to a new empty shard, which has put both shards into fairly reasonable range. Queues still aren't draining as quickly as I'd like/expect.

I identified one bug which may be leading to degraded performance, particularly while processing Herald tasks (above). I'm deploying the fix to repo002 now.

rP9329e6a12d9f improved the rate at which the Herald queue is draining by approximately 10x.

All queues have emptied, and repo002 is back in a healthy range again:

Screen Shot 2016-09-20 at 9.11.52 AM.png (576×915 px, 59 KB)

repo006 is also calm now:

Screen Shot 2016-09-20 at 9.12.38 AM.png (573×911 px, 51 KB)

Overview of changes:

  • D16574 + D16575 improved polling and retry behavior.
  • T11670 describes a functional process for load shedding by moving instances from hot shards to cool ones.
  • rP9329e6a12 fixed a substantial performance issue with diff fetching from certain types of tasks.

Future work:

  • Mirrors should have similar behavior and more granular error handling (T6131 / T5965).
  • Configuring better SSH logging and, ultimately, setting up rate limits / quotas (some vague discussion in T7593) might be helpful.
  • The update process likely has significant remaining inefficiencies (like extra for-each-ref calls), especially considering how substantial the log -n1 fix was.
  • This reinforces the value of daemon queue monitoring (T7338).
  • It would be nice to move test instance allocation to isolated shards, just to simplify situations like this one.
  • HTTP hook behavior for bad hooks is a little questionable (T11330).