repo002 is under unusually high load and at least one instance is seeing degraded performance on repository operations as a result.
This doesn't immediately seem to have a single cause or a quick fix: there isn't a test instance importing 200 copies of the linux repository or anything like that.
Some general stuff:
- Repository retries (failed mirrors, failed fetches) have no real backoff, and these operations don't seem to be particularly cheap (maybe git does a lot of work locally before fetching?). There are a lot of misconfigured repositories across all active instances. We could make the PullLocal daemon back off more quickly on failure. T7865 is related, although that may have been partially fixed by T11506.
- Three larger instances have recently started large repository imports.
- There's limited immediate visibility into which instances are creating load, since most processes in ps auxwww output are not tagged with instance names. Active queue sizes can be identified like this:
  $ host query --query 'SELECT COUNT(*) FROM <INSTANCE>_worker.worker_activetask' --instance-statuses up
- Daemon queue length is a known interesting datapoint to collect in eventual monitoring (T7338).
- host query should really imply --instance-statuses up by default or something like that.
- One sort of crazy idea is that, in the short term, I might be able to bring up a spare repo host and just use it to work through the queues on the three hot instances faster. However, this might make things worse, since repo002 is ultimately serving all the VCS requests for these queues.
- It wouldn't really help here, but in the general case it might be nice to allocate test instances on dedicated shards. They tend to be extremely high-variance and shard resource usage would be more predictable without them. I have no evidence to suggest they're creating a meaningful amount of load in this case, but they complicate things at the least and sometimes create similar problems.
- I could alternatively throttle down the highly active instances (or all instances on this shard) but that's not too great.
- Generally, reducing the cost of import steps would help everything, although I'm not sure how much freedom we have to improve things easily.
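
For the backoff idea above, here is a rough sketch of what "back off more quickly on failure" could look like. This is not how PullLocal currently works and it isn't written against its code: `retry_delay`, the 15 second base, the 6 hour cap, and the 10% jitter are all made-up names and numbers, just to show the shape of capped exponential backoff per repository.

```python
import random


def retry_delay(failures, base=15.0, cap=21600.0, jitter=0.1, rng=random):
    """Seconds to wait before the next update attempt for one repository.

    `failures` is the number of consecutive failed updates. All defaults
    are illustrative assumptions, not values taken from the real daemon.
    """
    if failures <= 0:
        # Healthy repository: use the normal update interval.
        return base

    # Double the delay for every consecutive failure, up to the cap, so a
    # misconfigured repository quickly stops competing for the shard.
    exponent = min(failures, 32)  # keep the power of two bounded
    delay = min(cap, base * (2 ** exponent))

    # Add a little jitter so many broken repositories don't all retry
    # at the same moment.
    spread = delay * jitter
    return delay + rng.uniform(-spread, spread)
```

With these numbers a repository that has failed 5 times in a row waits around 8 minutes, and anything that has failed 11 or more times sits at roughly the 6 hour cap. A successful update would reset `failures` to 0.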