
Slow POST/PUTs
Closed, Resolved · Public

Description

Our team is complaining that requests are timing out or taking up to 60 seconds to complete. This happens when adding or editing tasks. This is our 5th day using the service, so it's pretty important that it performs well.

Event Timeline

devinfoley updated the task description.
devinfoley added a project: Phacility Support.
devinfoley added a subscriber: devinfoley.

Thanks, we're investigating. If you have any other specific details, let us know.

(We caught up in IRC.)

I haven't found any smoking guns, but I did find some evidence that there was a transient load issue on the db shard for your instance. I'm not exactly sure what the direct cause was, but your instance is recent enough that the shard it's on is still open to new allocations, which tends to mean you get more load-related churn (from test instances coming up and down, and from new instances doing repository imports, which are resource intensive).

The shards are close to full anyway, so I'll put some new hardware in the pool and swap over the open shards. That should generally have a stabilizing effect. I'm not certain it will actually fix whatever the issue was, but it can't hurt, and it should stop the problem if it was general load bleed from other activity by instances using the same resources.

There are a couple of I/O spikes on the DB chart I'll look into once that's done too -- they're brief, but maybe we have an instance doing something sketchy/abusive.

I'm going to expand the web pool soon, too, just as a general headroom/burst-resistance measure, but load there looks good and this doesn't immediately appear to be related to any sort of web-tier resource exhaustion.

epriestley claimed this task.

Unsatisfying conclusion here, but:

  • I wasn't able to directly observe or reproduce the issue or conclusively identify a root cause.
  • My best guess, based on circumstantial evidence and gut feelings, is that this was load on the repo tier spilling over to the db tier and impacting writes from the web tier.
  • Operating under this assumption, I took steps to reduce load on the repo/db nodes for your instance, which were nearing full and would have been cycled out soon anyway (likely Saturday during weekly maintenance).
    • I closed your shards to new allocations and opened new ones (see T9581).
    • I pruned some instances and cycled processes on repo003 to ward off evil spirits.
    • Load on the shard should also drop over the next 48 hours (as test instances expire) and continue to drop over the next ~45 days (as instances created with the "New Standard Instance" button that are really test instances get pruned).
  • I looked at other instances on these nodes for indicators of abusive/sketchy behavior, but didn't immediately find anything that seems out of line.
  • I've issued a 24-hour service credit for the disruption. This should be reflected on your next invoice.
  • Please let us know if you experience further issues. Although I think the odds are in our favor that things will be stable now, I could easily be on the wrong track here or have missed something.