
Slow POST/PUTs
Closed, Resolved · Public

Description

Our team is complaining that requests are timing out or taking up to 60 seconds to complete. This happens when adding or editing tasks. This is our 5th day using the service, so it's pretty important that it performs well.

Event Timeline

devinfoley updated the task description.
devinfoley added a project: Phacility Support.
devinfoley added a subscriber: devinfoley.

Thanks, we're investigating. If you have any other specific details, let us know.

(We caught up in IRC.)

I haven't found any smoking guns, but I did find some evidence that there was a transient load issue on the db shard for your instance. I'm not exactly sure what the direct cause was, but your instance is recent enough that the shard it's on is still open to new allocations, which tends to mean you get more load-related churn (from test instances coming up and down, and from new instances doing repository imports, which are resource intensive).

The shards are close to full anyway, so I'll put some new hardware in the pool and swap over the open shards. That should generally have a stabilizing effect. I'm not certain it will actually fix whatever the issue was, but it can't hurt, and it should stop the problem if it was general load bleed from other activity by instances using the same resources.

There are a couple of I/O spikes on the DB chart I'll look into once that's done too -- they're brief, but maybe we have an instance doing something sketchy/abusive.

I'm going to expand the web pool soon, too, just as a general headroom/burst-resistance measure, but load there looks good and this doesn't immediately appear to be related to any sort of web-tier resource exhaustion.

epriestley claimed this task.

Unsatisfying conclusion here, but:

  • I wasn't able to directly observe or reproduce the issue or conclusively identify a root cause.
  • My best guess, based on circumstantial evidence and gut feelings, is that this was load on the repo tier spilling over to the db tier and impacting writes from the web tier.
  • Operating under this assumption, I took steps to reduce load on the repo/db nodes for your instance, which were nearing full and would have been cycled out soon anyway (likely Saturday during weekly maintenance).
    • I closed your shards to new allocations and opened new ones (see T9581).
    • I pruned some instances and cycled processes on repo003 to ward off evil spirits.
    • Load on the shard should also drop over the next 48 hours (as test instances expire) and continue to drop over the next ~45 days (as instances created with the "New Standard Instance" button that are really test instances get pruned).
  • I looked at other instances on these nodes for indicators of abusive/sketchy behavior, but didn't immediately find anything that seems out of line.
  • I've issued a 24-hour service credit for the disruption. This should be reflected on your next invoice.
  • Please let us know if you experience further issues. Although I think the odds are in our favor that things will be stable now, I could easily be on the wrong track here or have missed something.