Debugging performance issues
Closed, DuplicatePublic
Actions

Assigned To

Authored By

	joshuaspence
	May 15 2014, 8:22 AM

Description

I'm currently trying to troubleshoot some performance issues that we are experiencing with our Phabricator setup. I have a few things that I want to check, but I would appreciate any tips on getting to the bottom of this.

We recently moved our Phabricator install onto a decidated host. We now have two hosts dedicated to Phabricator, one running daemons and the other running the application itself (plus aphlict). Our install roughly consists of the following:

Ubuntu 12.04
PHP v5.3
php-fpm running in dynamic mode.
Aphlict is running on NodeJS v0.10.
Syntax highlighting with pygmentize v0.6.
Both hosts are running on c1.xl EC2 instances.

Our new host has been up for 8 days now and it seemed to be really struggling today. I noticed that all 8 cores were running at basically 100% and there were occasional 502s occurring. At the time, we had around 200 concurrent users (according to /notificaton/status/.

At one stage, the load from php-fpm dropped down momentarily, but there was two pygmentize processes running at 100%.

I need to do some more investigating, but I'm suspecting one of the following issues:

pygmentize is being called excessively and on large files (is there any cache in front of syntax highlighting?)
Aphlict leaking resources perhaps (we have had issues with NodeJS previously)
Misconfiguration of PHP/nginx.

Event Timeline

joshuaspence created this task.May 15 2014, 8:22 AM

joshuaspence assigned this task to epriestley.

joshuaspence raised the priority of this task from to Needs Triage.

joshuaspence updated the task description. (Show Details)

joshuaspence added a project: Phabricator.

joshuaspence added a subscriber: joshuaspence.

Some possible issues:

The notification server is wildly unscalable right now (T4324). However, I wouldn't normally expect this to be able to burn 8 cores. You could try disabling this for a day and see if things improve?
Pygments is incredibly slow and can be cached / disabled / better managed in many cases. We can look at doing this in the short term. I also think we don't time it out, and have seen some runaway behavior. I can lock this down immediately.
You can install XHProf and use the profiler to see where time is spent on individual pages. However, this may be of limited utility in assessing the impact of general performance issues.
You can use debug.profile-rate to collect random samples to get a sense of overall use. However, if the box is essentially maxed out a lot of wait time will probably show up in profiles, diluting their value.
You can enable the access log and check for unusually slow / high rate / unusual looking requests using the %T (and maybe %i and %o) patterns.
You can check the Conduit call logs to see if anyone has written some kind of abusive script that polls all tasks every 30 seconds or something.
We can start building other kinds of sampling tools on top of top of ServiceProfiler, etc, to provide a better view of service costs.
We can split across multiple web frontends, but we should hold off on this until exhausting other approaches.

My intuition is:

Disable the notification server for a bit and see if things clear up.
Check the access log and conduit log and see if anything jumps out at you.
I'll pursue improving our management of Pygments.
Maybe poke around XHProf but I suspect this won't be fruitful in attacking this sort of issue.
If none of those pan out, we can build a service profile aggregator.

Thanks, I forgot that Aphlict worked that way. That would definitely be causing a fair bit of traffic for us (probably more than Phabricator itself). I'll try turning it off and see what happens.

I did have debug.profile-rate set previously and noticed that most of the stuff that was being profiled was /notification/individual/, which I assume is Aphlict. Does this mean that Aphlict's polling is actually running through the whole PHP stack as opposed to communicating with the NodeJS server directly?

Thanks, I forgot that Aphlict worked that way. That would definitely be causing a fair bit of traffic for us (probably more than Phabricator itself). I'll try turning it off and see what happens.

Yes. I think the way notifications work is reasonable, but it's stuck at a local minimum right now of complete terribleness. Specifically:

Node has no knowledge of authentication or policies. This means it's very simple, which is good, and we don't have to duplicate any code or have PHP and JS implementations of anything. It also doesn't need to talk to the database, etc. It's just a dumb pipe that we dump messages into.
However, this means we can't send any real data over it, since it would be transmitted to any client claiming to be a user. Instead, we only transmit IDs: "object X has been updated".
When the browser receives an update, it asks Phabricator for the actual details, and ignores them if it doesn't have permission to see them or they aren't relevant.

All reasonable, but we also send every message to every client, so overall sending messages scales as O(N^2): each action by each user must be transmitted to every other user.

There is no technical reason for things to work like this (we have plenty of information prior to getting to Node about who should receive messages), it was just moderately simpler for student-interns to implement as a v1 a couple of years ago and has held up fine for most installs since then.

There's also no sharing in the browser, so it's actually each tab, not just each browser. We had some apparently-reasonable code to deal with this originally (letting all the tabs on a machine elect a leader using LocalConnection, so the entire machine maintained one outbound pipeline) but there was some issue or difficulty with it so it didn't make it in, and there was also nothing put in to compensate for its removal.

Lots of /notification/individual/ isn't necessarily a smoking gun for this, but definitely makes it sound like a really good bet.

D9171 fixed an issue where the Differential markup cache would not fill properly in some cases. This could lead to pygmentize running too often.

✘ Merged into T4324.

joshuaspence renamed this task from Debugging performance issues. to Debugging performance issues.Jul 10 2014, 9:03 PM

Debugging performance issuesClosed, DuplicatePublicActions

Description

Event Timeline

Debugging performance issues
Closed, DuplicatePublic
Actions