Daemons queue usage is low leading to slow imports
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	aubort
	Aug 31 2016, 3:47 PM

Description

Hello,

We're running an instance of Phabricator for swiss universities and have some problem with slow imports for large repository (1000+ commits). Here's for instance what happens, the queue has a lot of tasks but the queue usage is very low, usually around 20-30%.

Here's the result of phd log, http://pastebin.com/9BN7ncSY
The only thing I see is some errors because some people didn't correctly set up import from github. On our test instance, we don't have those error but still have the import issue.

It happens on both our production instances and test, which are running the following version,

prod
phabricator 9da15fd7ab7071e3b7361463229abd4e918d9d94
libphutil 5fd2cf9d5ddd38424a54a8fba02398d527639970
arcanist f1c45a3323ae20eefe29c0a22c7923fe8b151bbf

test
phabricator be84c051e5ba05296be4070bdfbf6f1fdd44355b
libphutil d6818e59c1764ede22cad56f8c5b1b527cb6a577
arcanist 10e5194752901959507223c01e0878e6b8312cc5

We did our install before the cluster configuration was a thing, so we did our own infra which consist of,

app instances without daemons
daemons on dedicated instance (one instance for now)
mysql galera using only one node as master and other nodes as backup (via haproxy)
repo files shared accross app instances using glusterfs fuse client

So far we have no other problem, performances are very good.

Thanks a lot for any idea,
Regards

EDIT: An other example, I just pushed a new repo of 620 commits, with 2 phd instances

Related Objects

Mentioned In: T11571: Consider benchmarking hardware during setup
Mentioned Here: T8783: Unprototype Nuance (v1)
T11569: Nuance is fataling with calls to setIcon()
rARCf1c45a3323ae: (stable) Promote 2016 Week 27
rPHU5fd2cf9d5ddd: (stable) Promote 2016 Week 28
rP9da15fd7ab70: (stable) By default, do not save queries when executing Conduit "*.search" calls
rPbe84c051e5ba: (stable) Promote 2016 Week 35
rARC10e519475290: (stable) Promote 2016 Week 35
rPHUd6818e59c176: (stable) Promote 2016 Week 35

Event Timeline

aubort created this task.Aug 31 2016, 3:47 PM

aubort updated the task description. (Show Details)Aug 31 2016, 3:51 PM

aubort updated the task description. (Show Details)Aug 31 2016, 3:57 PM

chad added a project: Diffusion.Aug 31 2016, 4:54 PM

Someone has configured a bad Nuance source, so let's start by fixing that.

Nuance is a very early prototype application (see T8783) which doesn't do anything useful yet, so it is likely that this was just someone testing things or clicking buttons. If you want to make sure, you can try to find the source in the web UI by navigating to Applications → All Applications → Nuance, then browsing to "Sources" and examining what has been created.

I fixed a bug with this (T11569) a few moments ago, so you may hit a related fatal (bad call to setIcon()). If you do, updating should fix it.

You can prevent new Sources from being created by accident by uninstalling Nuance Applications → All Applications → Nuance (Configuration) → Uninstall Application.

Assuming you're able to establish that the Source isn't important, destroy it by running this command:

This will destroy all Sources permanently, and can not be undone.

mysql> TRUNCATE phabricator_nuance.nuance_source;

(There's no less-extreme way to clean it up yet, and it doesn't implement DestrictibleInterface so you can't bin/remove destroy it.)

After using TRUNCATE to destroy all sources, restart the daemons, let them run for a few minutes, and show me the new output of bin/phd log?

I think you may have answered to the wrong task :)

Oh, no, this is the right task. Your logs:

http://pastebin.com/9BN7ncSY

...are spewing this error over and over again:

Daemon 25174 STDO [Wed, 31 Aug 2016 17:36:32 +0200] <cursor:events.repository> Polling GitHub Repository API endpoint "/repos///events".
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:32 +0200] [2016-08-31 17:36:32] EXCEPTION: (Exception) You must setAccessToken()! at [<phutil>/src/future/github/PhutilGitHubFuture.php:49]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:32 +0200] arcanist(head=stable, ref.stable=f1c45a3323ae), phabricator(head=stable, ref.stable=9da15fd7ab70), phutil(head=stable, ref.stable=5fd2cf9d5ddd)
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:32 +0200]   #0 PhutilGitHubFuture::getProxiedFuture() called at [<phutil>/src/future/FutureProxy.php:34]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:32 +0200]   #1 FutureProxy::resolve() called at [<phabricator>/src/applications/nuance/cursor/NuanceGitHubImportCursor.php:98]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:32 +0200]   #2 NuanceGitHubImportCursor::pullDataFromSource() called at [<phabricator>/src/applications/nuance/cursor/NuanceImportCursor.php:72]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:33 +0200]   #3 NuanceImportCursor::importFromSource() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTriggerDaemon.php:446]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:33 +0200]   #4 PhabricatorTriggerDaemon::updateNuanceImportCursors() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTriggerDaemon.php:396]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:33 +0200]   #5 PhabricatorTriggerDaemon::runNuanceImportCursors(integer) called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTriggerDaemon.php:106]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:33 +0200]   #6 PhabricatorTriggerDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 25174 STDE [Wed, 31 Aug 2016 17:36:33 +0200]   #7 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]
Daemon 25174 FAIL [Wed, 31 Aug 2016 17:36:33 +0200] Process exited with error 255.
Daemon 25174 WAIT [Wed, 31 Aug 2016 17:36:33 +0200] Waiting to restart process.
Daemon 25174 INIT [Wed, 31 Aug 2016 17:36:38 +0200] Starting process.

This error is not because of a misconfigured GitHub repository, it is because of a misconfigured Nuance source (note NuanceImportCursor in the stack trace).

You should fix the Nuance Source first, then we can look at your logs again and try some other things.

Ah yes sorry, I saw some other reference to Nuance today and thought it was for that.

Sooo, indeed I may have tested Nuance in april, and let this as is. I've now removed the entries from the db, and the error is gone. I've pushed a new repo. Here's the logs.

Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:49 +0200] arcanist(head=stable, ref.stable=f1c45a3323ae), phabricator(head=stable, ref.stable=9da15fd7ab70), phutil(head=stable, ref.stable=5fd2cf9d5ddd)
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:50 +0200]   #0 <#4> ExecFuture::resolvex() called at [<ph...' at [/srv/www/facebook/phabricator/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:410]
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:50 +0200] arcanist(head=stable, ref.stable=f1c45a3323ae), phabricator(head=stable, ref.stable=9da15fd7ab70), phutil(head=stable, ref.stable=5fd2cf9d5ddd)
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:50 +0200]   #0 phlog(string) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:410]
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:50 +0200]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(PhabricatorRepository, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:203]
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:51 +0200]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 25348 STDE [Wed, 31 Aug 2016 21:00:51 +0200]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]
Daemon 25354 INIT [Wed, 31 Aug 2016 21:12:05 +0200] Starting process.
Daemon 25354 DONE [Wed, 31 Aug 2016 21:12:27 +0200] Process exited normally.
Daemon 25356 INIT [Wed, 31 Aug 2016 21:17:47 +0200] Starting process.
Daemon 25358 INIT [Wed, 31 Aug 2016 21:17:52 +0200] Starting process.
Daemon 25360 INIT [Wed, 31 Aug 2016 21:17:57 +0200] Starting process.
Daemon 25362 INIT [Wed, 31 Aug 2016 21:18:06 +0200] Starting process.
Daemon 25364 INIT [Wed, 31 Aug 2016 21:18:13 +0200] Starting process.
Daemon 25356 DONE [Wed, 31 Aug 2016 21:19:55 +0200] Process exited normally.
Daemon 25366 INIT [Wed, 31 Aug 2016 21:20:09 +0200] Starting process.
Daemon 25368 INIT [Wed, 31 Aug 2016 21:21:22 +0200] Starting process.
Daemon 25364 DONE [Wed, 31 Aug 2016 21:21:49 +0200] Process exited normally.
Daemon 25358 DONE [Wed, 31 Aug 2016 21:23:32 +0200] Process exited normally.
Daemon 25370 INIT [Wed, 31 Aug 2016 21:23:36 +0200] Starting process.
Daemon 25360 DONE [Wed, 31 Aug 2016 21:25:46 +0200] Process exited normally.

Okay. Can you run this command?

phabricator/ $ ./bin/phd debug taskmaster --trace

Let that run for a minute or two and put a copy of the output up somewhere so I can look at it?

Yes, here it is.

http://pastebin.com/A0mjY6ez

...
13. >>> [4] <query> INSERT INTO `daemon_log` ...
14. <<< [4] <query> 870,858 us
...

This query took almost 900ms. If I run the same test on this server, I get:

...
>>> [4] <query> INSERT INTO `daemon_log` ...
<<< [4] <query> 2,978 us
...

Less than 3ms. In other words, write queries in your environment appear to be about 300 times slower than write queries in our environment.

I'd guess some component of your custom infrastructure is making an extreme performance tradeoff.

Okay, seems bad indeed, do you think it might explain why the queue is capped at 20-30% ?

Thanks a lot for the debug, I think we might switch to mysql replication now that it's getting supported.

I'm not entirely sure, but it seems plausible. We compute "Queue Utilization" as basically:

(Time Spent Doing Tasks) / (Time Spent Running)

I believe bookkeeping work between tasks doesn't count as "doing tasks", so it seems plausible that the daemons are spending 70-80% of their time waiting for bookkeeping writes to synchronize (or whatever else is going on with those slow queries) in between tasks.

I'd guess the tasks themselves are also executing at a glacially slow pace, but can't tell from that output on its own. You could stop the daemons, grab a task ID from the "Next In Queue" table, then run this command to execute it in isolation in the foreground:

phabricator/ $ ./bin/worker execute --id <id>

That will provide task query timings, but I suspect they'll all just also be slow and not really provide us any new information.

A possible diagnostic test is to stop the daemons, then run bin/phd debug taskmaster --trace and see if things are a lot faster. If they are, the slowdown is probably contention-related rather than latency-related. But if the intended fix is to use traditional replication anyway, it's probably not too important to understand the particulars of Galera's performance.

I'm going to close this for now since I don't plan to take any further action as the upstream; this appears to be an environmental issue (which we don't offer support for) rather than a scalability limitation (which we do).

If you fix the query performance and are still seeing issues, feel free to file a new bug describing what you run into, but try to make sure the problem reproduces in a "normal" environment first to rule out any custom infrastructure.

It's difficult for us to provide absolute performance numbers because workloads and hardware vary so widely, but Phabricator should import repositories in the thousands-of-commits range relatively quickly when configured properly. A single Phabricator host with reasonable specs should be able to handle thousands of repositories with millions of commits. If you're seeing performance that's more than an order of magnitude worse than this, there's likely some configuration or environmental issue.

(I'll file a followup to this about possibly doing benchmarking during setup, which might let us detect this sort of thing sooner.)

epriestley mentioned this in T11571: Consider benchmarking hardware during setup.Sep 1 2016, 2:24 PM

Okay, thank you for the pointers, we have indeed a huge performance improvement using standard mysql replication, which seems to increase the queue usage.

It's probably due to how galera is working, maybe we could have improved performance with galera but it's logicial to use master/slave replication with the current phabricator version.

+1 for integrated benchmarking

Daemons queue usage is low leading to slow importsClosed, ResolvedPublicActions

Description

Related Objects

Event Timeline

Daemons queue usage is low leading to slow imports
Closed, ResolvedPublic
Actions