Page MenuHomePhabricator

Maniphest Reports break when you have 75k tasks
Closed, ResolvedPublic

Description

If you go to https://phabricator.wikimedia.org/maniphest/report/ you get

Request aborted by debug time limit after 30 seconds.

STACK TRACE
ManiphestTask.php:153 PhabricatorStartup->onDebugTick()
ManiphestTaskQuery.php:349 ManiphestTask->attachProjectPHIDs()
PhabricatorPolicyAwareQuery.php:260 ManiphestTaskQuery->didFilterPage()
ManiphestReportController.php:409 PhabricatorPolicyAwareQuery->execute()
ManiphestReportController.php:46 ManiphestReportController->renderOpenTasks()
AphrontController.php:33 ManiphestReportController->processRequest()
index.php:103 AphrontController->handleRequest()

This was working before our Bugzilla migration, when we had less than 2000 tasks. Now we have more than 75k.

Event Timeline

qgil raised the priority of this task from to Needs Triage.
qgil updated the task description. (Show Details)
qgil added projects: Maniphest, Wikimedia.
qgil added a subscriber: qgil.

In ManiphestTaskQuery it says:

// TODO: Eventually, we should make this optional and introduce a
// needProjectPHIDs() method, but for now there's a lot of code which
// assumes the data is always populated.

Doing this would fix at least the "by user" report.
by project would still be broken as the query on the edge table refuses to use the index (which would cover all the fields needed) when more than ~ 20 tasks are in the 'IN' clause.
simply only selecting edge.dst (which is everything needed for this case) instead of edge.* would make mysql use the covering index and make the query very fast for both cases.
However this would need a change to "PhabricatorEdgeQuery" to allow for only retrieving certain fields and i'm not sure how to implement this properly as i'm not sure if something like this is done elsewhere already.
epriestly should be able to do this in 2 minutes ;)

(Please bare with me, I'm ignorant when it comes to Phabricator internals.)

Currently, when users click Reports, they trigger what is probably the most expensive query of all: https://secure.phabricator.com/maniphest/report/. Maybe they are not even interested in that "show me everything about everything" report, but on something more specific related to their specific projects. On queries that perhaps won't break even if you have more than 75k tasks in your installation?

If this assumption is correct, the problem would still be there, but at least it would become less of a crasher than now.

This comment was removed by fabe.

If the reports feature is what the user expects to find or is helpful is a whole different question.

Nonetheless i think the edge query should be fixed, as it won't use the covering index in many cases because of the '.*' and this might affect any place where an edge query is used.

T4171 is the real solution to this, and we're unlikely to put much effort into the reports page (eventually, we would intend to replace it with the system built by T4171).

I don't see a significant difference in query performance between selecting edge.* and edge.dst in this query. Both take ~1ms on this install and both appear to use a key (primary in one case; the similar <src, type, dateCreated, seq> key in the other) and examine the correct number of rows. Is this the query you're talking about?

mysql> EXPLAIN SELECT edge.* FROM `edge` edge WHERE (edge.src IN (<95 PHIDs>)) AND (edge.type IN ('41')) ORDER BY edge.dateCreated DESC, edge.seq DESC;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra                       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | edge  | range | PRIMARY,src   | PRIMARY | 70      | NULL |  131 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

Here's the plan I get for a simplified version:

mysql> EXPLAIN SELECT edge.dst FROM `edge` edge WHERE (edge.src IN (<95 PHIDs>)) AND (edge.type IN ('41'));
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|  1 | SIMPLE      | edge  | range | PRIMARY,src   | src  | 70      | NULL |  131 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

We only have ~20K edge rows on this install, but I'd expect this to start to degrade at ~20K if it has become pathological at ~300K or whatever you have. That said, I'm far from an expert in this area so I may be misunderstanding what I'm seeing.

In the general case, if you can help me understand the edge issue, we can probably reduce all edge queries to only dst selection (and, if it's an issue, likely eliminate ORDER BY clauses). I think the only blocker is that Differential uses the data field, and we'd probably need to remove that first (this may be a good idea anyway, as it hasn't proven to be a good fit in other applications and is an iffy fit in Differential).

I am willing to accept patches to make the reports page even dirtier and hackier since it's on the chopping block in the long run anyway, but not so dirty and hacky that it violates policies, which means we'd need to do needProjectPHIDs() first anyway to let you opt out of edge selection. This might be moot with T4411, which would require some edge selection to evaluate policy rules (and will probably happen before T4171).

So I think some options are:

  • Walk me through the edge issue in more detail and we may be able to fix that globally; or
  • hack the page locally (or build some simple external page which queries the database directly) to do whatever cheaper SELECTs you want instead of using ManiphestTaskQuery. This will bypass policies, but you may be OK with users pulling stats on tasks they don't have permission to see (they don't really get any sensitive data); or
  • build (or get us to build) needProjectPHIDs() and then I'm OK with upstreaming a hack for edge selection, but it's likely to regress after T4411; or
  • wait for T4171 to solve this properly.

Ok, i've created ~10k tasks in my dev installation to test this more thoroughly.
I still see the full table scan on the edge table (tested on mysql 5.1 & 5.6). E.g.:

mysql> EXPLAIN SELECT edge.* FROM `edge` edge WHERE (edge.src IN (<~1000 TASK PHIDS>)) AND (edge.type IN ('41')) ORDER BY edge.dateCreated DESC, edge.seq DESC;
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                       |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | edge  | ALL  | PRIMARY,src   | NULL | NULL    | NULL | 3795 | Using where; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

Same result if i select: edge.src, edge.type, edge.dst, edge.dateCreated, edge.seq, edge.dataID

However just leaving out the dataID yields:

EXPLAIN SELECT edge.src, edge.type, edge.dst, edge.dateCreated, edge.seq FROM `edge` edge WHERE (edge.src IN (<~1000 TASK PHIDS>)) AND (edge.type IN ('41')) ORDER BY edge.dateCreated DESC, edge.seq DESC;
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------------------+
|  1 | SIMPLE      | edge  | range | PRIMARY,src   | src  | 70      | NULL | 1101 | Using where; Using index; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------------------+
1 row in set (0.01 sec)

I cannot reproduce the covering index query on mysql 5.6 so i'll ignore that for now.
However both queries execute in nearly the same time (~40ms on my edge table with ~4000 rows) so this isn't actually the problem.
I just saw the full table scan and jumped on it as usually this is the culprit. sorry about that.

When digging deeper i found that with 10.000 open tasks in my installation it will break with a 30sec timeout.
The reason might be that some parts will select ALL open tasks and then loop through all 10.000 in didFilterPage and calling getDestinationPHIDs which will in turn loop through all 10.000 again...
I think the culprit is the loadStatus function (the counter in the menu?).

XHProf Profile
Symbol	Count	Wall Time	%
Metrics for this Call			
PhabricatorEdgeQuery::getDestinationPHIDs	10,080	66,964,690 us	100.0%
Parent Calls			
ManiphestTaskQuery::didFilterPage	10,080	66,964,690 us	
Child Calls			
array_keys	10,080	44,137 us	0.1%
array_fill_keys	20,160	10,370 us	0.0%

XHProf Profile
Symbol	Count	Wall Time	%
Metrics for this Call			
PhabricatorManiphestApplication::loadStatus	1	68,888,073 us	100.0%
Parent Calls			
PhabricatorHomeController::buildNav	1	68,888,073 us	
Child Calls			
PhabricatorPolicyAwareQuery::execute	1	68,845,735 us	99.9%

My rather unusual testcase will trigger this consistently everywhere, whereas otherwise it seems only to be triggered on the reports.
needProjectPHIDs would probably fix this for some cases. However not for all.
I'll take a look at the getDestinationPHIDs function. Maybe there's a way to simply make this faster.

If you're a little out of date, we fixed one loadStatus() problem in rPcd40694. Users should no longer load more tasks than are actually assigned to them.

T6595 is a followup to load no more than a constant number of tasks. You might be experiencing the issue it seeks to address if your test tasks are all assigned to the same test user, but we haven't seen this issue in the wild yet (specifically, WMF doesn't have 10K tasks assigned to a single user, for now).

This shouldn't have much impact on reports, though, since it should only be executed on the home page.

My dev env is up2date.
But you're right that all the tasks are assigned to one user and that causes the problem within loadStatus on every page.
However as far as i can see in the reports the getDestinationPHIDs is called for every user there is and therefore causes the same problem there.
I've rearranged the way getDestinationPHIDs works a bit and it went from 60s to a 2s load time (for loadStatus as well). Which is still not great but might be acceptable for a reports page or a user who has 10k assigned tasks.

With the patch, the reports page generates in ~500ms now on this install with 6.5K tasks, so I'd expect it to take ~6s to generate on WMF with 75K tasks if everything scales linearly (which it hopefully should). Still not great, but way better than totally broken. Let me know if you actually see that sort of runtime?

Great catch on the getDestinationPHIDs() issue, in any case.

Today we have upgraded, and we got the reports back! https://phabricator.wikimedia.org/maniphest/report/

You can still feel how the server is sweating, but after some seconds data and graphs come up. Thank you!

qgil claimed this task.