
Improve behaviors for "overheated" policy queries
Open, Normal, Public

Description

Details from reporter:

  • Configured things with exclusive spaces, so all existing objects are not visible to unprivileged users.
  • Logged in as an unprivileged user.
  • Got a timeout after 30 seconds.

My theory is that this is in FeedQuery, although it could be elsewhere or could be in multiple places.

Event Timeline

epriestley updated the task description. Oct 19 2016, 9:52 PM

Here are slightly cleaned-up recent timeouts from the logs:

[Wed Oct 19 20:48:18.831680 2016] /core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:118
[Wed Oct 19 20:50:59.447655 2016] /core/lib/phabricator/src/infrastructure/storage/lisk/LiskDAO.php:1433
[Wed Oct 19 21:18:01.163732 2016] /core/lib/phabricator/src/applications/phid/type/PhabricatorPHIDType.php:147
[Wed Oct 19 21:18:01.787669 2016] /core/lib/libphutil/src/parser/PhutilTypeSpec.php:1918
[Wed Oct 19 20:46:50.241065 2016] /core/lib/libphutil/src/parser/PhutilTypeSpec.php:1918
[Wed Oct 19 20:50:23.741144 2016] /core/lib/phabricator/src/infrastructure/query/PhabricatorQuery.php:24
[Wed Oct 19 20:51:14.985073 2016] /core/lib/libphutil/src/parser/PhutilTypeSpec.php:1404
[Wed Oct 19 20:58:59.953077 2016] /core/lib/phabricator/src/infrastructure/query/policy/PhabricatorPolicyAwareQuery.php:213
[Wed Oct 19 21:17:17.845140 2016] /core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:166
[Wed Oct 19 21:17:28.181119 2016] /core/lib/libphutil/src/parser/PhutilParserGenerator.php:869
[Wed Oct 19 21:17:32.745066 2016] /core/lib/phabricator/src/infrastructure/storage/lisk/LiskDAO.php:584

The TypeSpec/ParserGenerator ones are not consistent with the Feed hypothesis, since I believe Feed never hits that code; they might be coming out of Harbormaster lint/unit code. I suspect that's an unrelated issue.

The LiskDAO/DatabaseConnection/PolicyAwareQuery/Query/PHIDType ones are consistent with some sort of query issue, although none point specifically at Feed. PhabricatorPHIDType points at Feed a bit, since most Query stacks do not load handles directly while Feed does, but it isn't a smoking gun.

So none of this is particularly conclusive.

The reporting user configured debug.time-limit to get a stack, which confirms this is a Feed issue:

LiskDAO.php:608 PhabricatorStartup->onDebugTick()
LiskDAO.php:655 LiskDAO->loadFromArray()
PhabricatorApplicationTransactionCommentQuery.php:57 LiskDAO->loadAllFromArray()
PhabricatorPolicyAwareQuery.php:227 PhabricatorApplicationTransactionCommentQuery->loadPage()
PhabricatorApplicationTransactionQuery.php:99 PhabricatorPolicyAwareQuery->execute()
PhabricatorPolicyAwareQuery.php:227 PhabricatorApplicationTransactionQuery->loadPage()
PhabricatorApplicationTransactionTransactionPHIDType.php:74 PhabricatorPolicyAwareQuery->execute()
PhabricatorObjectQuery.php:143 PhabricatorApplicationTransactionTransactionPHIDType->loadObjects()
PhabricatorObjectQuery.php:63 PhabricatorObjectQuery->loadObjectsByPHID()
PhabricatorPolicyAwareQuery.php:227 PhabricatorObjectQuery->loadPage()
PhabricatorFeedStory.php:84 PhabricatorPolicyAwareQuery->execute()
PhabricatorFeedQuery.php:37 PhabricatorFeedStory->loadAllFromRows()
PhabricatorPolicyAwareQuery.php:236 PhabricatorFeedQuery->willFilterPage()
PhabricatorCursorPagedPolicyAwareQuery.php:177 PhabricatorPolicyAwareQuery->execute()
PhabricatorApplicationSearchEngine.php:974 PhabricatorCursorPagedPolicyAwareQuery->executeWithCursorPager()
PhabricatorHomeMainController.php:203 PhabricatorApplicationSearchEngine->executeQuery()
PhabricatorHomeMainController.php:85 PhabricatorHomeMainController->buildFeedPanel()
PhabricatorHomeMainController.php:34 PhabricatorHomeMainController->buildMainResponse()
AphrontApplicationConfiguration.php:257 PhabricatorHomeMainController->handleRequest()
AphrontApplicationConfiguration.php:169 AphrontApplicationConfiguration->processRequest()
index.php:17 AphrontApplicationConfiguration->runHTTPRequest()

Broadly, most applications can execute Spaces queries cheaply by querying ... AND spacePHID IN (<list of spaces the viewer can see>) when Spaces are configured, which is efficient until some install decides to create 30,000 Spaces for some reason.

Feed (and probably some other applications we just haven't hit yet) doesn't have a spacePHID column on the primary table. In Feed, we query stories, then policy-check them at the application level. Although the policy checks can perform Spaces queries efficiently, this can still lead to a situation where we load the first page of results, filter it all out, load the next page, filter it all out, and so on, until we hit the 30-second kill limit.
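As a rough illustration (plain Python, not the actual PHP), here's the shape of that degenerate loop. The `fetch_page` callback and `can_see` policy check are hypothetical stand-ins for the storage query and the application-level policy filter:

```python
# Illustrative sketch of the degenerate "load, filter, repeat" loop that
# application-level policy filtering can fall into. fetch_page and can_see
# are hypothetical stand-ins for the storage query and the policy check.

def load_visible_page(fetch_page, can_see, page_size):
    """Collect up to page_size visible rows, fetching raw pages as needed."""
    results = []
    cursor = 0
    pages_fetched = 0
    while len(results) < page_size:
        raw = fetch_page(cursor, page_size)
        if not raw:
            break  # storage exhausted
        pages_fetched += 1
        cursor += len(raw)
        results.extend(row for row in raw if can_see(row))
        # With no cap here, a viewer who can see nothing churns through
        # every row in the table until the 30-second kill limit fires.
    return results, pages_fetched
```

With a viewer who can see nothing and 1,000 stored rows, this fetches all ten pages and returns nothing; with millions of rows, it runs until killed.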

Some things we can do to mitigate this, in roughly ascending order of complexity:

  • Add an "and it has to have happened in the last 7 days" clause to homepage feed.
  • As we load result pages without finding enough rows, start loading them in larger chunks. Currently, we're very conservative about the number of rows we pull when filling a result set, but should possibly start pulling more: 100, then 200, then 400, or similar. These degenerate cases would be more likely to resolve with a more aggressive fetch strategy.
  • Add a soft "examine no more than 1,000 rows" sort of limit to PolicyAwareQuery. Maybe in general? What does the error behavior look like here?
  • Denormalize spacePHID into Feed.
    • We could update it when objects shift between spaces (I believe this would be cheap to update), although this isn't exactly a general fix.
    • We also don't really need to update it, since the cost of having the wrong value is relatively small.
    • This should almost certainly happen, but probably requires a big migration? Maybe it's actually fine to add the row without setting it correctly, since few installs are likely to be affected.
epriestley renamed this task from "Feed does not perform well in the presence of exclusive spaces?" to "Feed does not perform well in the presence of exclusive spaces". Oct 19 2016, 10:11 PM

This is also a potential problem in the general case, although the general case of this is very rare.

Specifically, if you don't use Spaces and just set every object to have some bizarre, obscure, rarely-satisfied policy, we can hit the same issue in any application. For example, suppose an install has a million tasks set to a policy we can only filter in the application layer, like "Visible To: Users with 3 consecutive vowels in their usernames", and a normal user who does not satisfy that condition loads the objects. They'll hit this same "load, discard, load, discard" loop.

In theory, you should probably use Spaces for applying restrictive visibility policies to very large numbers of objects in all cases. But sooner or later we'll almost certainly find a case where users haven't done this or don't want to or can't for whatever reason.

The best behavior here is probably this:

  • Load 100, 200, 300, 400 rows (assuming the requested page size is 100).
    • (The optimal number of rows to load the first time is probably slightly more than 100 but that optimization is not really related here.)
  • At this point, we've examined 1,000 rows. If we still don't have a full page because of application-level policy filtering (we may either have zero results or some number of results, but fewer than the requested page size), stop loading additional pages.
  • Render a message after the results, something like the following, maybe in two flavors (one for "no results", one for "fewer than requested results"):

Most objects matching your query are not visible to you, so filtering results takes a long time. Only some results are shown. Refine your query to find results more quickly, or continue to the next page to continue loading more results.

  • Pagination works normally/correctly, so you can go to the next page, even if the current page has no results.
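A minimal sketch of that behavior in Python (the constant, function names, and message wording here are assumptions for illustration, not the real Phabricator API):

```python
# Hypothetical sketch of the proposed behavior: fetch 100, then 200, then
# 300, then 400 rows; after examining 1,000 rows without filling the page,
# stop and mark the result set as "overheated".

OVERHEAT_ROW_LIMIT = 1000  # soft cap on rows examined per page

def execute_with_overheating(fetch_rows, can_see, page_size=100):
    results = []
    examined = 0
    cursor = 0
    chunk = page_size
    overheated = False
    while len(results) < page_size:
        raw = fetch_rows(cursor, chunk)
        if not raw:
            break  # storage exhausted; short page, but not overheated
        examined += len(raw)
        cursor += len(raw)
        results.extend(r for r in raw if can_see(r))
        if len(results) < page_size and examined >= OVERHEAT_ROW_LIMIT:
            overheated = True
            break
        chunk += page_size  # 100, 200, 300, 400, ...
    return results[:page_size], overheated

def overheat_message(results):
    # Two flavors of the hint: "no results" vs. "fewer than requested".
    if not results:
        return ('No results are visible to you. Refine your query, or '
                'continue to the next page to keep searching.')
    return ('Only some results are shown because filtering is slow. Refine '
            'your query, or continue to the next page for more results.')
```

With an all-invisible result set, this examines exactly 100 + 200 + 300 + 400 = 1,000 rows, then stops and reports overheating instead of looping until the kill limit.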

I think this behavior is very good and represents a good general solution to this class of issue, but implementing it is nontrivial -- mostly for API/UI reasons, not because of any real technical difficulty. On the contrary, the technical side should already be on very firm footing: cursor-based pagination is already widespread and solid.

The API/UI stuff may also not really be that bad. The issue I'm imagining is this sort of thing:

  • Applications do some random query with a limit.
  • The new "examine max rows" logic applies and the query returns no results.
  • How does the application distinguish between "no results because there are no results" and "no results because of the soft limit"?
  • How does it show this new hint in a way that doesn't require us to copy/paste stuff all over the place?

But maybe this is actually very rare and never matters in practice:

  • All ApplicationSearch stuff can be trivially fixed.
  • Anything not using ApplicationSearch yet almost certainly isn't pressing, and will update eventually anyway.
  • Applications don't usually issue their own queries with limits outside of ApplicationSearch.
  • When they do, the objects are usually guaranteed to be visible (for example, lease/log information from Drydock or Calendar Imports).

So maybe this actually boils down to:

  • PolicyAwareQuery just implements this logic.
  • We return no/fewer results via the existing channel.
  • A new side channel allows callers to ask "Did I get fewer results back because of a soft limit on application filtering?"
  • ApplicationSearch uses that side channel.
  • Nothing else does (or maybe Conduit gets a small hint about it); maybe that's 100% fine for a long time.
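Sketched in Python (the class and method names are hypothetical translations of the PHP originals, not the real API), the side channel might look like:

```python
# Sketch of the "side channel" idea: the query returns results through the
# normal channel, and callers that care (like ApplicationSearch) can ask
# whether a soft limit truncated filtering. All names are illustrative.

class PolicyAwareQuery:
    def __init__(self):
        self._is_overheated = False

    def execute(self):
        # Existing channel: callers just get the (possibly short) results.
        results, self._is_overheated = self._load_and_filter()
        return results

    def get_is_overheated(self):
        # New side channel: only callers that opt in ever look at this.
        return self._is_overheated

    def _load_and_filter(self):
        # Subclasses implement actual loading and policy filtering.
        raise NotImplementedError
```

Callers that ignore the accessor behave exactly as before, which is what makes this change safe to roll out incrementally.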

I'm inclined to try implementing that rather than any of the stopgaps and see if I can get away with it, since it puts us on tremendously more solid footing if the situation is roughly in line with the complexity described above.

> I think this behavior is very good

Specifically, it's "very good" from a technical point of view. It's not all that great from a user / API consumer point of view, but I think it's reasonable, better than fataling, and about the best we can do in the general case.

Pagination isn't quite as simple as I'd remembered. Specifically, we do not expose internal pagination cursors to viewers today because doing so permits an information discovery attack that goes like this:

  • You want to discover the title of task T123, but cannot see it.
  • You create a task named M, then run the query ?after=123&sort=title.
  • Today, this will fail ("Invalid paging cursor."), but suppose we let it work.
  • If your M task is visible in the results, T123 has a title earlier in the alphabet than M.
  • If your M task is not visible in the results, T123 has a title later in the alphabet than M.
  • Make a task named G or U, repeat the process, etc.
  • Each request lets you discover one additional bit of information from the task title. You can also often guess most of the words in the title once you have the first few letters, so I think it's plausible to discover most tasks' complete titles in ~500 queries.
  • This will leave an audit record a mile wide, since you have to create a task for every query, so it's not useful for some types of attackers; but it is useful for others, and it unquestionably discloses private information.
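To make the bit-per-request arithmetic concrete, here is a simulation of the attack in plain Python. The oracle stands in for one paging query ("does my marker task sort after the hidden task?"), and only lowercase titles are modeled:

```python
# Simulation of the information-discovery attack: each probe answers one
# yes/no question (one bit), so a binary search over the alphabet recovers
# a hidden title letter by letter. No Phabricator APIs are involved; the
# oracle simulates "is my marker task visible past the hidden task's cursor".

import string

def recover_prefix(oracle, length):
    """Recover `length` lowercase characters of a hidden title, given an
    oracle answering "does this marker title sort strictly after the hidden
    title?". Assumes the hidden title has at least `length` characters."""
    letters = string.ascii_lowercase
    prefix = ''
    for _ in range(length):
        # Binary search for the largest letter c with hidden >= prefix + c.
        lo, hi = 0, len(letters) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if oracle(prefix + letters[mid]):
                hi = mid - 1  # hidden title sorts before this marker
            else:
                lo = mid      # hidden title sorts at or after this marker
        prefix += letters[lo]
    return prefix
```

Binary search over 26 letters costs at most 5 probes per character, and real attackers do even better by guessing whole words once the first letters are known.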

D16735 implements a basic version of the mechanism described above (now called "overheating"), but without paging. It also doesn't do the "100, 200, 300, 400" progression, since that's entangled with paging.

This should make the worst case scenario much less bad, but isn't as user-friendly / API-friendly as it could be with a more functional pager, and efficiency could be improved with the limit progression.

I think we can implement the pager in most cases like this:

  • Allow sort orders to be marked as "safe" (meaning: for this sort, using internal paging does not disclose private information).
  • Queries probably need to be able to make additional adjustments to "safety", e.g. Maniphest's "Group By" is not a sort order per se but is disclosure-unsafe.
  • Mark all "id" orders as safe. Some other orders (e.g., creation date) are likely OK to mark safe too.
  • When an order is "safe", allow (and use) internal cursors to page it.
  • When a query overheats, provide "...or click next page to continue searching" guidance to users if we generated a cursor.
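A tiny sketch of that safe-order gate (the table contents and function names here are illustrative assumptions, not Phabricator's actual data):

```python
# Sketch of the "safe orders" idea: each sort order is marked safe or
# unsafe for cursor disclosure, and external paging cursors are only
# generated (and honored) for safe orders. Data is illustrative.

SORT_ORDERS = {
    'id': {'safe': True},       # IDs disclose nothing interesting
    'created': {'safe': True},  # creation date is likely OK too
    'title': {'safe': False},   # title order enables the bisection attack
}

def external_cursor_for(order, internal_cursor):
    """Return a cursor the client may see, or None if the order is unsafe."""
    if SORT_ORDERS.get(order, {'safe': False})['safe']:
        return internal_cursor
    return None
```

An unknown order defaults to unsafe, which matches the "adjustments to safety" point above: anything a query can't vouch for stays undisclosed.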

This stuff is complex and I think this situation is very rare, so I don't expect to get to it any time soon. As a user or API client, you won't notice much change when we do eventually support that fancier behavior.

Overall, D16735 should make the bad case here way, way less bad (no longer unboundedly bad, and likely something in the realm of "pretty reasonable"). Everything else is eventual refinement to slightly improve rare edge cases.

epriestley renamed this task from "Feed does not perform well in the presence of exclusive spaces" to "Improve behaviors for 'overheated' policy queries". Oct 20 2016, 5:08 PM
epriestley removed a project: Feed.