Page MenuHomePhabricator

Rebuilding repository identities throws exception for inability to identify a valid object in query "DiffusionCommitQuery"
Closed, ResolvedPublic

Description

Trying to resolve https://phabricator.example.com/config/issue/manual.identities/

$ ./bin/repository rebuild-identities --all chugs along for a while and then dies with the following:

[2019-02-27 21:25:10] EXCEPTION: (Exception) Cursor "71471" does not identify a valid object in query "DiffusionCommitQuery". at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:501]
arcanist(head=b4a302683b1aefbbb2ab9d1aaaf418b551b84b28, ref.master=25c2381959ac), phabricator(head=e6331ca8efc1f60e38f9ca13dc292f9bb00ea104, ref.master=afd2ace0dc9c), phutil(head=813a26a2d09758f3c407a0c99c6761f11f62cb90, ref.master=92413d0303ab)
  #0 PhabricatorCursorPagedPolicyAwareQuery::loadCursorObject(string) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:928]
  #1 DiffusionCommitQuery::getPagingValueMap(string, array) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:443]
  #2 PhabricatorCursorPagedPolicyAwareQuery::buildPagingClause(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:347]
  #3 PhabricatorCursorPagedPolicyAwareQuery::buildWhereClauseParts(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:529]
  #4 DiffusionCommitQuery::buildWhereClauseParts(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:337]
  #5 PhabricatorCursorPagedPolicyAwareQuery::buildWhereClause(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:140]
  #6 PhabricatorCursorPagedPolicyAwareQuery::buildStandardPageQuery(AphrontMySQLiDatabaseConnection, string) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:246]
  #7 DiffusionCommitQuery::loadPage() called at [<phabricator>/src/infrastructure/query/policy/PhabricatorPolicyAwareQuery.php:236]
  #8 PhabricatorPolicyAwareQuery::execute() called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:239]
  #9 PhabricatorCursorPagedPolicyAwareQuery::executeWithCursorPager(AphrontCursorPagerView) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorQueryIterator.php:28]
  #10 PhabricatorQueryIterator::loadPage() called at [<phutil>/src/utils/PhutilBufferedIterator.php:129]
  #11 PhutilBufferedIterator::next() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementRebuildIdentitiesWorkflow.php:46]
  #12 PhabricatorRepositoryManagementRebuildIdentitiesWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:457]
  #13 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:349]
  #14 PhutilArgumentParser::parseWorkflows(array) called at [<phabricator>/scripts/repository/manage_repositories.php:22]

Please let me know what other information is needed in order to debug this.

Event Timeline

reed created this task.Feb 28 2019, 8:32 AM
reed updated the task description. (Show Details)Feb 28 2019, 8:36 AM
reed updated the task description. (Show Details)

Here's what's happening in general:

  • rebuild-identities is iterating over every commit in order, basically doing SELECT * FROM commit ORDER BY id ASC LIMIT 100, then SELECT * FROM COMMIT WHERE id > X ORDER BY id ASC LIMIT 100, where X is the largest ID in the previous result set.
  • For some reason, when it tries to do SELECT * FROM COMMIT WHERE id > 71471 ..., it's failing to load the associated object.

The expectation is that this is mostly-impossible, and we should only ever use an ID as a cursor if we previously loaded the object. This "should" only be able to happen if the script selects page N, you very quickly delete the last object on page N while it's processing the page, and then it fails to select page N+1.

Looking at exactly what Commit #71471 is might be helpful:

SELECT * FROM phabricator_repository.repository_commit WHERE id = 71471;

Is that empty, or a valid commit?

If it's a valid commit: if you go find it in the web UI (e.g., plug the hash into the global search box), does it appear to be a perfectly normal object? Or is it unusual in some way (page 404's, commit is missing pieces or has some kind of error state)?

reed added a comment.Feb 28 2019, 9:28 PM
mysql> SELECT * FROM phabricator_repository.repository_commit WHERE id = 71471;
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
| id    | repositoryID | phid                           | commitIdentifier                         | epoch      | authorPHID | auditStatus | summary | importStatus | authorIdentityPHID | committerIdentityPHID |
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
| 71471 |           41 | PHID-CMIT-ohupojjoniypwpghvlr2 | 43034ccb9eac4ca8a655f655b726f73c5e551413 | 1438182566 | NULL       | none        |         |         1024 | NULL               | NULL                  |
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
1 row in set (0.01 sec)

Page loads and seems to be OK (it's a random commit to a temporary branch somebody was using). I've e-mailed you a screenshot in case there's something obvious I'm missing.

Thanks. That looks like a perfectly normal object to me so I'm not sure what could cause it to fail to load.

If you re-run ./bin/repository rebuild-identities --all, does it die at the same point reliably? (The script is idempotent and safe to re-run as often as you care to.)

reed added a comment.Feb 28 2019, 10:10 PM

It's now failing on this one (consistently if I keep running it):

[2019-02-28 14:08:00] EXCEPTION: (Exception) Cursor "71484" does not identify a valid object in query "DiffusionCommitQuery". at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:501]
arcanist(head=b4a302683b1aefbbb2ab9d1aaaf418b551b84b28, ref.master=25c2381959ac), phabricator(head=e6331ca8efc1f60e38f9ca13dc292f9bb00ea104, ref.master=afd2ace0dc9c), phutil(head=813a26a2d09758f3c407a0c99c6761f11f62cb90, ref.master=92413d0303ab)
  #0 PhabricatorCursorPagedPolicyAwareQuery::loadCursorObject(string) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:928]
  #1 DiffusionCommitQuery::getPagingValueMap(string, array) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:443]
  #2 PhabricatorCursorPagedPolicyAwareQuery::buildPagingClause(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:347]
  #3 PhabricatorCursorPagedPolicyAwareQuery::buildWhereClauseParts(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:529]
  #4 DiffusionCommitQuery::buildWhereClauseParts(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:337]
  #5 PhabricatorCursorPagedPolicyAwareQuery::buildWhereClause(AphrontMySQLiDatabaseConnection) called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:140]
  #6 PhabricatorCursorPagedPolicyAwareQuery::buildStandardPageQuery(AphrontMySQLiDatabaseConnection, string) called at [<phabricator>/src/applications/diffusion/query/DiffusionCommitQuery.php:246]
  #7 DiffusionCommitQuery::loadPage() called at [<phabricator>/src/infrastructure/query/policy/PhabricatorPolicyAwareQuery.php:236]
  #8 PhabricatorPolicyAwareQuery::execute() called at [<phabricator>/src/infrastructure/query/policy/PhabricatorCursorPagedPolicyAwareQuery.php:239]
  #9 PhabricatorCursorPagedPolicyAwareQuery::executeWithCursorPager(AphrontCursorPagerView) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorQueryIterator.php:28]
  #10 PhabricatorQueryIterator::loadPage() called at [<phutil>/src/utils/PhutilBufferedIterator.php:129]
  #11 PhutilBufferedIterator::next() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementRebuildIdentitiesWorkflow.php:46]
  #12 PhabricatorRepositoryManagementRebuildIdentitiesWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:457]
  #13 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:349]
  #14 PhutilArgumentParser::parseWorkflows(array) called at [<phabricator>/scripts/repository/manage_repositories.php:22]
mysql> SELECT * FROM phabricator_repository.repository_commit WHERE id = 71484;
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
| id    | repositoryID | phid                           | commitIdentifier                         | epoch      | authorPHID | auditStatus | summary | importStatus | authorIdentityPHID | committerIdentityPHID |
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
| 71484 |           41 | PHID-CMIT-oz42seiotzayiqgnexau | 170ba8ae76a5747410786c45b853e9ed231acfe7 | 1436263919 | NULL       | none        |         |         1024 | NULL               | NULL                  |
+-------+--------------+--------------------------------+------------------------------------------+------------+------------+-------------+---------+--------------+--------------------+-----------------------+
1 row in set (0.00 sec)

E-mailed you the commit again.

reed added a comment.Feb 28 2019, 10:16 PM

and now when I re-run the command again, it gets another 7 commits and fails again. So, perhaps by me viewing the commit, I kick off a job that rebuilds something on the backend?

From onsite:

  • The cursor issue is because a large number of commits with an invalid repositoryID were present in the database. This is unusual, an "unnatural" state, somewhat tricky to explain, and somewhat tricky to resolve, but should be reproducible by adding 100000 to every repositoryID on a local install with at least 100 commits, then running rebuild-identities. The data can be restored later by subtracting 100000 from each repository ID. The problem was worked around by destroying these commits.
  • The "Mangled Webserver Response" setup check may incorrectly detect that extra whitespace is being added to responses if they are actually being rejected outright by something like an LB/VPN/Auth layer. We should include the initial response bytes in this message to make this easier to identify, and probably specialize the message -- for example, by testing that it contains a large number of "Q" characters. If less than ~95% of the response body is "Q", there's a high probability that it's foreign response material, not a slightly-mangled self-test response.
  • Duo auth is sometimes raising a fatal from $control->setLabel() near line 665. This appears to possibly be related to having a disabled TOTP provider? I have a local reproduction case, in any case.
  • User feedback: literally every user was unable to to identify that the "Clickable" object item lists were clickable based on the UI.
epriestley added a comment.EditedMar 1 2019, 4:19 AM

somewhat tricky to explain, and somewhat tricky to resolve

I think this explanation isn't terribly clear, but here's an attempt to walk through what's going on here:

See T13133 for some context. On the client browser, we generally page results by using the last object the user can see on the page. For example, if page 3 has 100 results and the last result has ID 6789, the link to the next page will look like /?after=6789.

We use the last object the user can see -- not the last object we examined -- to avoid leaking information about how many objects match various queries.

However, when we're paging internally (say, to build page 4) we can't rely on visible objects. Suppose the user clicks "Next" on page 3 above, but the next 100 results after 6789 (6790 through 6889) are all invisible to them because their policies are set to "Only Alice". Internally, we must be able to issue a query for the page represented by /?after=6889 to continue examining results, even though the user could not issue this query directly and can not see object ID 6889.

We do this by switching from the real viewer to the omnipotent viewer when building and iterating cursors, e.g.:

/**
 * Get the viewer for making cursor paging queries.
 *
 * NOTE: You should ONLY use this viewer to load cursor objects while
 * building paging queries.
 *
 * Cursor paging can happen in two ways. First, the user can request a page
 * like `/stuff/?after=33`, which explicitly causes paging. Otherwise, we
 * can fall back to implicit paging if we filter some results out of a
 * result list because the user can't see them and need to go fetch some more
 * results to generate a large enough result list.
 *
 * In the first case, want to use the viewer's policies to load the object.
 * This prevents an attacker from figuring out information about an object
 * they can't see by executing queries like `/stuff/?after=33&order=name`,
 * which would otherwise give them a hint about the name of the object.
 * Generally, if a user can't see an object, they can't use it to page.
 *
 * In the second case, we need to load the object whether the user can see
 * it or not, because we need to examine new results. For example, if a user
 * loads `/stuff/` and we run a query for the first 100 items that they can
 * see, but the first 100 rows in the database aren't visible, we need to
 * be able to issue a query for the next 100 results. If we can't load the
 * cursor object, we'll fail or issue the same query over and over again.
 * So, generally, internal paging must bypass policy controls.
 *
 * This method returns the appropriate viewer, based on the context in which
 * the paging is occurring.
 *
 * @return PhabricatorUser Viewer for executing paging queries.
 */
final protected function getPagingViewer() {
  if ($this->internalPaging) {
    return PhabricatorUser::getOmnipotentUser();
  } else {
    return $this->getViewer();
  }
}

So we issue the first query (... WHERE id > 6789 LIMIT 100) as the real viewer, then switch to the omnipotent viewer to update the cursor (... WHERE id = 6889), move the cursor forward (usually -- but not always -- by just setting it to the new ID), then issue the next page query as the real user again (... WHERE id > 6889 LIMIT 100).

So far so good. The problem is:

  • We get the last object (6889) from the result set before we apply filtering. This is correct, and we have to do this: otherwise, 100+ objects in a row that were excluded by filtering would prevent us from moving the cursor forward.
  • However, we then load the object to update the cursor using a normal execute() as the omnipotent user.
  • Since we're the omnipotent user, this skips policy filtering. But it does not skip other types of filtering, including filtering because linked objects are invalid or can't load. Thus, it's possible for us to identify a last object in the result set which can not be loaded by a normal execute(). When this happens, we end up in a situation like the one above.

The fix is probably to skip all filtering on the cursor query.

This means that it needs to be possible to construct cursor values from raw/partial objects. It probably always is, but maybe not. This is also sort of conceptually bad. It would be nice to find a cleaner approach. But whatever we do sort of fundmentally has to be a mess because we need to be able to continue pushing the cursor forward even in the presence of an arbitrarily long run of broken/mangled objects which can't load fully.

The hypothetical steps above do, in fact, reproduce this:

mysql> UPDATE repository_commit SET repositoryID = repositoryID + 100000;
$ ./bin/repository rebuild-identities --all
[2019-03-07 09:36:22] EXCEPTION: (Exception) Cursor "16671" does not identify a valid object in query "DiffusionCommitQuery".
...

The first problem with short-circuiting the paging algorithm is that we have a willExecuteCursorQuery() callback with two concrete implementations:

  • PhrictionDocumentQuery needs to load content objects if part of the order is title.
  • PhabricatorRepositoryQuery needs a statistics table if part of the order is "recent commit" or "size".

We can denormalize document titles onto PhrictionDocument to fix the first case. This seems generally reasonable.

We can probably cheat our way through things in RepositoryQuery, since we can still implement reasonable behavior if the statistics table is missing rows.

Maybe a slightly different question is:

Do we need to call loadCursorObject() internally at all?

Currently, getPagingValueMap($cursor, array $keys) accepts a (string) cursor and a list of keys we need for it.

I need to double check this, but I think all implementations call $object = $this->loadCursorObject($cursor) immediately except ManiphestTaskQuery, and loadCursorObject() has no "natural" callsites.

I think we can break this up slightly and reuse the last partial object we loaded instead of querying for it again. This is probably simpler overall and reduces the number of queries we must issue.

An adjacent issue is that the current behavior of overheated queries is to return a partial result set and set a flag. The goal of this behavior is to let us show some results if we find some results, but it may be surprising if the caller doesn't handle overheated queries, since they may look like a queries which simply failed to match anything.

Since the overheating mechanism doesn't seem to have any fundamental problems, I'm going to change the behavior to throw by default, and return partial results + set a flag as a configurable behavior. This will give us better behavior for cases like bin/repository rebuild-identities --all with a heavily broken database: we'll get an exception instead of an apparent success after failing to load 10 consecutive pages of results (~1K commits).

We could further refine this by letting scripts overheat more (e.g., examine 100 pages or an unlimited number of pages before overheating), but overheating from a script context is likely indicative of some serious database consistency issue anyway, so it might be good to learn about it.

epriestley triaged this task as Normal priority.

I've poked around a fair bit locally, although I'm not totally confident I got every combination of paging columns and queries. I think anything that was missed is very likely to produce an explicit error rather than mysterious behavior, so I'm not too worried about it. Stuff to keep an eye on:

  • Explicit errors when clicking "Next Page" in obscure interfaces and/or with particular "Sort By" query orderings.
  • Maybe weird behavior from "before"/"after" in Conduit? But I think I tested this pretty thoroughly.
  • Things that would previously fail silently now overheating explicitly (after D20294). I'm planning to handle these on a case-by-case basis if they do crop up. Overheating is generally rare and usually requires an install have a sequence of about 1,100 consecutive objects where at least 90% of the objects are not visible to the viewer. This definitely happens, but not frequently.

I also checked for more leftovers in instances/, but it looks like we don't have any non-default paging there.

epriestley closed this task as Resolved.Mar 19 2019, 8:45 PM

I believe this is now resolved. See D20291 for technical details on the main issue and T13266 for a followup.


The cursor issue is because a large number of commits with an invalid repositoryID were present in the database. ...

Resolved by D20291, D20292, D20293, and D20294.

The "Mangled Webserver Response" setup check may incorrectly detect that extra whitespace is being added to responses if they are actually being rejected outright by something like an LB/VPN/Auth layer.

Resolved by D20235.

Duo auth is sometimes raising a fatal from $control->setLabel() near line 665.

Resolved by D20234.

User feedback: literally every user was unable to to identify that the "Clickable" object item lists were clickable based on the UI.

Perhaps resolved by D20238.