Stemming + MySQL Fulltext search can lead to unexpected results
Open, Needs TriagePublic

Description

I'll start by saying that I'm not sure if this is a bug or expected behavior.

When indexing documents, you stem for the full corpus here in PhabricatorMySQLFulltextStorageEngine. This will stem This is a test as thi test. It looks like it leaves out is a because of this line: https://secure.phabricator.com/source/libphutil/browse/master/src/search/PhutilSearchStemmer.php;9d85dfab0f532d50c2343719e92d574a4827341b$16.

When compiling search queries, you tokenize the query and then stem each token as seen here: https://secure.phabricator.com/source/libphutil/browse/master/src/search/PhutilSearchQueryCompiler.php;9d85dfab0f532d50c2343719e92d574a4827341b$241. This will lead to the following query issued for This is a test.

SELECT documentPHID, MAX(fieldScore) AS documentScore FROM (SELECT document.phid AS documentPHID, IF(field.field = 'titl', 1024, 0) + MATCH(corpus, stemmedCorpus) AGAINST ('\"thi\" \"is\" \"a\" \"test\"' IN BOOLEAN MODE) AS fieldScore FROM `search_document` document JOIN `search_documentfield` field ON field.phid = document.phid JOIN `search_documentrelationship` AS `statuses` ON `statuses`.phid = document.phid AND `statuses`.relation = 'open' WHERE MATCH(corpus, stemmedCorpus) AGAINST ('\"thi\" \"is\" \"a\" \"test\"' IN BOOLEAN MODE) LIMIT 1000) query JOIN `search_document` root ON query.documentPHID = root.phid GROUP BY documentPHID ORDER BY documentScore DESC LIMIT 0, 101

Since you are storing a string that is stemmed and excludes words with less than three characters, shouldn't this query try to match against thi test rather than thi is a test?

Phabricator versions:

phabricator cda72bdb914cf8e70ad89a8f5cd9bf9fea1e6aea (Fri, Jan 13) (branched from 7276af6a81f49bbdc14ace064aab50afbeb79cfc on origin) 
arcanist 9503b941cc02be637d967bb50cfb25f852e071e4 (Fri, Jan 6) (branched from ade25facfdf22aed1c1e20fed3e58e60c0be3c2b on origin) 
phutil bf4fdf396761eb2a03da19c16386df36b9d8c56c (Fri, Jan 13) (branched from 9d85dfab0f532d50c2343719e92d574a4827341b on origin)
ftdysa created this task.Jan 20 2017, 10:56 PM

How do I reproduce unexpected results at the application level (for example, what's something I can search for that doesn't match but should match)?

(If a piece of internal behavior doesn't actually do anything bad, it probably isn't a bug.)

cda72bdb914cf8e70ad89a8f5cd9bf9fea1e6aea

As far as I can tell, this doesn't appear in the upstream, suggesting you likely have local patches. When you provide reproduction steps, please make sure they work against code we can test against!

I'll make sure to update this task with full reproduction steps on Monday.

As for the commit, I'm not sure why that isn't in the upstream. I run off
stable and don't run any custom code. I do sometimes have to do a merge
commit when i pull down changes so maybe i did something weird when we
switched from running off master to running off stable.

Sorry for the incomplete report!

  1. Create a task named This is a test
  2. Search for This is a test, nothing found
  1. Create a task named This task has longer words (all > 2 chars)
  2. Search for This task has longer words, task is found as expected

I've reproduced this here: https://test-t7vrxc7jucrk.phacility.com

chad added a subscriber: chad.Thu, Jan 26, 2:18 AM

Do you have reproduction steps that are not hypothetical? I assume you actually ran into this as a bug, so knowing the specifics is important here.

ftdysa added a comment.EditedThu, Jan 26, 3:19 PM

When I ran into this issue, I was attempting to automatically file tasks when my application encountered some unknown exception/error. I didn't want to file duplicate tasks for the same errors, so I was first going to look up the task name that followed some standard format like Error: <file> <line> .... I was testing with the string This is a test and was confused whythe conduit maniphest.search endpoint was not returning any results, but I could lookup other task names fine.

This could just be T9979 but since it appears to work if you stem the corpus and pass that into the search, it looked like a bug to me and I thought I'd report it.

I can get around this by just hashing some unique string that makes up the error and dumping that into the task description and search off of that since full-text searching for single strings works as expected.

I can get around this by just hashing some unique string that makes up the error and dumping that into the task description and search off of that since full-text searching for single strings works as expected.

You should probably do this anyway, since stemming means that "deleting notes" might find a "delete note" task and incorrectly merge into it.

If the unique string you hash is generated by mushing together all the paths and line numbers of the stack trace, your hash will also be robust to parameterized error messages. When I've done this in the past, using this strategy was effective in merging error messages like "The name you entered, "<variable user text>", is not valid. It must have exactly three uppercase letters.", which won't merge cleanly if you just use the error message because each message will have different user text.

Also, searching for "this is a test" (with quotes) looks like it works correctly. Maybe you're searching for this is a test to get around the variant-text issue?

Ideally, you could (and perhaps should) use custom fields for this, not error stack traces (e.g., a custom string field "Error Stack Trace Hash").

So I don't think this is a particularly great use case, but I do think this bug is worth fixing even if users rarely hit it in practice.

I don't think this is a stemming problem. Note that we MATCH(corpus, stemmedCorpus) against the query, so we're actually searching in the text "this is a test thi test". I wrote stemming so that it should never match fewer documents than an unstemmed query would -- definitely possible I got something wrong, but that's the intent, at least: stemming only gives you more hits.

Rather, MySQL seems to treat an "AND" term in "IN BOOLEAN MODE" which is below the minimum token length as never valid in InnoDB. Here's a simplified case:

mysql> SELECT * FROM documents WHERE MATCH(corpus) AGAINST ('zebra a' IN BOOLEAN MODE);
+----+------------------+
| id | corpus           |
+----+------------------+
|  2 | zebra a          |
|  3 | zebra is a horse |
|  4 | zebra xz         |
|  5 | zebra where      |
+----+------------------+
4 rows in set (0.00 sec)

mysql> SELECT * FROM documents WHERE MATCH(corpus) AGAINST ('zebra +a' IN BOOLEAN MODE);
Empty set (0.00 sec)

The difference between the queries is the +a in the second query.

Note that this seems to be a behavioral difference between InnoDB and MyISAM. The example above is on InnoDB. Here are the same queries on MyISAM:

mysql> SELECT * FROM documents WHERE MATCH(corpus) AGAINST ('zebra a' IN BOOLEAN MODE);
+----+------------------+
| id | corpus           |
+----+------------------+
|  2 | zebra a          |
|  3 | zebra is a horse |
|  4 | zebra xz         |
|  5 | zebra where      |
+----+------------------+
4 rows in set (0.00 sec)

mysql> SELECT * FROM documents WHERE MATCH(corpus) AGAINST ('zebra +a' IN BOOLEAN MODE);
+----+------------------+
| id | corpus           |
+----+------------------+
|  2 | zebra a          |
|  3 | zebra is a horse |
|  4 | zebra xz         |
|  5 | zebra where      |
+----+------------------+
4 rows in set (0.00 sec)

However, both engines have the same behavior if we use + and also quote the term. Here's +"a" in MyISAM:

mysql> SELECT * FROM documents WHERE MATCH(corpus) AGAINST ('zebra +"a"' IN BOOLEAN MODE);
Empty set (0.00 sec)

These behaviors feel fairly unintuitive and possibly buggy to me, but we can work around them relatively easily.

This appears to also affect words on the stopword list, even if they are longer than the minimum token length. This is a pain because we can't read the stopword list directly from MySQL in MyISAM, but still something we can navigate.

Thanks for looking further into this @epriestley and sorry for the less than ideal bug report. Please feel free to re-title/merge/close as you see fit.

For completion, it looks like my search_document table is InnoDB but the search_documentfield is MyISAM.

mysql> select table_name, engine from information_schema.tables where table_schema='phabricator_search';
+---------------------------------------------+--------+
| table_name                                  | engine |
+---------------------------------------------+--------+
| edge                                        | InnoDB |
| edgedata                                    | InnoDB |
| search_document                             | InnoDB |
| search_documentfield                        | MyISAM |
| search_documentrelationship                 | InnoDB |
| search_editengineconfiguration              | InnoDB |
| search_editengineconfigurationtransaction   | InnoDB |
| search_indexversion                         | InnoDB |
| search_namedquery                           | InnoDB |
| search_profilepanelconfiguration            | InnoDB |
| search_profilepanelconfigurationtransaction | InnoDB |
| search_savedquery                           | InnoDB |
| stopwords                                   | InnoDB |
+---------------------------------------------+--------+
13 rows in set (0.00 sec)

I am running stable and have followed along with T11922 and T11741. It looks like my server supports InnoDB fulltext search so should search_documentfield been converted over during storage upgrade?

mysql> SELECT @@ft_boolean_syntax AS syntax;
--------------
SELECT @@ft_boolean_syntax AS syntax
--------------

+----------------+
| syntax         |
+----------------+
|  |-><()~*:""&^ |
+----------------+
1 row in set (0.00 sec)

I can also confirm that "This is a test" works as expected. I could have sworn that I tested that but I guess I missed it. Woops.


As for my install having commit hashes not in the upstream ... I'm not entirely sure what I did here to screw that up.

It does look like I originally checked phabricator sources out from github, but that should not matter?

Here is most recent git log entries in phabricator.

commit cda72bdb914cf8e70ad89a8f5cd9bf9fea1e6aea
Merge: 0d33937 0426ce7
Author: xxxx <xxxx@xxxx.xxxx>
Date:   Fri Jan 13 17:48:57 2017 -0500

    Merge branch 'stable' of https://github.com/phacility/phabricator into stable

commit 0426ce73f0e63f1900f1cc285cfa1465ea72317e
Merge: 40be2d5 7276af6
Author: epriestley <git@epriestley.com>
Date:   Fri Jan 13 14:39:57 2017 -0800

    (stable) Promote 2017 Week 2

I use a basic upgrade script that just cds into each directory and runs a git pull. Each directory is running the stable branch. Should I be doing a fetch + merge instead?

Ah woops, looks like I searched for the wrong variable for InnoDB fulltext support:

mysql> SELECT @@innodb_ft_max_token_size;
ERROR 1193 (HY000): Unknown system variable 'innodb_ft_max_token_size'

It looks like my server supports InnoDB fulltext search

Can you show me the output of this command?

mysql> SELECT @@innodb_ft_max_token_size;

According to that test, your server does not support InnoDB fulltext, so what you're seeing is expected: we keep the tables with FULLTEXT indexes on MyISAM if innodb_ft_max_token_size is not a defined system variable.

I'm not entirely sure what I did here to screw that up.

Can you show me the output of this command?

$ git show 0d33937
commit 0d3393731b3b915c494cece7676eec22f2f68578
Merge: 5abccaf 9d10727
Author: xxx <xxx@phabricator-1.xxxx>
Date:   Thu Jan 5 18:04:21 2017 -0500

    Merge branch 'stable' of https://github.com/phacility/phabricator into stable

Ah, that's probably from an earlier git pull. Each of those merges have two parents, like this in the most recent output:

Merge: 5abccaf 9d10727

There, 9d10727 is part of the upstream. 5abccaf is a local change.

If you keep following the chain back with git show, you should figure out what started the divergence, which is likely a local commit.

You can also probably identify it something like this:

$ git log stable --not origin/stable

You can git reset --hard origin/stable to discard all your local commits and merges, which should stop producing merge commits when you git pull. However, you should figure out where the divergence started first, in case it was something important that you don't remember.

Yeah, I think I screwed something up when I switched from master to stable. After following the merge commits all the way down I ran into this commit:

phabricator-1:# git show 45cbc82
commit 45cbc825bb4d1f1ce8f591e812b14418df42c4c3
Merge: 2dad469 edf7c85
Author: xxx <xxx@phabricator-1.xxxxt>
Date:   Fri Sep 11 06:09:17 2015 -0400

    Stable

That merge is:

phabricator-1:# git show 2dad469
commit 2dad469eca22e344e6e35d7560e822a107aa318e
Author: epriestley <git@epriestley.com>
Date:   Mon Jun 22 11:47:58 2015 -0700

and

phabricator-1:# git show edf7c85
commit edf7c85e7a8526a2895d481895a3cf13aca9e0e4
Merge: 5125045 76665f7
Author: epriestley <git@epriestley.com>
Date:   Sat Sep 5 05:14:04 2015 -0700

    (stable) Promote 2015 Week 36

I know that we haven't yet made any local changes so I'm going to move forward with a git reset --hard origin/stable and test this week's stable. Sorry for wasting your time and thanks for the help!

I'm now running on the following after performing last week's upgrade.

phabricator 23c54262caf84abdad82974212170ff24d138f4d (Sun, Jan 22) (branched from ddf82a815b9a07e901870c2f4d5b7582af7b4d82 on origin) 
arcanist 9503b941cc02be637d967bb50cfb25f852e071e4 (Fri, Jan 6) (branched from ade25facfdf22aed1c1e20fed3e58e60c0be3c2b on origin) 
phutil 10963f771f118baa338aacd3172aaede695cde62 (Fri, Jan 13) (branched from 9d85dfab0f532d50c2343719e92d574a4827341b on origin)
# git show 2dad469
commit 2dad469eca22e344e6e35d7560e822a107aa318e
Author: epriestley <git@epriestley.com>
Date:   Mon Jun 22 11:47:58 2015 -0700

Aha! That isn't in the upstream -- the upstream commit with the same timestamp is rP0597aba3:

$ git show 0597aba3
commit 0597aba33ec18502843b50dac7bb558bba337f37
Author: epriestley <git@epriestley.com>
Date:   Mon Jun 22 11:47:58 2015 -0700

    Add hard stops on empty batch edit sets
...

I would guess someone locally amended that commit at some point.

(There's some tiny possibility that upstream force-push hyjinx were involved -- we've force-pushed a handful of times to remedy various errors which were caught quickly enough -- but that commit doesn't look very suspicious from a force-push perspective.)

I probably did something dumb when changing over from master to stable (I'm the only one that administers this instance and I know we've never modified phab's source). I should be good to go on the version front now?

Also, I tried to buy you and Chad some beers but I'm seeing the following error when I try to add a payment method via Phortune (I know it's a prototype, so feel free to just ignore).

There was an error decoding error information submitted by the client. Expected a JSON-encoded list of error codes, received: nothing.

Which is probably due to the following JS error:

ReferenceError: Stripe is not defined[Learn More]  behavior-stripe-payment-form.js:1:52

This was on this install at this URL: https://secure.phabricator.com/phortune/819/card/new/?merchantID=2&cartID=99

Yeah, you should be good now. I'll look into the JS thing since we like money.

alexmv added a subscriber: alexmv.Thu, Feb 9, 2:56 AM