I'll start by saying that I'm not sure if this is a bug or expected behavior.
When indexing documents, you stem the full corpus in PhabricatorMySQLFulltextStorageEngine. This stems "This is a test" as "thi test". It looks like "is" and "a" are left out because of this line: https://secure.phabricator.com/source/libphutil/browse/master/src/search/PhutilSearchStemmer.php;9d85dfab0f532d50c2343719e92d574a4827341b$16.
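To make the indexing side concrete, here is a rough Python sketch of the behavior I'm describing. It is not the libphutil code: `toy_stem`, `stem_corpus`, and `MIN_TOKEN_LENGTH` are names I made up, and the "stemmer" only strips a trailing "s", which happens to reproduce "this" becoming "thi".

```python
import re

# Assumption: mirrors the length check described above (tokens shorter
# than three characters are dropped from the stemmed corpus).
MIN_TOKEN_LENGTH = 3


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing "s"."""
    if len(token) > 2 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stem_corpus(text: str) -> str:
    """Lowercase, tokenize, drop short tokens, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [toy_stem(t) for t in tokens if len(t) >= MIN_TOKEN_LENGTH]
    return " ".join(kept)


print(stem_corpus("This is a test"))  # prints: thi test
```

The point is only that the short tokens never reach the stored stemmed corpus.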
When compiling search queries, you tokenize the query and then stem each token, as seen here: https://secure.phabricator.com/source/libphutil/browse/master/src/search/PhutilSearchQueryCompiler.php;9d85dfab0f532d50c2343719e92d574a4827341b$241. This leads to the following query being issued for "This is a test":
SELECT documentPHID, MAX(fieldScore) AS documentScore
FROM (
  SELECT document.phid AS documentPHID,
    IF(field.field = 'titl', 1024, 0) + MATCH(corpus, stemmedCorpus) AGAINST ('\"thi\" \"is\" \"a\" \"test\"' IN BOOLEAN MODE) AS fieldScore
  FROM `search_document` document
  JOIN `search_documentfield` field ON field.phid = document.phid
  JOIN `search_documentrelationship` AS `statuses` ON `statuses`.phid = document.phid AND `statuses`.relation = 'open'
  WHERE MATCH(corpus, stemmedCorpus) AGAINST ('\"thi\" \"is\" \"a\" \"test\"' IN BOOLEAN MODE)
  LIMIT 1000
) query
JOIN `search_document` root ON query.documentPHID = root.phid
GROUP BY documentPHID
ORDER BY documentScore DESC
LIMIT 0, 101
Since you are storing a string that is stemmed and excludes words with fewer than three characters, shouldn't this query try to match against "thi test" rather than "thi is a test"?
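Here is a rough Python sketch of the mismatch I mean on the query side. Again, this is not the PhutilSearchQueryCompiler code: `compile_query`, the `drop_short_tokens` flag, and the toy stemmer are all hypothetical, and the sketch says nothing about how MySQL actually scores the resulting terms.

```python
import re

# Assumption: same minimum length the indexer appears to apply.
MIN_TOKEN_LENGTH = 3


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing "s"."""
    if len(token) > 2 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def compile_query(query: str, drop_short_tokens: bool) -> str:
    """Tokenize, optionally drop short tokens, stem, and quote each term."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if drop_short_tokens:
        tokens = [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH]
    return " ".join('"%s"' % toy_stem(t) for t in tokens)


# What the query looks like today, per the SQL above.
print(compile_query("This is a test", drop_short_tokens=False))
# prints: "thi" "is" "a" "test"

# What I would have expected, given how the corpus is stored.
print(compile_query("This is a test", drop_short_tokens=True))
# prints: "thi" "test"
```

With the filter applied, the query terms line up with what the indexer stores for the same text.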
Phabricator versions:
phabricator cda72bdb914cf8e70ad89a8f5cd9bf9fea1e6aea (Fri, Jan 13) (branched from 7276af6a81f49bbdc14ace064aab50afbeb79cfc on origin)
arcanist 9503b941cc02be637d967bb50cfb25f852e071e4 (Fri, Jan 6) (branched from ade25facfdf22aed1c1e20fed3e58e60c0be3c2b on origin)
phutil bf4fdf396761eb2a03da19c16386df36b9d8c56c (Fri, Jan 13) (branched from 9d85dfab0f532d50c2343719e92d574a4827341b on origin)