Split Ferret engine strings for tokenization on any sequence of whitespace
Closed · Public

Authored by epriestley on Sep 8 2017, 3:08 PM.

Details

Reviewers

chad

Maniphest Tasks

T12819: InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size

Commits

rP7ea6de6e9c9d: Split Ferret engine strings for tokenization on any sequence of whitespace

Summary

Ref T12819. Currently, strings are split only on spaces, but newlines (and, if they exist, tabs) should also split strings.

Without this, we can fail to get the proper term boundary tokens for words which begin at the start of a line or end at the end of a line.

Test Plan

Reindexed a document with "xyz\nabc", saw "yz " and " ab" term boundary tokens generate properly.
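The effect described in the summary and checked in the test plan can be sketched outside Phabricator. The snippet below is only an illustration in Python, not the code in PhabricatorFerretEngine.php: the helper names `split_terms` and `boundary_trigrams` are hypothetical, and it assumes that term boundary tokens are the 3-grams of each term after padding it with one space on each side.

```python
import re


def split_terms(value, pattern):
    """Split value on a regex pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, value) if piece]


def boundary_trigrams(terms):
    """Pad each term with one space on both sides and collect its 3-grams.

    Assumption: this mirrors how boundary tokens such as "yz " and " ab"
    arise; the real engine may differ in detail.
    """
    ngrams = set()
    for term in terms:
        padded = " " + term + " "
        for i in range(len(padded) - 2):
            ngrams.add(padded[i:i + 3])
    return ngrams


text = "xyz\nabc"

# Splitting only on spaces keeps "xyz\nabc" as a single term, so the
# newline ends up inside the n-grams instead of acting as a boundary.
old = boundary_trigrams(split_terms(text, r" +"))

# Splitting on any run of whitespace yields two terms, "xyz" and "abc".
new = boundary_trigrams(split_terms(text, r"\s+"))

print(sorted(new - old))
```

Under these assumptions, the space-only split never produces "yz " or " ab" for this input, while the whitespace split does, which matches what the test plan reports.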

Diff Detail

Repository

rP Phabricator

Branch

ferret30

Lint

Lint Passed

Unit

Tests Passed

Build Status

Buildable 18397
Build 24770: Run Core Tests
Build 24769: arc lint + arc unit

Event Timeline

epriestley created this revision. Sep 8 2017, 3:08 PM

Harbormaster completed remote builds in B18397: Diff 44616. Sep 8 2017, 3:10 PM

chad accepted this revision. Sep 8 2017, 4:07 PM

This revision is now accepted and ready to land. Sep 8 2017, 4:07 PM

Closed by commit rP7ea6de6e9c9d: Split Ferret engine strings for tokenization on any sequence of whitespace (authored by epriestley). Sep 8 2017, 4:40 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

Path

src/applications/search/ferret/PhabricatorFerretEngine.php

Size

2 lines

Diff 44616


src/applications/search/ferret/PhabricatorFerretEngine.php
