Split Ferret engine strings for tokenization on any sequence of whitespace
7ea6de6e9c9d

Description

Split Ferret engine strings for tokenization on any sequence of whitespace

Summary:
Ref T12819. Currently, strings are split only on spaces, but newlines (and, if they exist, tabs) should also split strings.

Without this, we can fail to get the proper term boundary tokens for words which begin at the start of a line or end at the end of a line.

Test Plan: Reindexed a document with "xyz\nabc", saw "yz " and " ab" term boundary tokens generate properly.
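
The effect described in the summary and test plan can be sketched as follows. This is a minimal illustration in Python, not the PHP change made in PhabricatorFerretEngine.php: the function names `tokenize` and `term_ngrams` and the `any_whitespace` flag are hypothetical, and it assumes that term tokens are trigrams taken from each word after padding it with one space on each side, which is consistent with the "yz " and " ab" tokens mentioned in the test plan.

```python
import re


def tokenize(value, any_whitespace=True):
    """Split a string into words (hypothetical helper).

    any_whitespace=False mimics the old behavior: split on spaces only.
    any_whitespace=True splits on any run of whitespace (spaces, tabs, newlines).
    """
    pattern = r"\s+" if any_whitespace else r" +"
    return [token for token in re.split(pattern, value) if token]


def term_ngrams(value, any_whitespace=True):
    """Return the set of trigrams for each word, padded with boundary spaces."""
    ngrams = set()
    for token in tokenize(value, any_whitespace):
        padded = " " + token + " "
        for i in range(len(padded) - 2):
            ngrams.add(padded[i:i + 3])
    return ngrams


# Splitting on spaces only: the newline stays inside a single token, so the
# boundary trigrams around it contain "\n" instead of a space.
old = term_ngrams("xyz\nabc", any_whitespace=False)

# Splitting on any whitespace: each word gets its own boundary trigrams.
new = term_ngrams("xyz\nabc")
```

With the space-only split, the word ending at the line break yields "yz\n" rather than "yz ", so a query for a term ending in "yz" would not find its boundary token; splitting on any whitespace produces both "yz " and " ab".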

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T12819

Differential Revision: https://secure.phabricator.com/D18579

Details

Provenance

epriestley	Authored on Sep 8 2017, 3:06 PM
epriestley	Pushed on Sep 8 2017, 4:39 PM

Reviewer

chad

Differential Revision

D18579: Split Ferret engine strings for tokenization on any sequence of whitespace

Parents

rP4cae4a3b767f: Correct `bin/storage analyze` internal API for cluster environments

Branches

Unknown

Tags

Unknown

Tasks

T12819: InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size

Build Status

Buildable 18400
Build 24775: Run Core Tests

Event Timeline

epriestley committed rP7ea6de6e9c9d: Split Ferret engine strings for tokenization on any sequence of whitespace (authored by epriestley). Sep 8 2017, 4:39 PM

epriestley added a task: T12819: InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size.

Harbormaster completed building B18400: rP7ea6de6e9c9d: Split Ferret engine strings for tokenization on any sequence of whitespace. Sep 8 2017, 4:41 PM

Changes (1)

Path

src/applications/search/ferret/PhabricatorFerretEngine.php