
Split Ferret engine strings for tokenization on any sequence of whitespace
Closed, Public

Authored by epriestley on Sep 8 2017, 3:08 PM.

Details

Summary

Ref T12819. Currently, strings are split only on spaces, but newlines (and, if they exist, tabs) should also split strings.

Without this, we can fail to generate the proper term boundary tokens for words that begin at the start of a line or end at the end of a line.
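
As a rough illustration of the difference (a Python sketch, not the Ferret engine's actual PHP code; both function names are hypothetical), compare splitting only on the space character with splitting on any run of whitespace:

```python
import re


def tokenize_on_spaces(value):
    """Previous behavior as described: split only on the space character."""
    return [token for token in value.split(" ") if token]


def tokenize_on_whitespace(value):
    """Proposed behavior: split on any run of spaces, newlines, or tabs."""
    return [token for token in re.split(r"\s+", value) if token]


text = "xyz\nabc"

# Splitting on spaces alone leaves the newline inside a single token.
print(tokenize_on_spaces(text))

# Splitting on any whitespace yields the two words separately.
print(tokenize_on_whitespace(text))
```

With the space-only split, "xyz" and "abc" are never seen as separate words, so neither word gets its own start or end boundary.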

Test Plan

Reindexed a document containing "xyz\nabc" and confirmed that the "yz " and " ab" term boundary tokens are generated properly.
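
The check above can be approximated with the following Python sketch. It assumes that each term is padded with one space on each side and then cut into three-character n-grams, which is inferred from the "yz " and " ab" tokens mentioned in the test plan. `term_ngrams` and its parameters are hypothetical names, not Phabricator APIs.

```python
import re


def term_ngrams(value, split_pattern):
    """Return the set of 3-character term n-grams for a string.

    Assumption: each token is wrapped in single spaces so that the
    first and last n-grams mark the term boundaries.
    """
    tokens = [t for t in re.split(split_pattern, value.lower()) if t]
    ngrams = set()
    for token in tokens:
        padded = " " + token + " "
        for i in range(len(padded) - 2):
            ngrams.add(padded[i:i + 3])
    return ngrams


document = "xyz\nabc"

# Splitting only on spaces: the newline stays inside one token, so the
# boundary n-grams "yz " and " ab" are missing.
space_only = term_ngrams(document, " ")
assert "yz " not in space_only and " ab" not in space_only

# Splitting on any whitespace: both boundary n-grams are present.
any_whitespace = term_ngrams(document, r"\s+")
assert "yz " in any_whitespace and " ab" in any_whitespace
```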

Diff Detail

Repository: rP Phabricator
Branch: ferret30
Lint: Lint Passed
Unit: Tests Passed
Build Status: Buildable 18397
Build 24770: Run Core Tests
Build 24769: arc lint + arc unit