Split Ferret engine strings for tokenization on any sequence of whitespace
ClosedPublic

Authored by epriestley on Sep 8 2017, 3:08 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T12819. Currently, strings are split only on spaces, but newlines (and tabs, if present) should also split strings.

Without this, we can fail to get the proper term boundary tokens for words which begin at the start of a line or end at the end of a line.

Test Plan

Reindexed a document containing "xyz\nabc" and saw the "yz " and " ab" term boundary tokens generated properly.
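
To illustrate why the split pattern matters, here is a minimal Python sketch. It is not the Ferret engine code: the helper name `boundary_trigrams` is hypothetical, and it assumes each token is padded with one space on each side before 3-character n-grams are taken. It compares splitting on spaces only against splitting on any run of whitespace.

```python
import re


def boundary_trigrams(text, pattern):
    """Hypothetical helper: split `text` with the regex `pattern`, pad each
    token with one space on each side, and collect its 3-character n-grams."""
    grams = set()
    for token in re.split(pattern, text):
        if not token:
            continue  # skip empty pieces produced by leading/trailing separators
        padded = " " + token + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


text = "xyz\nabc"

# Splitting on spaces only: the newline stays inside a single token.
spaces_only = boundary_trigrams(text, r" +")

# Splitting on any whitespace sequence: "xyz" and "abc" become separate tokens.
any_whitespace = boundary_trigrams(text, r"\s+")

print(sorted(any_whitespace - spaces_only))
```

With spaces-only splitting, the newline is embedded in n-grams such as "yz\n" and "\nab", so the boundary tokens "yz " and " ab" never appear. Splitting on any whitespace sequence produces them.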

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable