
Don't let stemming reduce a word beneath 3 characters

Authored by epriestley on Dec 6 2016, 4:31 PM.



Ref T11922. Porter stems "DNS" (an acronym for "Domain Name Syrup") into "dn", which is meaningless and too short to index.

Don't let stemming make an indexable token un-indexable by shortening it: if the stem is too short, just return the normalized input.
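
The fallback described above can be sketched roughly as follows. This is an illustrative Python sketch, not the libphutil code: `MIN_TOKEN_LENGTH`, `normalize`, `stem_token`, and `toy_stemmer` are hypothetical names, the 3-character threshold is taken from the revision title, lowercasing stands in for whatever normalization the real stemmer applies, and the toy stemmer is a placeholder for Porter.

```python
from typing import Callable

# Assumed threshold, taken from the revision title ("beneath 3 characters").
MIN_TOKEN_LENGTH = 3


def normalize(token: str) -> str:
    # Assumption: normalization is trimming plus lowercasing.
    return token.strip().lower()


def stem_token(token: str, stemmer: Callable[[str], str]) -> str:
    """Stem a token, but never shorten it below MIN_TOKEN_LENGTH."""
    normalized = normalize(token)
    stem = stemmer(normalized)
    # If stemming produced something too short to index, keep the
    # normalized input instead of the stem.
    if len(stem) < MIN_TOKEN_LENGTH:
        return normalized
    return stem


def toy_stemmer(word: str) -> str:
    # Placeholder for a real Porter stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


print(stem_token("DNS", toy_stemmer))
print(stem_token("Tokens", toy_stemmer))
```

With the placeholder stemmer, "DNS" would stem to "dn", so the guard returns the normalized token "dns" instead, while "Tokens" still stems normally to "token".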

(I believe there are very few legitimate English words that have two-letter roots, anyway.)

Test Plan

Added unit tests.

Diff Detail

rPHU libphutil
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley updated this revision to Diff 40895. Dec 6 2016, 4:31 PM
epriestley retitled this revision to Don't let stemming reduce a word beneath 3 characters.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad accepted this revision. Dec 6 2016, 4:42 PM
chad edited edge metadata.
This revision is now accepted and ready to land. Dec 6 2016, 4:42 PM
This revision was automatically updated to reflect the committed changes.