(stable) Improve Ferret engine indexing performance for large blocks of text
Summary:
See PHI87. Ref T12974. Currently, we do a lot more work here than we need to: we call phutil_utf8_strtolower() on each token, but we can call it once, up front, on the whole block.
Additionally, since ngrams don't care about order, we only need to convert unique tokens into ngrams. This saves some phutil_utf8v() calls, which can be slow for large inputs.
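As a rough illustration of the two changes above (not the actual PHP change), here is a minimal Python sketch. The function names, whitespace tokenization, space padding, and ngram size of 3 are all assumptions made for the example; str.lower() stands in for phutil_utf8_strtolower(), and the per-token slicing stands in for the work phutil_utf8v() does.

```python
def ngrams_per_token(text, n=3):
    """Before (sketch): lowercase and ngram every token, repeats included."""
    grams = set()
    for token in text.split():
        token = token.lower()      # one lowercase call per token
        padded = f" {token} "      # assumed padding, for illustration only
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def ngrams_unique_tokens(text, n=3):
    """After (sketch): lowercase once, then ngram each unique token once."""
    lowered = text.lower()                 # single lowercase call on the whole block
    unique_tokens = set(lowered.split())   # order is irrelevant, so drop repeats
    grams = set()
    for token in unique_tokens:
        padded = f" {token} "
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams
```

Both functions return the same set of ngrams; the second does less work when the input repeats tokens, which is typical for large blocks of text.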
Test Plan:
- Created a ~4MB task description.
- Ran bin/search index Txxx --profile ... to profile indexing performance before and after the change.
- Saw total runtime drop from 38s to 9s.
- Before: https://secure.phabricator.com/xhprof/profile/PHID-FILE-wiht5d7lkyazaywwxovw/
- After: https://secure.phabricator.com/xhprof/profile/PHID-FILE-efxv56q2hulr6kjrxbx6/
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T12974
Differential Revision: https://secure.phabricator.com/D18647