Improve performance of Ferret engine ngram extraction, particularly for large input strings
Summary:
See PHI87. Ref T12974. The array_slice() approach to splitting the string apart can perform poorly for large input strings. I believe this is mostly due to the sheer number of calls, plus the nontrivial cost of building and returning a new array for each slice.
We can use substr() instead, as long as we're careful to track byte offsets correctly when the string contains multibyte UTF-8 characters.
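As a rough sketch of the idea (not the exact code in this change): `extract_trigrams()` is a hypothetical name, and this assumes `phutil_utf8v()` is available to split the token into characters so we know each character's byte length.

```lang=php
<?php
// Hypothetical sketch of the substr()-based approach; not the exact patch.
function extract_trigrams($token) {
  $ngrams = array();

  // Each element is one UTF-8 character; multibyte characters span
  // several bytes, so we track a byte cursor separately.
  $characters = phutil_utf8v($token);
  $count = count($characters);

  $byte_cursor = 0;
  for ($ii = 0; $ii + 3 <= $count; $ii++) {
    // Byte length of the three characters starting at the cursor.
    $ngram_bytes =
      strlen($characters[$ii]) +
      strlen($characters[$ii + 1]) +
      strlen($characters[$ii + 2]);

    // Slice the original string directly instead of building a new
    // array with array_slice() and imploding it.
    $ngram = substr($token, $byte_cursor, $ngram_bytes);
    $ngrams[$ngram] = true;

    // Advance the cursor past the character we just consumed.
    $byte_cursor += strlen($characters[$ii]);
  }

  return array_keys($ngrams);
}
```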
Test Plan:
- Created a task with a single, unbroken blob of base64-encoded data as the description, roughly 100KB long.
- Saw indexing performance improve from ~6s to ~1.5s with the patch applied.
- Before: https://secure.phabricator.com/xhprof/profile/PHID-FILE-nrxs4lwdvupbve5lhl6u/
- After: https://secure.phabricator.com/xhprof/profile/PHID-FILE-6vs2akgjj5nbqt7yo7ul/
Reviewers: amckinley
Reviewed By: amckinley
Maniphest Tasks: T12974
Differential Revision: https://secure.phabricator.com/D18649