Improve performance of Ferret engine ngram extraction, particularly for large input strings
ClosedPublic

Authored by epriestley on Sep 26 2017, 4:21 PM.

Details

Summary

See PHI87. Ref T12974. Splitting the string apart with array_slice() can perform poorly for large input strings. I think this is mostly down to the large number of calls, plus the not entirely trivial cost of building and returning an array on each one.

We can just use substr() instead, as long as we're a little bit careful about tracking the byte offset we're slicing at when the string contains multibyte UTF-8 characters, since substr() works on bytes rather than characters.
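
To illustrate the idea, here is a minimal sketch of the two approaches for 3-character ngrams. It is not the code from this diff: the function names (sketch_utf8_chars, sketch_trigrams_slice, sketch_trigrams_substr) are hypothetical, and it assumes the input is valid UTF-8 and that PCRE is built with UTF-8 support so that preg_split() with the /u modifier can split the string into characters.

```php
<?php

// Hypothetical helper: split a UTF-8 string into an array of characters.
// Returns an empty array if the input is not valid UTF-8.
function sketch_utf8_chars(string $s): array {
  $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
  return $chars === false ? array() : $chars;
}

// Baseline: one array_slice() call (plus an implode()) per ngram.
function sketch_trigrams_slice(string $s): array {
  $chars = sketch_utf8_chars($s);
  $ngrams = array();
  $count = count($chars) - 2;
  for ($ii = 0; $ii < $count; $ii++) {
    $ngrams[] = implode('', array_slice($chars, $ii, 3));
  }
  return $ngrams;
}

// Alternative: record the byte length of each character once, then cut
// each ngram out of the original string with substr() at a byte cursor.
function sketch_trigrams_substr(string $s): array {
  $chars = sketch_utf8_chars($s);
  $lengths = array_map('strlen', $chars);
  $ngrams = array();
  $count = count($chars) - 2;
  $cursor = 0;
  for ($ii = 0; $ii < $count; $ii++) {
    // Byte width of the three characters starting at character $ii.
    $bytes = $lengths[$ii] + $lengths[$ii + 1] + $lengths[$ii + 2];
    $ngrams[] = substr($s, $cursor, $bytes);
    // Advance by the byte width of one character, not by one byte.
    $cursor += $lengths[$ii];
  }
  return $ngrams;
}

print_r(sketch_trigrams_substr("héllo")); // "hél", "éll", "llo"
```

The part that needs care is the cursor: it advances by the byte length of the character being left behind, so a two-byte character such as "é" moves it two bytes forward. Strings of fewer than three characters never enter the loop and produce no ngrams.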

Test Plan

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I've spent a lot of time staring at this and I'm pretty convinced it works. Maybe add a few more unit tests for strings of length {0,1,2}?

This revision is now accepted and ready to land. Sep 27 2017, 5:28 PM
  • Add a couple more test cases for short strings.
This revision was automatically updated to reflect the committed changes.