
Improve performance of Ferret engine ngram extraction, particularly for large input strings
ClosedPublic

Authored by epriestley on Sep 26 2017, 4:21 PM.
Tags
None
Referenced Files
F12834959: D18649.id44787.diff
Thu, Mar 28, 2:54 PM
Subscribers
None

Details

Summary

See PHI87. Ref T12974. The array_slice() method of splitting the string apart can perform poorly for large input strings. I think this is mostly due to the large number of calls, plus the nontrivial cost of building and returning an array on each call.

We can just use substr() instead, as long as we're a little bit careful about keeping track of the byte offsets where we're slicing the string when it contains multibyte UTF-8 characters.
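
As a rough illustration of the idea (not the PHP code in this diff), the sketch below extracts ngrams by slicing a byte string at tracked character boundaries, which is the role substr() plays in the change. It is written in Python so it can be run standalone; the function names `utf8_char_offsets` and `ngrams_by_byte_offset` are hypothetical, and the handling of short strings (no ngrams when the input has fewer than n characters, no padding) is an assumption rather than the engine's documented behavior.

```python
def utf8_char_offsets(data: bytes) -> list[int]:
    """Return the byte offset at which each UTF-8 character starts,
    followed by one final entry for the end of the data."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a character.
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))
    return offsets


def ngrams_by_byte_offset(text: str, n: int = 3) -> list[str]:
    """Extract character ngrams by slicing the encoded bytes directly.

    Hypothetical helper: instead of splitting the string into an array of
    characters and slicing that array once per ngram, compute the character
    boundaries once and take one byte slice per ngram.
    """
    data = text.encode("utf-8")
    offsets = utf8_char_offsets(data)
    char_count = len(offsets) - 1

    grams = []
    for i in range(char_count - n + 1):
        # The slice runs from the start of character i to the start of
        # character i + n, so multibyte characters are never cut in half.
        grams.append(data[offsets[i]:offsets[i + n]].decode("utf-8"))
    return grams


if __name__ == "__main__":
    print(ngrams_by_byte_offset("abcd"))
    print(ngrams_by_byte_offset("añb☃"))
```

The boundary table is built in a single pass over the bytes, so each ngram costs one slice rather than an array operation per position. Inputs of length 0, 1, and 2 fall out of the loop bounds and yield no ngrams under the no-padding assumption above.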

Test Plan

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I've spent a lot of time staring at this and I'm pretty convinced it works. Maybe add a few more unit tests for strings of length {0,1,2}?

This revision is now accepted and ready to land.
Sep 27 2017, 5:28 PM
  • Add a couple more test cases for short strings.
This revision was automatically updated to reflect the committed changes.