
Improve UTF8StringTruncator behavior for huge inputs
ClosedPublic

Authored by epriestley on Oct 26 2015, 2:53 PM.
Tags
None
Subscribers
None

Details

Summary

Fixes T9632. Currently, truncating a very big input (like a huge paste) into a very small output (like a snippet of that paste) can take a long time, because the amount of work we do is proportional to the size of the input.

Reorganize some of the UTF8 code so we do less work and examine only about as much of the input as we could possibly need in order to generate the desired output.

Test Plan
  • This code is well-covered by unit tests.
  • Added a new unit test which ran in ~4s before the change and runs in ~2ms afterward on my machine (about 2000x faster).
  • Created a huge paste, viewed from web UI.

Diff Detail

Repository
rPHU libphutil
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to Improve UTF8StringTruncator behavior for huge inputs.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
src/utils/PhutilUTF8StringTruncator.php
Line 123

Adding $hard_limit here has the biggest impact. It just says "only look at the beginning of the string, since we don't need more than that".
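
To illustrate the idea (this is a rough Python sketch, not the PHP in this diff): split only as many leading codepoints as the output could need, then stop. The names `utf8v_prefix`, `truncate`, `max_glyphs` and `terminal` are hypothetical, the input is assumed to be valid UTF-8, and combining characters and word boundaries are ignored here.

```python
def utf8v_prefix(data: bytes, max_codepoints: int) -> list:
    """Split at most max_codepoints leading codepoints out of UTF-8 bytes.

    Assumes `data` is valid UTF-8. Work is bounded by max_codepoints,
    not by len(data).
    """
    out = []
    i = 0
    n = len(data)
    while i < n and len(out) < max_codepoints:
        lead = data[i]
        if lead < 0x80:
            width = 1          # ASCII
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        out.append(data[i:i + width])
        i += width
    return out


def truncate(data: bytes, max_glyphs: int, terminal: str = "...") -> str:
    """Truncate to at most max_glyphs codepoints, ending in `terminal`."""
    # One extra codepoint is enough to tell "fits" from "needs cutting".
    hard_limit = max_glyphs + 1
    points = utf8v_prefix(data, hard_limit)
    if len(points) <= max_glyphs:
        # We stopped because the input ran out, so it fits as-is.
        return b"".join(points).decode("utf-8")
    # Leave room for the terminal; never keep a negative count.
    keep = max(max_glyphs - len(terminal), 0)
    return b"".join(points[:keep]).decode("utf-8") + terminal
```

With this shape, truncating a multi-megabyte input to 10 codepoints touches only the first 11 codepoints of the input.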

Line 129

Passing $string_pv ("string codepoint vector") here instead of $string ("raw string") saves us a little work. Previously, this function immediately called phutil_utf8v() on the string again to turn it into a vector.

src/utils/utf8.php
Line 747

The changes here are:

  • Expose a lower-level function which takes output from phutil_utf8v() so we don't have to call it twice.
  • Avoid the extra phutil_utf8v() call if the first character is a combining character (we'll almost never hit this).
  • Avoid the unnecessary array_values() calls on a potentially large array.
  • Slightly optimize the common case of ASCII strings.
  • Try to make the function a little clearer overall.
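
A minimal Python sketch of how these points could fit together (not the PHP in this diff). The function name `combine_characters_v` is hypothetical, and Python's `unicodedata.combining()` stands in for whatever combining-character test libphutil actually uses:

```python
import unicodedata


def combine_characters_v(points: list) -> list:
    """Group an already-split codepoint vector into glyphs.

    Each combining mark is attached to the glyph before it. The output
    is built by appending, so no entries are removed and no reindexing
    pass over a large list is needed afterward.
    """
    glyphs = []
    for p in points:
        # ASCII is never a combining mark, so skip the lookup entirely.
        # A combining mark at the very start has nothing to attach to,
        # so it becomes its own glyph.
        if p < "\x80" or not glyphs or not unicodedata.combining(p):
            glyphs.append(p)
        else:
            glyphs[-1] += p
    return glyphs
```

For example, `combine_characters_v(list("e\u0301a"))` groups the `e` and the combining acute accent into one glyph and leaves `a` as a second one.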
chad edited edge metadata.
This revision is now accepted and ready to land. Oct 26 2015, 4:44 PM
This revision was automatically updated to reflect the committed changes.