Improve UTF8StringTruncator behavior for huge inputs
ClosedPublic

Authored by epriestley on Oct 26 2015, 2:53 PM.
Tags
None
Referenced Files
F14110005: D14339.diff
Wed, Nov 27, 4:05 PM
F14106727: D14339.id34609.diff
Wed, Nov 27, 6:32 AM
Subscribers
None

Details

Summary

Fixes T9632. Currently, when you truncate a very big input (like a huge paste) into a very small output (like a snippet of that paste), it can take a long time, because the amount of work we do is proportional to the size of the input rather than the size of the output.

Reorganize some of the UTF8 code so we can do less work, and only examine about as much of the input as we can possibly need to look at in order to generate the desired output.

Test Plan
  • This code is well-covered by unit tests.
  • Added a new unit test which ran in ~4s before the change and runs in ~2ms afterward on my machine (roughly a 2000x speedup).
  • Created a huge paste and viewed it from the web UI.

Diff Detail

Repository
rPHU libphutil
Branch
utfx1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 8415
Build 9672: Run Core Tests
Build 9671: arc lint + arc unit

Event Timeline

epriestley retitled this revision to "Improve UTF8StringTruncator behavior for huge inputs".
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
src/utils/PhutilUTF8StringTruncator.php
Line 123

Adding $hard_limit here has the biggest impact. It just says "only look at the beginning of the string, since we don't need more than that".
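
As a rough illustration of the idea (not the code in this diff), here is a minimal sketch of truncating against a hard limit. The function names `sketch_utf8_prefix_vector()` and `sketch_truncate()` are hypothetical, the input is assumed to be valid UTF-8, the terminator is assumed to be ASCII, and the limit is counted in characters only. The real truncator handles more cases than this.

```php
<?php

/**
 * Hypothetical helper, not part of libphutil: split off at most $limit
 * leading characters of $string. Assumes $string is valid UTF-8.
 */
function sketch_utf8_prefix_vector($string, $limit) {
  $result = array();
  $len = strlen($string);
  $pos = 0;

  // Stop as soon as we have enough characters, however long the input is.
  while ($pos < $len && count($result) < $limit) {
    $byte = ord($string[$pos]);
    if ($byte < 0x80) {
      $width = 1; // ASCII
    } else if ($byte >= 0xF0) {
      $width = 4;
    } else if ($byte >= 0xE0) {
      $width = 3;
    } else {
      $width = 2; // Lead byte in 0xC0..0xDF, given valid input.
    }
    $result[] = substr($string, $pos, $width);
    $pos += $width;
  }

  return $result;
}

/**
 * Hypothetical truncator: keep at most $max_chars characters, including
 * the terminator. Assumes an ASCII terminator.
 */
function sketch_truncate($string, $max_chars, $terminator = '...') {
  // A string of N bytes has at most N characters, so this is a cheap exit.
  if (strlen($string) <= $max_chars) {
    return $string;
  }

  // One character past the limit is enough to know whether we must cut.
  $hard_limit = $max_chars + 1;
  $vector = sketch_utf8_prefix_vector($string, $hard_limit);

  // Fewer than $hard_limit characters means we consumed the whole string.
  if (count($vector) <= $max_chars) {
    return $string;
  }

  $keep = max(0, $max_chars - strlen($terminator));
  return implode('', array_slice($vector, 0, $keep)).$terminator;
}
```

With this shape, `sketch_truncate(str_repeat("\xC3\xA9", 1000000), 5)` only inspects the first six characters of a two-megabyte input, so the cost tracks the output size rather than the input size.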

Line 129

Passing $string_pv ("string codepoint vector") here instead of $string ("raw string") saves us a little work. Previously, this function immediately called phutil_utf8v() on the string again to turn it into a vector.

src/utils/utf8.php
Line 747

The changes here are:

  • Expose a lower-level function which takes output from phutil_utf8v() so we don't have to call it twice.
  • Avoid the extra phutil_utf8v() call if the first character is a combining character (we'll almost never hit this).
  • Avoid the unnecessary array_values() calls on a potentially large array.
  • Slightly optimize the common case of ASCII strings.
  • Try to make the function a little clearer overall.
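
To make the list above concrete, here is a hedged sketch of a combiner that takes an already-split character vector. The names `sketch_is_combining()` and `sketch_combine_characters()` are hypothetical, and using PCRE's `\p{M}` class to detect combining marks is an assumption made for illustration; the function changed in this diff may detect them differently.

```php
<?php

/**
 * Hypothetical check, not part of libphutil: is this single character a
 * combining mark?
 */
function sketch_is_combining($character) {
  // Fast path for the common case: a one-byte character is ASCII, and
  // ASCII characters are never combining marks.
  if (strlen($character) === 1) {
    return false;
  }

  // Assumption: treat anything in the Unicode "Mark" category as combining.
  return (bool)preg_match('/^\p{M}$/u', $character);
}

/**
 * Hypothetical combiner: takes a list of single characters (the kind of
 * vector a splitter would produce) and attaches each combining mark to the
 * character before it. The caller splits the string once and passes the
 * vector in, so nothing here needs to split it again.
 */
function sketch_combine_characters(array $characters) {
  $groups = array();

  foreach ($characters as $character) {
    if ($groups && sketch_is_combining($character)) {
      // Attach the mark to the most recent group.
      $groups[count($groups) - 1] .= $character;
    } else {
      // Start a new group. A leading combining mark also lands here and
      // simply becomes its own group, with no special re-splitting.
      $groups[] = $character;
    }
  }

  // $groups was built by appending, so it is already a dense list and does
  // not need an array_values() pass.
  return $groups;
}
```

For example, passing `array('a', 'e', "\xCC\x81", 'b')` (where `"\xCC\x81"` is U+0301, a combining acute accent) yields three groups, with the accent attached to the `e`.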
chad edited edge metadata.
This revision is now accepted and ready to land. Oct 26 2015, 4:44 PM
This revision was automatically updated to reflect the committed changes.