
Remove call to PHP "utf8_decode()" in "phutil_utf8_strlen()"
ClosedPublic

Authored by epriestley on Feb 4 2023, 1:50 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T13588. See PHI2228. Under PHP 8.2 and newer, calls to "utf8_decode()" raise a deprecation warning.

The behavior of this function probably isn't great under any PHP version since it maps UTF8 space into ISO-8859-1 space, which isn't an operation you can really implement "correctly" in the general case. For the specific narrow case of counting characters, as here, you could probably do worse, but I think the PHP upstream's deprecation of this function is entirely reasonable.

The fallback implementation for strlen, "phutil_utf8v()", should produce the correct result in all cases. The downside is that it's runtime UTF8 parsing in PHP, which may be slow. This function is called rarely and I don't think this will present a problem, but if it does I'd rather address it once it arises.
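To make "runtime UTF8 parsing in PHP" concrete, here is a minimal sketch of counting characters without "utf8_decode()". This is not the implementation of "phutil_utf8v()"; the function name "example_utf8_codepoint_count()" is hypothetical, and the sketch assumes the input is already valid UTF8. It counts every byte that is not a continuation byte, which costs one pass over the string.

```php
<?php

// Hypothetical helper for illustration only; not part of Arcanist.
// Assumes $string is valid UTF8. Every character starts with exactly one
// byte that is NOT a continuation byte (continuation bytes look like
// 10xxxxxx), so counting those bytes counts the characters.
function example_utf8_codepoint_count($string) {
  $count = 0;
  $length = strlen($string);

  for ($ii = 0; $ii < $length; $ii++) {
    // Mask the top two bits; 0x80 (binary 10......) marks a continuation.
    if ((ord($string[$ii]) & 0xC0) !== 0x80) {
      $count++;
    }
  }

  return $count;
}
```

Like the vectorizing approach, this walks the whole string in PHP, so the cost grows linearly with the byte length.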

Test Plan

Under PHP 8.2, ran "arc lint". After the patch, no longer saw the deprecation warning.

Diff Detail

Repository
rARC Arcanist
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley created this revision.

An easy optimization here is likely to avoid calling phutil_utf8v(...) on very long strings using a strategy like this:

function phutil_utf8_strlen_compare($string, $operator, $value) {
  // To compute the length of a UTF8 string, we currently must vectorize it
  // with "phutil_utf8v()". This operation has complexity "O(N)" on the
  // length of the string, so it is relatively expensive if the string
  // is long.

  // If we know which comparison we're performing, we can sometimes perform
  // the comparison in "O(1)" without vectorizing the string, since extreme
  // cases are trivially true or false: the byte length gives us an upper
  // and a lower bound for the character length of the string.

  // A byte string of length "N" has a maximum UTF8 character length of "N"
  // (if each byte is a one-byte character, like regular ASCII text).

  // A byte string of length "N" has a minimum UTF8 character length of
  // "ceil(N / 6)".

  $max_length = strlen($string);
  $min_length = (int)ceil($max_length / 6);

  switch ($operator) {
    case '<':
      if ($max_length < $value) {
        return true;
      }
      if ($min_length >= $value) {
        return false;
      }
      break;
    default:
      throw new Exception(
        pht(
          'Unknown string comparison operator "%s"!',
          $operator));
  }

  $actual_length = phutil_utf8_strlen($string);

  // ...

...but I'd like evidence that this is a useful optimization before pursuing it.
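For reference, a self-contained sketch of the bounds idea for the "<" case only might look like the following. The names "example_strlen_less_than()" and "example_count_characters()" are hypothetical, the fallback counter stands in for "phutil_utf8_strlen()", and the input is assumed to be valid UTF8. The divisor 6 follows the comment above; it is a conservative lower bound.

```php
<?php

// Hypothetical stand-in for "phutil_utf8_strlen()": counts bytes that are
// not UTF8 continuation bytes. Assumes valid UTF8 input.
function example_count_characters($string) {
  $count = 0;
  $length = strlen($string);
  for ($ii = 0; $ii < $length; $ii++) {
    if ((ord($string[$ii]) & 0xC0) !== 0x80) {
      $count++;
    }
  }
  return $count;
}

// Hypothetical helper: returns true if $string has fewer than $value
// characters, avoiding the O(N) count when the byte length decides it.
function example_strlen_less_than($string, $value) {
  // Upper bound: every byte is its own one-byte character.
  $max_length = strlen($string);
  // Lower bound: every character uses the assumed maximum of 6 bytes.
  $min_length = (int)ceil($max_length / 6);

  if ($max_length < $value) {
    // Even the longest possible character length is below the limit.
    return true;
  }
  if ($min_length >= $value) {
    // Even the shortest possible character length reaches the limit.
    return false;
  }

  // The bounds are inconclusive, so count the characters exactly.
  return example_count_characters($string) < $value;
}
```

For a 100-byte string compared against 10, the lower bound is ceil(100 / 6) = 17, so the answer is false without scanning the string. The exact count only runs when the limit falls between the two bounds.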

This revision was not accepted when it landed; it landed in state Needs Review. (Feb 4 2023, 1:54 PM)
This revision was automatically updated to reflect the committed changes.
epriestley edited the summary of this revision.