Remove call to PHP "utf8_decode()" in "phutil_utf8_strlen()"
Closed, Public

Authored by epriestley on Feb 4 2023, 1:50 PM.

Details

Reviewers: None
Maniphest Tasks: T13588: PHP 8 Compatibility
Commits: rARCd87d5f0e02e2: Remove call to PHP "utf8_decode()" in "phutil_utf8_strlen()"

Summary

Ref T13588. See PHI2228. Under PHP 8.2 and newer, calls to "utf8_decode()" raise a deprecation warning.

The behavior of this function probably isn't great under any PHP version since it maps UTF8 space into ISO-8859-1 space, which isn't an operation you can really implement "correctly" in the general case. For the specific narrow case of counting characters, as here, you could probably do worse, but I think the PHP upstream's deprecation of this function is entirely reasonable.

The fallback implementation for strlen, "phutil_utf8v()", should produce the correct result in all cases. The downside is that it's runtime UTF8 parsing in PHP, which may be slow. This function is called rarely and I don't think this will present a problem, but if it does I'd rather address it once it arises.
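
As a rough illustration of why counting characters in pure PHP is linear in the byte length, here is a minimal sketch. The name "example_utf8_strlen()" is hypothetical and this is not the actual "phutil_utf8v()" implementation, which vectorizes the string; the sketch also assumes the input is already valid UTF8 and only counts bytes that start a character.

```php
<?php

// Hypothetical helper for illustration only; not part of Arcanist.
// Counts UTF8 characters (code points) by walking the string one byte at
// a time, without calling "utf8_decode()". Assumes valid UTF8 input.
function example_utf8_strlen($string) {
  $count = 0;
  $bytes = strlen($string);

  for ($ii = 0; $ii < $bytes; $ii++) {
    // Continuation bytes have the bit pattern 10xxxxxx. Every other byte
    // begins a new character, so count only those.
    if ((ord($string[$ii]) & 0xC0) !== 0x80) {
      $count++;
    }
  }

  return $count;
}

// "h", a two-byte "e with acute accent", then "llo": 6 bytes, 5 characters.
$example = "h\xC3\xA9llo";
echo strlen($example)."\n";              // 6
echo example_utf8_strlen($example)."\n"; // 5
```

The loop touches every byte once, which is the "O(N)" cost mentioned above; a real implementation would also need to reject invalid byte sequences.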

Test Plan

Under PHP 8.2, ran "arc lint". After the patch, no longer saw the deprecation warning.

Diff Detail

Repository

rARC Arcanist

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley requested review of this revision. Feb 4 2023, 1:50 PM

epriestley created this revision.

Harbormaster completed remote builds in B25766: Diff 52094. Feb 4 2023, 1:50 PM

Slightly smaller diff.

Harbormaster completed remote builds in B25767: Diff 52095. Feb 4 2023, 1:51 PM

An easy optimization here is likely to avoid calling phutil_utf8v(...) on very long strings using a strategy like this:

function phutil_utf8_strlen_compare($string, $operator, $value) {
  // To compute the length of a UTF8 string, we currently must vectorize it
  // with "phutil_utf8v()". This operation has complexity "O(N)" on the
  // length of the string, so it is relatively expensive if the string
  // is long.

  // If we know which comparison we're performing, we can sometimes perform
  // the comparison in "O(1)" without vectorizing the string, since extreme
  // cases are trivially true or false: we can easily find an upper and
  // lower bound for the character length of the string from its byte length.

  // A byte string of length "N" has a maximum UTF8 character length of "N"
  // (if each byte is a one-byte character, like regular ASCII text).

  // A byte string of length "N" has a minimum UTF8 character length of
  // "ceil(N / 6)" (if each character uses the longest sequence the original
  // UTF8 encoding allowed, 6 bytes).

  $max_length = strlen($string);
  $min_length = (int)ceil($max_length / 6);

  switch ($operator) {
    case '<':
      if ($max_length < $value) {
        return true;
      }
      if ($min_length >= $value) {
        return false;
      }
      break;
    default:
      throw new Exception(
        pht(
          'Unknown string comparison operator "%s"!',
          $operator));
  }

  $actual_length = phutil_utf8_strlen($string);

  // ...