
Remove call to PHP "utf8_decode()" in "phutil_utf8_strlen()"
ClosedPublic

Authored by epriestley on Feb 4 2023, 1:50 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T13588. See PHI2228. Under PHP 8.2 and newer, calls to "utf8_decode()" raise a deprecation warning.

The behavior of this function probably isn't great under any PHP version since it maps UTF8 space into ISO-8859-1 space, which isn't an operation you can really implement "correctly" in the general case. For the specific narrow case of counting characters, as here, you could probably do worse, but I think the PHP upstream's deprecation of this function is entirely reasonable.

The fallback implementation for strlen, "phutil_utf8v()", should produce the correct result in all cases. The downside is that it's runtime UTF8 parsing in PHP, which may be slow. This function is called rarely and I don't think this will present a problem, but if it does I'd rather address it once it arises.
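To make the cost of the fallback concrete, here is a minimal sketch of counting characters by walking the bytes of a string in PHP. The names "example_utf8v()" and "example_utf8_strlen()" are hypothetical stand-ins and this is not the Arcanist implementation of "phutil_utf8v()"; it assumes well-formed input and only inspects lead bytes.

```php
<?php

// Hypothetical stand-in for a UTF8 vectorizer; NOT the Arcanist code.
// Splits a byte string into a list of characters by reading each lead
// byte to learn how many bytes the character occupies. Continuation
// bytes are not validated here, which a real implementation should do.
function example_utf8v($string) {
  $result = array();
  $len = strlen($string);
  $ii = 0;

  while ($ii < $len) {
    $byte = ord($string[$ii]);

    if ($byte < 0x80) {
      $seq = 1; // 0xxxxxxx: single-byte (ASCII) character.
    } else if (($byte & 0xE0) === 0xC0) {
      $seq = 2; // 110xxxxx: two-byte sequence.
    } else if (($byte & 0xF0) === 0xE0) {
      $seq = 3; // 1110xxxx: three-byte sequence.
    } else if (($byte & 0xF8) === 0xF0) {
      $seq = 4; // 11110xxx: four-byte sequence.
    } else {
      throw new Exception('Invalid UTF8 lead byte.');
    }

    if ($ii + $seq > $len) {
      throw new Exception('Truncated UTF8 sequence.');
    }

    $result[] = substr($string, $ii, $seq);
    $ii += $seq;
  }

  return $result;
}

// Character length is the number of entries in the vector. Every byte
// is visited once, in interpreted PHP, so the cost grows with the byte
// length of the string.
function example_utf8_strlen($string) {
  return count(example_utf8v($string));
}
```

For example, "example_utf8_strlen("h\xC3\xA9llo")" counts five characters in a six-byte string.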

Test Plan

Under PHP 8.2, ran "arc lint". After the patch, no longer saw the deprecation warning.

Diff Detail

Repository
rARC Arcanist
Branch
utf81
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 25767
Build 35595: arc lint + arc unit

Event Timeline

epriestley created this revision.

A likely easy optimization here is to avoid calling phutil_utf8v(...) on very long strings, using a strategy like this:

function phutil_utf8_strlen_compare($string, $operator, $value) {
  // To compute the length of a UTF8 string, we currently must vectorize it
  // with "phutil_utf8v()". This operation has complexity "O(N)" on the
  // length of the string, so it is relatively expensive if the string
  // is long.

  // If we know which comparison we're performing, we can sometimes perform
  // the comparison in "O(1)" without vectorizing the string, since extreme
  // cases are trivially true or false: we can easily find an upper and
  // lower bound for the character length of the string from its byte length.

  // A byte string of length "N" has a maximum UTF8 character length of "N"
  // (if each byte is a one-byte character, like regular ASCII text).

  // A byte string of length "N" has a minimum UTF8 character length of
  // "ceil(N / 6)" (if each character uses the longest encoding; the
  // original UTF8 design allowed sequences of up to six bytes).

  $max_length = strlen($string);
  $min_length = (int)ceil($max_length / 6);

  switch ($operator) {
    case '<':
      if ($max_length < $value) {
        return true;
      }
      if ($min_length >= $value) {
        return false;
      }
      break;
    default:
      throw new Exception(
        pht(
          'Unknown string comparison operator "%s"!',
          $operator));
  }

  $actual_length = phutil_utf8_strlen($string);

  // ...

...but I'd like evidence that this is a useful optimization before pursuing it.
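As a self-contained illustration of the bounds arithmetic only, the sketch below returns a decision when the byte length settles a "<" comparison and "null" when the string would still have to be measured. The function names and the "$max_bytes_per_char" parameter are hypothetical. The default of 6 follows the comment above; RFC 3629 restricts UTF8 to four bytes per character, so passing 4 would give a tighter lower bound if the input is known to conform.

```php
<?php

// Hypothetical helper: returns array($min, $max), the smallest and
// largest character lengths a string of this byte length could have.
function example_utf8_length_bounds($string, $max_bytes_per_char = 6) {
  $byte_length = strlen($string);

  return array(
    (int)ceil($byte_length / $max_bytes_per_char),
    $byte_length,
  );
}

// Hypothetical helper: answers "is the character length < $value?"
// using only the bounds. Returns true or false when the bounds decide
// the comparison, or null when the actual length must be computed.
function example_length_less_than($string, $value) {
  list($min, $max) = example_utf8_length_bounds($string);

  if ($max < $value) {
    return true; // Even all-ASCII text would be short enough.
  }

  if ($min >= $value) {
    return false; // Even the densest encoding would be too long.
  }

  return null; // Undecided: the caller must vectorize the string.
}
```

A 600-byte string compared against a limit of 10 is rejected without parsing, because its minimum possible length is 100 characters. A 10-byte string compared against a limit of 5 falls between the bounds and still needs the full count.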

This revision was not accepted when it landed; it landed in state Needs Review.
Feb 4 2023, 1:54 PM
This revision was automatically updated to reflect the committed changes.
epriestley edited the summary of this revision.