Implement phutil_is_utf8_with_only_bmp_characters() without segfaulting
ClosedPublic

Authored by epriestley on Feb 23 2014, 8:14 PM.

Details

Summary

See comments. The regexp-based implementation segfaults unpreventably on small inputs. Do it nice and slow in PHP instead.
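For illustration only, a byte-by-byte check along these lines might look like the sketch below. This is not the code from the diff: the function name `sketch_is_utf8_bmp()` is hypothetical, the byte ranges follow the standard well-formed UTF-8 table restricted to sequences of one to three bytes (U+0000 through U+FFFF, surrogates excluded), and accepting the NUL byte is an assumption.

```php
<?php

// Hypothetical sketch, not the implementation from this revision.
// Returns true if $string is well-formed UTF-8 containing only BMP
// characters (1 to 3 byte sequences, no surrogates, no overlong forms).
function sketch_is_utf8_bmp($string) {
  $len = strlen($string);
  $ii = 0;

  while ($ii < $len) {
    $b0 = ord($string[$ii]);

    // Single-byte (ASCII) character. Accepting 0x00 is an assumption.
    if ($b0 <= 0x7F) {
      $ii += 1;
      continue;
    }

    if ($b0 >= 0xC2 && $b0 <= 0xDF) {
      // Two-byte sequence; 0xC0 and 0xC1 would be overlong.
      $need = 1;
      $min = 0x80;
      $max = 0xBF;
    } else if ($b0 >= 0xE0 && $b0 <= 0xEF) {
      // Three-byte sequence. 0xE0 needs a second byte >= 0xA0 (no overlong
      // forms); 0xED needs a second byte <= 0x9F (no surrogates).
      $need = 2;
      $min = ($b0 == 0xE0) ? 0xA0 : 0x80;
      $max = ($b0 == 0xED) ? 0x9F : 0xBF;
    } else {
      // Stray continuation byte, four-byte lead (outside the BMP), or junk.
      return false;
    }

    // Truncated sequence at the end of the string.
    if ($ii + $need >= $len) {
      return false;
    }

    $b1 = ord($string[$ii + 1]);
    if ($b1 < $min || $b1 > $max) {
      return false;
    }

    if ($need == 2) {
      $b2 = ord($string[$ii + 2]);
      if ($b2 < 0x80 || $b2 > 0xBF) {
        return false;
      }
    }

    $ii += $need + 1;
  }

  return true;
}
```

Because this is a plain loop over bytes, its stack usage does not grow with input length, which is the property the regexp version lacked.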

Test Plan

Ran unit tests.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Unfortunate.

This test case doesn't reproduce for me (Debian box), but the PHP bug indicates that capturing triggers the segfault. Does the same issue reproduce if capturing is disabled?

"/^(?:".
  ...
src/utils/utf8.php
72–73

This function could be condensed a bit (and possibly sped up) if written with bitwise operators instead of comparisons, but this looks good to me if you prefer this style. Likely more readable as-is, too.
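As a rough illustration of what the bitwise style could look like, the sketch below classifies bytes with masks instead of range comparisons. The helper names are hypothetical and this is not code from the diff. Note that masks alone do not reject overlong leads (0xC0, 0xC1) or surrogate and overlong second bytes, so a full validator would still need a few range checks on top.

```php
<?php

// Hypothetical helper: classify a lead byte using bit masks only.
// Returns the sequence length (1, 2, or 3), or 0 if the byte cannot start
// a BMP character (continuation byte, four-byte lead, or 0xF8 and above).
// It does NOT reject the overlong leads 0xC0 and 0xC1.
function sketch_utf8_lead_length($byte) {
  // Parentheses are required: in PHP, === binds tighter than &.
  if (($byte & 0x80) === 0x00) {
    return 1; // 0xxxxxxx
  }
  if (($byte & 0xE0) === 0xC0) {
    return 2; // 110xxxxx
  }
  if (($byte & 0xF0) === 0xE0) {
    return 3; // 1110xxxx
  }
  return 0;
}

// Hypothetical helper: true if the byte has the form 10xxxxxx.
function sketch_utf8_is_continuation($byte) {
  return ($byte & 0xC0) === 0x80;
}
```

Whether this is actually faster in PHP would need to be measured; the comparison style keeps the special cases for overlong forms and surrogates visible in one place.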

Yeah, I wasn't able to get anything that had even approximately the same behavior. Here's a simple case on my machine:

>>> orbital ~ $ php -r "preg_match('/^(?:a)+$/', str_repeat('a', 1024 * 64));"
Segmentation fault: 11
>>> orbital ~ $ php -v
PHP 5.5.8 (cli) (built: Jan 21 2014 11:16:51) 
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2013 Zend Technologies

If it shows up in profiles we can provide an extension (T2312) or do something faster for short strings. This isn't really that slow; it's just dramatically slower than it would be in C.