Ref T6223. Two issues:
- We don't use /u mode on these regexps. Without /u, the \w/\W/\s/\S character classes operate on individual bytes and behave badly on non-ASCII text. Add the flag to enable Unicode mode, so \w and \s behave as we expect.
  - We may want to do something different here eventually (for example, if the /u flag turns out to have a large performance penalty), but this seems OK for now.
- We use \b (word boundary) to terminate the match, but ? is not a word character, so \b cannot match after it unless a word character follows. Use (?!\w) instead ("don't match before a word character"), which is what we mean.
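
A rough sketch of both issues, written in Python rather than against the real patterns. Everything here is an assumption made for illustration: the tag patterns are hypothetical, and Python's re.ASCII flag is only an approximate stand-in for matching without /u (PCRE without /u works on raw bytes, which Python's str patterns do not).

```python
import re

# Hypothetical tag patterns for illustration; not the regexps changed here.
TAG_WITH_BOUNDARY = re.compile(r"#\S+\b")       # old style: end on a word boundary
TAG_WITH_LOOKAHEAD = re.compile(r"#\S+(?!\w)")  # new style: end where no word char follows

text = "see #what? next"

# "?" and " " are both non-word characters, so there is no boundary between
# them and the \b version has to backtrack and drop the "?".
boundary_match = TAG_WITH_BOUNDARY.search(text).group(0)
lookahead_match = TAG_WITH_LOOKAHEAD.search(text).group(0)

# Approximate analogue of the /u issue: with re.ASCII, \w only matches
# [A-Za-z0-9_], so the tag is cut off at the first non-ASCII letter.
accented = "#caf\u00e9"
ascii_match = re.compile(r"#\w+", re.ASCII).search(accented).group(0)
unicode_match = re.compile(r"#\w+").search(accented).group(0)

print(boundary_match, lookahead_match, ascii_match, unicode_match)
```

In this sketch the \b pattern yields "#what" while the lookahead pattern yields "#what?", and the ASCII-only pattern truncates the accented tag to "#caf".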