
Tokenize hyphens for typeahead
Closed, Resolved (Public)

Description

Right now if you name a project:

I Love Google

You can 'discover' this via typeahead when assigning projects by typing 'Google'.

If you name something:

I-Love-Google

It doesn't work the same way. It would be great if it did :)

Event Timeline

chasemp set the priority of this task to Needs Triage.
chasemp updated the task description.
chasemp added projects: Wikimedia, Maniphest.
chasemp added a subscriber: chasemp.
joshuaspence renamed this task from tokenize hyphens for typeahead to Tokenize hyphens for typeahead. (Jul 28 2014, 8:51 PM)

Is the hyphen thing a common convention?

I feel like this should be done via a substring search as opposed to treating hyphens like spaces; in theory other delimiters could be used or no delimiters at all.

I think the tokenizer was originally written for human names (since it evolved out of Facebook code), where "Conway-Jacobs" and "O'Shannessy" are unambiguously one token.

Now that it's used for everything, these rare human-name edge cases aren't unambiguous.

This feature is being requested in order to implement a hacky workaround version of T3670, but regardless of that use case I think it's reasonable to tokenize a little more aggressively and match "iPhone Backend-Engineering" when you type "engineering".

We could also swap to substring search, but then, e.g., "van" will hit "evan", when you probably do not want that match.

We also can't build performant indexes in MySQL for arbitrary substrings.
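The "van" versus "evan" concern can be illustrated with a minimal sketch. The helper names below (tokenize, prefixMatch, substringMatch) are hypothetical and are not Phabricator's actual functions; the sketch only assumes that the typeahead matches a query against the start of each token.

```javascript
// Hypothetical helpers for illustration only; not Phabricator's code.

// Split a name into lowercase tokens on whitespace.
function tokenize(name) {
  return name.toLowerCase().split(/\s+/).filter(Boolean);
}

// Token-prefix match: the query must be a prefix of at least one token.
function prefixMatch(name, query) {
  const q = query.toLowerCase();
  return tokenize(name).some((token) => token.startsWith(q));
}

// Substring match: the query may appear anywhere inside the name.
function substringMatch(name, query) {
  return name.toLowerCase().includes(query.toLowerCase());
}

console.log(prefixMatch("evan", "van"));        // false
console.log(substringMatch("evan", "van"));     // true
console.log(prefixMatch("van der Berg", "van")); // true
```

Prefix matching on tokens also lines up with an ordinary ordered index on the token column, whereas an arbitrary substring query generally cannot use one.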

The fix here is likely:

  • Adjust JX.TypeaheadNormalizer.normalize() to replace hyphens with spaces instead of with the empty string during normalization.
  • Make PhabricatorTypeaheadDatasource::tokenizeString() do the same.
    • (For bonus points: align tokenizeString() slightly better to the web frontend behavior, since there are currently behavioral differences on other strings.)
  • Test by naming a project "X-Y" and then finding it by typing "Y" in a typeahead.
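The change described in the list above can be sketched roughly as follows. This is not the real JX.TypeaheadNormalizer.normalize() or tokenizeString(); normalizeOld, normalizeNew, and matches are hypothetical names, and the only behavior assumed is the one stated above (hyphens removed before the change, replaced with spaces after it).

```javascript
// Hypothetical sketch; the real normalizer handles more cases than this.

// Assumed old behavior: hyphens are dropped, so "X-Y" becomes one token "xy".
function normalizeOld(str) {
  return str.toLowerCase().replace(/-/g, "").replace(/\s+/g, " ").trim();
}

// Proposed behavior: hyphens become spaces, so "X-Y" yields tokens "x" and "y".
function normalizeNew(str) {
  return str.toLowerCase().replace(/-/g, " ").replace(/\s+/g, " ").trim();
}

// Match a single-word query against the start of any token in the name.
function matches(name, query, normalize) {
  const q = normalize(query);
  const tokens = normalize(name).split(" ").filter(Boolean);
  return tokens.some((token) => token.startsWith(q));
}

console.log(matches("X-Y", "Y", normalizeOld)); // false
console.log(matches("X-Y", "Y", normalizeNew)); // true
```

Applying the same normalizer to both the query and the stored name keeps the two sides consistent, which is the point of changing the frontend and the datasource together.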

Didn't claim my bonus points yet but got a naive version of this going. I am using "Frontend-Engineering", "Backend-Engineering", and "Engineering".

Starting from a clear browser cache...

  • If I type in "Eng" I get Engineering and not the other two. :/
  • However, if I prime the typeahead first by typing "Fro", clearing it, then doing "Bac", clearing it, and THEN typing "Eng", it works nicely, with all three showing in alphabetical order.

Long story short, I *think* the issue is that the various "token" tables will need to be re-populated using the new PhabricatorTypeaheadDatasource::tokenizeString(). Does this sound right, and if so do we already have an easy way to do this? If not, I'll write a migration script I guess as part of this diff.
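To illustrate why stored tokens would go stale, here is a small in-memory model. TokenIndex and both tokenizers are hypothetical stand-ins, not Phabricator classes, and the model covers only a server-side token table where tokens are computed when a name is saved; it does not model the browser-side cache behind the "priming" behavior.

```javascript
// Hypothetical in-memory stand-in for a server-side token table.
const oldTokenize = (s) =>
  s.toLowerCase().replace(/-/g, "").split(/\s+/).filter(Boolean);
const newTokenize = (s) =>
  s.toLowerCase().replace(/-/g, " ").split(/\s+/).filter(Boolean);

class TokenIndex {
  constructor(tokenizer) {
    this.tokenizer = tokenizer;
    this.rows = new Map(); // name -> tokens stored at save time
  }
  // Tokens are computed once, when the name is saved.
  save(name) {
    this.rows.set(name, this.tokenizer(name));
  }
  // Deploying a new tokenizer does not touch rows that are already stored.
  setTokenizer(tokenizer) {
    this.tokenizer = tokenizer;
  }
  // Recompute stored tokens for every name with the current tokenizer.
  reindex() {
    for (const name of Array.from(this.rows.keys())) {
      this.save(name);
    }
  }
  // Return names with at least one stored token starting with the prefix.
  search(prefix) {
    const q = prefix.toLowerCase();
    const hits = [];
    for (const [name, tokens] of this.rows) {
      if (tokens.some((t) => t.startsWith(q))) hits.push(name);
    }
    return hits.sort();
  }
}

const index = new TokenIndex(oldTokenize);
["Frontend-Engineering", "Backend-Engineering", "Engineering"].forEach((n) =>
  index.save(n)
);
index.setTokenizer(newTokenize);
console.log(index.search("eng")); // only "Engineering" until rows are rebuilt
index.reindex();
console.log(index.search("eng")); // all three names
```

In this model the new tokenizer has no effect on existing rows until each name is saved again or the whole table is rebuilt.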

> Does this sound right

Yep.

> and if so do we already have an easy way to do this?

Not really -- edit and save the project I guess? But you might have to make an actual change.

You could also make PhabricatorProjectSearchIndexer call the update method, then use bin/search index --type PROJ.

(I don't think it's a big deal if we don't migrate the tokens, especially if I can point users at bin/search index.)