Page MenuHomePhabricator

Add Ferret support to Paste
ClosedPublic

Authored by epriestley on Jul 15 2019, 7:42 PM.

Details

Summary

Ref PHI1292. Enable fulltext searchs in paste. Maybe this should only index a snippet instead of the entire content?

Also updates table names in PhabricatorPasteQuery.

Test Plan

Created some pastes, indexed them, searched for them.

Diff Detail

Repository
rP Phabricator
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

amckinley created this revision.Jul 15 2019, 7:42 PM
Owners added a subscriber: Restricted Owners Package.Jul 15 2019, 7:42 PM
amckinley added inline comments.Jul 15 2019, 7:43 PM
src/applications/paste/engine/PhabricatorPasteFulltextEngine.php
17

It looks like if I use $paste->getTitle() and that returns null, the document gets indexed without errors, but it doesn't come back in the search results?

amckinley requested review of this revision.Jul 15 2019, 7:44 PM

I'm a little uneasy about indexing the actual content, since I worry this will lead to a tragic event like "we learn that many installs routinely send 1GB logfiles consisting mostly of /dev/urandom output into Paste".

Ferret seems to be performing better than MyISAM/InnoDB FULLTEXT did, but I'm not confident it will stand up to use cases like this (see also T7472). Even if we try to adopt Ferret for this "codebase search" use case in the future, I'd probably like to separate the index and/or apply special rules around discarding documents that are too large or, uh, "too random" or whatever, and have this failure be non-silent ("Some documents matching all query parameters other than your fulltext query terms are really dumb and are not indexed by the fulltext engine, so they might or might not match your query. Click here for a list." or whatever, I guess, although this isn't great).

An especially cheap way to deal with this for now would be to just not index the content. This might be a bit confusing/surprising, but even a title-only search is better than nothing, so I think it's still a step forward.

Let me look at the null title thing, since that seems like it's probably a bug somewhere, and indexing the English-language untitled document string is probably not ideal.

src/applications/paste/engine/PhabricatorPasteFulltextEngine.php
24

Slowvote copy/paste mishap.

epriestley retitled this revision from Add Ferret support to Pastebin to Add Ferret support to Paste.Jul 18 2019, 5:01 PM
epriestley edited the summary of this revision. (Show Details)

(Also, "Pastebin" is the name of a product/company and our database name really shouldn't be pastebin, it just is since it was a contributed patch a million years ago and database names are a pain to change. This is basically like having a database named phabricator_yelp or whatever, though.)

..database names are a pain to change..

Let me also see exactly how much of a pain this is, I think we did it once and this patch definitely makes it harder.

That wasn't as nearly as bad as I thought.

(Also, "Pastebin" is the name of a product/company and our database name really shouldn't be pastebin, it just is since it was a contributed patch a million years ago and database names are a pain to change. This is basically like having a database named phabricator_yelp or whatever, though.)

Okay I know this review is like a year old, but... in my defense about Paste/Pastebin, Wikipedia agrees with me that pastebin is a generic term. And it's in wiktionary which, of course, makes it true because wiktionary is never wrong, ever. 😆

epriestley commandeered this revision.Apr 16 2020, 8:56 PM
epriestley edited reviewers, added: amckinley; removed: epriestley.

Haha -- well, yelp is a generic term, too!

This actual patch is better motivated now (see T13503) and Ferret seems stable and scalable. The paste database rename went through and I'm pretty sure the "no title" thing got fixed (D20660) so I think this is ripe now.

epriestley updated this revision to Diff 50322.Apr 16 2020, 9:06 PM

This didn't need much:

  • Timeshift the database patches up to the present day.
  • Update database name to "paste" in patches and FerretEngine.
  • Fix copy/paste mishap from Slowvote.
  • Index getTitle(), which may be null, as the document title. This is fine, and works with "title:-" (the new "field absence" operator) now.

Tested by:

  • Rebuilding indexes with bin/search index --type paste.
  • Searching by title and body content.
  • Using "title:-" to find pastes with no title.
This revision was not accepted when it landed; it landed in state Needs Review.Apr 16 2020, 9:10 PM
Closed by commit rPef1340bd3207: Add Ferret support to Paste (authored by amckinley, committed by epriestley). · Explain Why
This revision was automatically updated to reflect the committed changes.