Autocompletion by most popular phrases in the text

Let’s say I have many text fields indexed:

{
  "body": "The quick brown fox jumps over the lazy dog"
}

{
  "body": "There was a quick brown fox in a forest"
}

{
  "body": "Forest was a most fun place for a lazy dog to play."
}

{
  "body": "He wants a lazy dog to play."
}
...

I would like to implement search phrase suggestions based on the phrases in the indexed text. For example, if a user types qui, the suggestions show quick. If the user types quick, it would suggest quick brown, quick brown fox, etc.

The OOTB autocomplete feature seems to give me the whole body field, not just a phrase. search_as_you_type does the same. Term suggesters suggest just one word, and phrase suggesters just try to fix typos. Nothing that I have tried has given me the result I need.

Is there a way to accomplish this seemingly simple task?

It sounds like you’ll need to write some code. Here are some options that come to mind:

  • autocomplete: it gives you the whole body indeed, but it’s up to you what that whole body is. If you index all possible combinations (hacky, I know), then it’s the fastest way to go about it (at least that I can think of)
  • plain queries: you can do a match_phrase_prefix with a highlighter. “qui” should give you a “quick” highlight, “quick br” will get you “quick brown”… not sure how you can get “quick brown” from “quick” though. I guess you’ll have to take the next word manually from the highlight snippet
  • terms aggregation with an include regex on a shingle field: if your analyzer is simple (e.g. whitespace+lowercase), this might be a good poor man’s solution. For example, say you have shingle sizes of two (i.e. pairs of two words) AND you also index individual tokens AND you can afford to have field data on that text field (watch out for the heap usage: if that’s unacceptable, you can shingle in your app before indexing and have an array of keywords… even uglier). Then you can do some parsing on the query side like this:

** if the user types one word (e.g. qui or quick - you don’t know which one is a complete word!), then you query for everything and have a terms aggregation including TYPED_WORD.+|TYPED_WORD\s.+. This should give you quick for qui and quick brown and other two-word combinations for quick.
** if the user types multiple words, then you can search for that phrase (e.g. match_phrase_prefix on “quick br”) and then facet on the last word. That’s risky, though, because whatever br.+ returns might not come right after quick. I guess you can have shingles of three and be more precise by searching for quick br.+ in the facet. You’ll then have a smaller problem when you get to three words: quick brown f would phrase-search for that, and the facet on brown f.+ will be quite likely to match stuff from that phrase.
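
To make the single-word case concrete, here is a rough sketch that simulates the shingle field and the terms aggregation in plain Python, without touching Elasticsearch. The function names (shingles, suggest), the simple lowercase word split, and counting each shingle once per document are all assumptions made for illustration; a real analyzer and aggregation may behave differently.

```python
import re
from collections import Counter

DOCS = [
    "The quick brown fox jumps over the lazy dog",
    "There was a quick brown fox in a forest",
    "Forest was a most fun place for a lazy dog to play.",
    "He wants a lazy dog to play.",
]

def shingles(text, max_size=2):
    """Lowercase, split into words, and emit every 1..max_size word shingle."""
    words = re.findall(r"\w+", text.lower())
    out = []
    for size in range(1, max_size + 1):
        for i in range(len(words) - size + 1):
            out.append(" ".join(words[i:i + size]))
    return out

def suggest(docs, typed, max_size=2):
    """Mimic a terms aggregation whose include regex is TYPED_WORD.+"""
    pattern = re.compile(re.escape(typed.lower()) + ".+")
    counts = Counter()
    for doc in docs:
        # Count each shingle once per document, similar to a doc count.
        for shingle in set(shingles(doc, max_size)):
            if pattern.fullmatch(shingle):
                counts[shingle] += 1
    # Most frequent first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(suggest(DOCS, "qui"))    # [('quick', 2), ('quick brown', 2)]
print(suggest(DOCS, "quick"))  # [('quick brown', 2)]
```

Because the pattern requires at least one character after the typed text, a complete word such as quick only yields longer shingles, which is the behaviour described above.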

Just a qui.+ brain dump here, because I don’t think it’s such a simple problem. It depends on what you consider to be a “word” as well. So far, I think the query + highlighter is good enough (simple to implement, accurate, fast…).
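
For the query + highlighter route, a minimal sketch could look like the following. It only builds the request body and post-processes a highlight snippet, so no client is needed. The helper names (build_request, suggestion_from_snippet) are hypothetical, and the snippet shape (each matched term wrapped in the default <em> tags) is an assumption you should check against what your highlighter actually returns.

```python
import re

# A run of one or more adjacent highlighted terms.
RUN = re.compile(r"(?:<em>[^<]*</em>\s*)+")
TAG = re.compile(r"</?em>")

def build_request(typed, field="body"):
    """Search body: phrase-prefix query plus highlighting on the same field."""
    return {
        "query": {"match_phrase_prefix": {field: typed}},
        "highlight": {"fields": {field: {}}},
    }

def suggestion_from_snippet(snippet, extra_words=0):
    """Join the first run of highlighted terms; optionally append following words."""
    match = RUN.search(snippet)
    if match is None:
        return None
    phrase = TAG.sub("", match.group(0)).strip()
    # Take the next plain words manually, as suggested for the "quick" case.
    following = re.findall(r"\w+", snippet[match.end():])[:extra_words]
    return " ".join([phrase] + following).lower()

snippet = "The <em>quick</em> brown fox jumps over the lazy dog"
print(suggestion_from_snippet(snippet))                 # quick
print(suggestion_from_snippet(snippet, extra_words=1))  # quick brown
```

Setting extra_words to 1 or 2 is one way to get quick brown and quick brown fox out of a hit for quick.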

Thanks. Your answer made it clear that I’m definitely not overlooking some out-of-the-box solution. I ended up with a separate back-end job that takes all the fields that need suggestions. It then extracts keywords from these fields using a 3rd-party NLP-like library that basically removes stop words, punctuation marks, etc. and returns sequences of non-stopwords that form a key phrase.

Finally, it indexes those phrases into a separate index, including a pre-calculated frequency of how often they occur, so that I can use a prefix phrase match query and order by frequency for better suggestion results.
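
For anyone curious, the idea can be sketched roughly like this in plain Python. This is not the library or the job actually used: the tiny STOPWORDS set, the function names, and the in-memory Counter standing in for the separate index are all assumptions for illustration, and the prefix lookup is a simple startswith rather than a real prefix phrase query.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real NLP library ships a much larger one.
STOPWORDS = {"a", "an", "the", "was", "in", "over", "for", "to", "he", "there", "most"}

DOCS = [
    "The quick brown fox jumps over the lazy dog",
    "There was a quick brown fox in a forest",
    "Forest was a most fun place for a lazy dog to play.",
    "He wants a lazy dog to play.",
]

def key_phrases(text):
    """Split text into runs of consecutive non-stopwords."""
    phrases, current = [], []
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in STOPWORDS or not token.isalnum():
            # A stop word or punctuation mark ends the current phrase.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases

def build_index(docs):
    """Pre-calculate how often each key phrase occurs across all documents."""
    return Counter(phrase for doc in docs for phrase in key_phrases(doc))

def suggest(index, typed, limit=5):
    """Return phrases starting with the typed text, most frequent first."""
    typed = typed.lower()
    hits = [(phrase, n) for phrase, n in index.items() if phrase.startswith(typed)]
    hits.sort(key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in hits[:limit]]

index = build_index(DOCS)
print(suggest(index, "la"))   # ['lazy dog']
print(suggest(index, "qui"))  # ['quick brown fox', 'quick brown fox jumps']
```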
