Are synonyms supported with KNN search?

How can we support synonyms with KNN OpenSearch?

Hi @Aishwarya, could you give an example of how you want to use KNN with synonyms?

Hi @jmazane, So let's say I have added a synonym:
universe => cosmos
Now, I want to search for the keyword "universe". I have created an embedding for the search term "universe" and used it in my KNN search query, which works fine. But with the synonym in place, a search for "universe" should return results for "cosmos" as well.

I see. For this I would recommend using the disjunctive max query type.

GET /my_knn_index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "knn": { "field": { "vector": [a, ...], "k": X } } },
        { "knn": { "field": { "vector": [b, ...], "k": X } } }
      ]
    }
  }
}

From the Lucene Javadocs for DisjunctionMaxQuery:

“A query that generates the union of documents produced by its sub-queries, and that scores each document with the maximum score for that document as produced by any sub-query, plus a tie breaking increment for any additional matching sub-queries.”
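For intuition, the scoring rule described above can be sketched in a few lines. The per-subquery scores below are hypothetical, not real OpenSearch output:

```python
# Sketch of the dis_max scoring rule quoted above: a document's score is
# its best sub-query score, plus tie_breaker times each remaining
# matching sub-query score.
def dis_max_score(subquery_scores, tie_breaker=0.0):
    best = max(subquery_scores)
    rest = sum(subquery_scores) - best
    return best + tie_breaker * rest

# Hypothetical scores for one document: the "universe" vector matched
# at 1.0 and the "cosmos" vector at 0.5.
print(dis_max_score([1.0, 0.5]))                   # 1.0 (pure max)
print(dis_max_score([1.0, 0.5], tie_breaker=0.5))  # 1.25
```

With the default tie_breaker of 0, only the best-matching synonym vector determines the score; a nonzero tie_breaker rewards documents that match several variants.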

But with a transformer-based model in semantic search use cases, the vector representation of "cosmos" should already be similar to that of "universe".

Hi @jmazane,
We tried the solution, but it works at the search-query level: the number of sub-queries grows with the number of search terms. For example, if the search string has N keywords and each of them has synonyms, the query ends up with N (or more) sub-queries. Can this be handled at the mapping or indexing level instead, the way synonym filters work for regular text search?

Ideally, decently sized transformer models should be able to handle synonyms automatically.

The word "universe" is a bit tricky, since it probably gets tokenized as "uni" + "verse". Both of these subwords have common meanings of their own, so the sum of their two vectors may differ from the would-be vector of the single word "universe" (which is a synonym of "cosmos"). So although disappointing, it is not improbable that the model is having trouble.
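The subword effect can be illustrated with toy 2-D vectors. These are made-up numbers, not real model embeddings; the point is only that a naive composition of subword vectors can land further from the synonym's vector than the whole-word vector would:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-D vectors, purely illustrative (not real model embeddings):
whole_word = [0.3, 1.0]   # hypothetical vector a single "universe" token would get
composed   = [1.0, 1.0]   # hypothetical sum of "uni" + "verse" subword vectors
cosmos     = [0.2, 1.0]   # hypothetical "cosmos" vector

# The naively composed vector drifts away from the synonym's vector.
print(cosine(whole_word, cosmos) > cosine(composed, cosmos))  # True
```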

As pointed out, manual rules can quickly become unscalable or fail to generalize. If synonyms are really crucial, one recommendation would be to fine-tune the model on a thesaurus dataset. We should be careful while fine-tuning, though, since the training can become unstable and the model might get worse (catastrophic forgetting).

I tried both of the solutions mentioned above. With fine-tuning, we ran into the catastrophic forgetting issue. The dis_max approach gives relevant results, but the manual rules are becoming unscalable, and the dis_max sub-queries execute in series, which will impact performance.

If we have a search term like "approximate nearest neighbor" and all three keywords have synonyms, the dis_max ends up with 9 queries: the combinations of keywords and their synonyms, plus the original search term. Please let me know if my approach is not correct here.
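To make the blow-up concrete: with one synonym per keyword, the cross product of variants already contains 2 × 2 × 2 = 8 phrases (the original included), and each would need its own embedding and sub-query. A sketch, with made-up synonym lists:

```python
from itertools import product

# Each keyword maps to its list of variants (the keyword itself plus
# synonyms). These synonym lists are illustrative, not a real thesaurus.
variants = {
    "approximate": ["approximate", "estimated"],
    "nearest":     ["nearest", "closest"],
    "neighbor":    ["neighbor", "neighbour"],
}

# Every combination of one variant per keyword is a phrase to embed.
queries = [" ".join(combo) for combo in product(*variants.values())]
print(len(queries))  # 2 * 2 * 2 = 8 phrase variants, original included
```

The count is exponential in the number of keywords that have synonyms, which is why the manual dis_max approach stops scaling.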

Can we consider this as a feature request and explore options to support synonyms in KNN search?

One approach could be to send only one KNN query, for example:

Search keywords = ['universe', 'cosmos']; // universe with its synonym cosmos

a = [['vector of the universe keyword', 'vector of the cosmos keyword']]

{ "knn": { "field": { "vector": [a, ...] } } }

If KNN could internally handle this sub-array of vectors in parallel, performance would not be impacted. Or maybe we could somehow support an OR operator between keywords, like universe | cosmos, in the KNN query.

If the issue is performance as the number of queries scales, why don't you try _msearch? It allows you to run the searches in parallel. The only downside is that you need to union the results on the client side.
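For reference, an _msearch request body is newline-delimited JSON: one header line plus one query line per sub-search. A minimal sketch in Python (the index name, field name, and vectors are made-up placeholders):

```python
import json

def build_msearch_body(index, field, vectors, k=10):
    """Build an NDJSON _msearch body: a header line plus a k-NN query
    line for each vector. Index, field, and vectors are placeholders."""
    lines = []
    for vec in vectors:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": {"knn": {field: {"vector": vec, "k": k}}}}))
    # _msearch request bodies must end with a trailing newline.
    return "\n".join(lines) + "\n"

# One sub-search for the query term's vector, one for its synonym's vector.
body = build_msearch_body("my_knn_index", "my_vector", [[0.1, 0.2], [0.3, 0.4]], k=5)
print(body)
```

This body would be POSTed to `_msearch` with `Content-Type: application/x-ndjson`; the response contains one result set per sub-search, which the client then merges.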

As for the feature you are requesting, feel free to open a GitHub issue here.

@Navneet, Thanks for the suggestion. I tried this approach, and search speed has been affected adversely with _msearch.

Yes, with _msearch you are now running 2 queries in a single request, in parallel.

If we implement anything natively in k-NN where we search for n vectors in one query, it will have the same issue. The parallelization needs to happen at the k-NN plugin level. With _msearch the parallelism happens just above the k-NN plugin, i.e. at the OpenSearch coordinator node level, so both will have the same performance problems.

Did you face the performance issue while running a single _msearch query, or only at some scale?

If it happens at scale, then I think some performance tuning can help here, like increasing the search and msearch thread pools.

Created a GH issue for the feature request. [FEATURE]: Supporting Batch Query/array of vectors in K-NN Query · Issue #796 · opensearch-project/k-NN · GitHub


Here we also have a use case where we would like to boost the k-NN query. It could be the actual query, a query that contains some specific keywords, or maybe a k-NN search on some specific fields.

[
  { "knn": { "field": { "vector": a, "boost": 5 } } },
  { "knn": { "field": { "vector": b } } }
]

Can we consider this as well?
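A sketch of how such boosted sub-queries could be assembled client-side today (the field name and vectors are placeholders, and per-clause boost inside a k-NN query is an assumption of this sketch, not a confirmed plugin feature):

```python
import json

def knn_clause(field, vector, k=10, boost=None):
    """Build one k-NN sub-query clause; boost is attached only when
    given. Field name, vectors, and boost support are assumptions."""
    inner = {"vector": vector, "k": k}
    if boost is not None:
        inner["boost"] = boost
    return {"knn": {field: inner}}

# A dis_max over one boosted and one un-boosted k-NN clause.
query = {
    "query": {
        "dis_max": {
            "queries": [
                knn_clause("my_vector", [0.1, 0.2], boost=5),
                knn_clause("my_vector", [0.3, 0.4]),
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```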