Are synonyms supported with KNN search?

How can we support synonyms with KNN OpenSearch?

Hi @Aishwarya, could you give an example of how you want to use KNN with synonyms?

Hi @jmazane, So let's say I have added a synonym:
universe => cosmos
Now, I want to search for the keyword "universe". I have created an embedding for the search term "universe" and used it in my KNN search query, which works fine. But with the synonym in place, a search for "universe" should return results for "cosmos" as well.

I see. For this I would recommend using the disjunctive max query type.

GET /my_knn_index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "knn": { "field": { "vector": [a, ...], "k": X } } },
        { "knn": { "field": { "vector": [b, ...], "k": X } } }
      ]
    }
  }
}

From the Lucene Javadocs for DisjunctionMaxQuery:

“A query that generates the union of documents produced by its sub-queries, and that scores each document with the maximum score for that document as produced by any sub-query, plus a tie breaking increment for any additional matching sub-queries.”
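For intuition, the scoring rule described above can be sketched in a few lines. The per-subquery scores below are hypothetical, not real OpenSearch output:

```python
# Sketch of the dis_max scoring rule quoted above: a document's score is
# its best sub-query score, plus tie_breaker times each remaining
# matching sub-query score.
def dis_max_score(subquery_scores, tie_breaker=0.0):
    best = max(subquery_scores)
    rest = sum(subquery_scores) - best
    return best + tie_breaker * rest

# Hypothetical scores for one document: the "universe" vector matched
# at 1.0 and the "cosmos" vector at 0.5.
print(dis_max_score([1.0, 0.5]))                   # 1.0 (pure max)
print(dis_max_score([1.0, 0.5], tie_breaker=0.5))  # 1.25
```

With the default tie_breaker of 0, only the best-matching synonym vector determines the score; a nonzero tie_breaker rewards documents that match several variants.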

But with a transformer-based model in semantic search use cases, the vector representation of "cosmos" should already be similar to that of "universe".

Hi @jmazane,
We tried the solution, but it works at the search-query level: the number of sub-queries grows with the number of search terms. For example, if the search string has N keywords and each of them has synonyms, the query ends up with N (or more) sub-queries. Can this be handled at the mapping or indexing level instead, the way synonym filters work for regular text search?

Ideally, decently sized transformer models should be able to handle synonyms automatically.

The word "universe" is a bit tricky, since it probably gets tokenized as "uni" + "verse". Both of these subwords have common meanings of their own, so the sum of their two vectors may differ from the would-be vector of the single word "universe" (which is a synonym of "cosmos"). So although disappointing, it is not improbable that the model is having trouble.
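The subword effect can be illustrated with toy 2-D vectors. These are made-up numbers, not real model embeddings; the point is only that a naive composition of subword vectors can land further from the synonym's vector than the whole-word vector would:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-D vectors, purely illustrative (not real model embeddings):
whole_word = [0.3, 1.0]   # hypothetical vector a single "universe" token would get
composed   = [1.0, 1.0]   # hypothetical sum of "uni" + "verse" subword vectors
cosmos     = [0.2, 1.0]   # hypothetical "cosmos" vector

# The naively composed vector drifts away from the synonym's vector.
print(cosine(whole_word, cosmos) > cosine(composed, cosmos))  # True
```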

As pointed out, manual rules can quickly become unscalable or fail to generalize. If synonyms are really crucial, one recommendation would be to fine-tune the model on a thesaurus dataset. We should be careful while fine-tuning, though, since the training can become unstable and the model might get worse (catastrophic forgetting).

I tried both of the solutions mentioned above. With fine-tuning, we ran into the catastrophic forgetting issue. The dis_max approach gives relevant results, but the manual rules are becoming unscalable, and the dis_max sub-queries execute in series, which will impact performance.

If we have a search term like "approximate nearest neighbor" and all three keywords have synonyms, the dis_max ends up with 9 queries: the combinations of keywords and their synonyms, plus the original search term. Please let me know if my approach is not correct here.
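To make the blow-up concrete: with one synonym per keyword, the cross product of variants already contains 2 × 2 × 2 = 8 phrases (the original included), and each would need its own embedding and sub-query. A sketch, with made-up synonym lists:

```python
from itertools import product

# Each keyword maps to its list of variants (the keyword itself plus
# synonyms). These synonym lists are illustrative, not a real thesaurus.
variants = {
    "approximate": ["approximate", "estimated"],
    "nearest":     ["nearest", "closest"],
    "neighbor":    ["neighbor", "neighbour"],
}

# Every combination of one variant per keyword is a phrase to embed.
queries = [" ".join(combo) for combo in product(*variants.values())]
print(len(queries))  # 2 * 2 * 2 = 8 phrase variants, original included
```

The count is exponential in the number of keywords that have synonyms, which is why the manual dis_max approach stops scaling.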

Can we consider this as a feature request and explore options to support synonyms in KNN search?

One approach could be to send only one KNN query, for example:

Search keywords = ['universe', 'cosmos']; // universe with its synonym cosmos

a = [['vector of the universe keyword', 'vector of the cosmos keyword']]

{ "knn": { "field": { "vector": [a, ...] } } }

If KNN could internally handle this sub-array of vectors in parallel, performance would not be impacted. Or maybe we could somehow support an OR operator between keywords, like universe | cosmos, in the KNN query.

If the issue is performance as the number of queries scales, why don't you try _msearch? It allows you to run the searches in parallel. The only downside is that you need to union the results on the client side.
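For reference, an _msearch request body is newline-delimited JSON: one header line plus one query line per sub-search. A minimal sketch in Python (the index name, field name, and vectors are made-up placeholders):

```python
import json

def build_msearch_body(index, field, vectors, k=10):
    """Build an NDJSON _msearch body: a header line plus a k-NN query
    line for each vector. Index, field, and vectors are placeholders."""
    lines = []
    for vec in vectors:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"query": {"knn": {field: {"vector": vec, "k": k}}}}))
    # _msearch request bodies must end with a trailing newline.
    return "\n".join(lines) + "\n"

# One sub-search for the query term's vector, one for its synonym's vector.
body = build_msearch_body("my_knn_index", "my_vector", [[0.1, 0.2], [0.3, 0.4]], k=5)
print(body)
```

This body would be POSTed to `_msearch` with `Content-Type: application/x-ndjson`; the response contains one result set per sub-search, which the client then merges.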

As for the feature you are requesting, feel free to open a GitHub issue here.

@Navneet, Thanks for the suggestion. I tried this approach, and search speed has been affected adversely with _msearch.

Yes, with _msearch you are now running 2 queries in a single request, in parallel.

If we implement anything natively in k-NN where we search for n vectors in one query, it will have the same issue. The parallelization needs to happen at the k-NN plugin level. With _msearch the parallelism happens just above the k-NN plugin, i.e. at the OpenSearch coordinator node level, so both will have the same performance problems.

Did you face the performance issue while running a single _msearch query, or only at some scale?

If it happens at scale, then I think some performance tuning can help here, like increasing the search and msearch thread pools.

Created a GH issue for the feature request. [FEATURE]: Supporting Batch Query/array of vectors in K-NN Query · Issue #796 · opensearch-project/k-NN · GitHub


Here we also have a use case where we would like to boost the k-NN query. It could be the actual query, a query that contains some specific keywords, or maybe a k-NN search on some specific fields.

[
  { "knn": { "field": { "vector": a, "boost": 5 } } },
  { "knn": { "field": { "vector": b } } }
]

Can we consider this as well?
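A sketch of how such boosted sub-queries could be assembled client-side today (the field name and vectors are placeholders, and per-clause boost inside a k-NN query is an assumption of this sketch, not a confirmed plugin feature):

```python
import json

def knn_clause(field, vector, k=10, boost=None):
    """Build one k-NN sub-query clause; boost is attached only when
    given. Field name, vectors, and boost support are assumptions."""
    inner = {"vector": vector, "k": k}
    if boost is not None:
        inner["boost"] = boost
    return {"knn": {field: inner}}

# A dis_max over one boosted and one un-boosted k-NN clause.
query = {
    "query": {
        "dis_max": {
            "queries": [
                knn_clause("my_vector", [0.1, 0.2], boost=5),
                knn_clause("my_vector", [0.3, 0.4]),
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```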