Negative scores and duplicated results using Hybrid search

thalles.silva · March 14, 2024, 10:00pm

Version:
opensearch: 2.11.0
opensearch: 2.12.0

I am getting negative scores and duplicated results for hybrid search.

My configuration is as follows:

The index:

PUT /knn-sample-index
{
  "settings": {
    "index":{
       "knn":"true"
    }
  },
  "mappings": {
    "properties": {
      "textVector": {
        "type": "knn_vector",
        "dimension": 5,
        "method": {
          "engine": "faiss",
          "space_type": "innerproduct",
          "name": "hnsw",
          "parameters": {
            "ef_construction": 1024,
            "m": 64
          }
        }
      },
      "imageVector": {
        "type": "knn_vector",
        "dimension": 5,
        "method": {
          "engine": "faiss",
          "space_type": "innerproduct",
          "name": "hnsw",
          "parameters": {
            "ef_construction": 1024,
            "m": 64
          }
        }
      },
      "name": {
        "type": "text"
      }
    }
  }
}

The data (note that the vectors are L2-normalized):

# add data to index
PUT /knn-sample-index/_doc/1
{
"name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
"imageVector": [-0.5548,  0.3177,  0.4558, -0.5047,  0.3590],
"textVector": [-0.5313,  0.5175,  0.1438, -0.5471, -0.3605]
}

PUT /knn-sample-index/_doc/2
{
"name": "ASUS Chromebook Plus CX34 Laptop, 14 Display (1920x1080), Intel® Core i3-1215U Processor, 8GB RAM, 256GB UFS Storage, ChromeOS, White, CX3402CBA-DH386-WH",
"imageVector": [0.4884, 0.3328, 0.4026, 0.5094, 0.4786],
"textVector": [0.4003, 0.4308, 0.4791, 0.3798, 0.5295]
}

PUT /knn-sample-index/_doc/3
{
"name": "Sony 50 Inch 4K Ultra HD TV X85K Series: LED Smart Google TV with Dolby Vision HDR and Native 120HZ Refresh Rate KD50X85K- Latest Model, Black",
"imageVector": [0.4387, 0.4179, 0.4221, 0.5298, 0.4173]
}

PUT /knn-sample-index/_doc/4
{
"name": "Amazon Kindle Paperwhite (16 GB) – Now with a larger display, adjustable warm light, increased battery life, and faster page turns – Without Lockscreen Ads – Black",
"textVector": [0.4794, 0.4412, 0.4031, 0.4694, 0.4390]
}

PUT /knn-sample-index/_doc/5
{
"name": "SAMSUNG Galaxy S24+ Plus Cell Phone, 256GB AI Smartphone, Unlocked Android, 50MP Camera, Fastest Processor, Long Battery Life, US Version 2024 Cobalt Violet"
}

The query:

GET knn-sample-index/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "name": {
              "query": "Hi world"
            }
          }
        },
        {
          "knn": {
            "imageVector": {
              "vector": [1,1,1,1,1],
              "k": 3
            }
          }
        }
      ]
    }
  }
}

And the query output:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.073,
    "hits": [
      {
        "_index": "knn-sample-index",
        "_id": "1",
        "_score": -9549511700,
        "_source": {
          "name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
          "imageVector": [
            -0.5548,
            0.3177,
            0.4558,
            -0.5047,
            0.359
          ],
          "textVector": [
            -0.5313,
            0.5175,
            0.1438,
            -0.5471,
            -0.3605
          ]
        }
      },
      {
        "_index": "knn-sample-index",
        "_id": "1",
        "_score": -4422440400,
        "_source": {
          "name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
          "imageVector": [
            -0.5548,
            0.3177,
            0.4558,
            -0.5047,
            0.359
          ],
          "textVector": [
            -0.5313,
            0.5175,
            0.1438,
            -0.5471,
            -0.3605
          ]
        }
      },
      {
        "_index": "knn-sample-index",
        "_id": "1",
        "_score": -4422440400,
        "_source": {
          "name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
          "imageVector": [
            -0.5548,
            0.3177,
            0.4558,
            -0.5047,
            0.359
          ],
          "textVector": [
            -0.5313,
            0.5175,
            0.1438,
            -0.5471,
            -0.3605
          ]
        }
      },
      {
        "_index": "knn-sample-index",
        "_id": "1",
        "_score": 1.073,
        "_source": {
          "name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
          "imageVector": [
            -0.5548,
            0.3177,
            0.4558,
            -0.5047,
            0.359
          ],
          "textVector": [
            -0.5313,
            0.5175,
            0.1438,
            -0.5471,
            -0.3605
          ]
        }
      },
      {
        "_index": "knn-sample-index",
        "_id": "1",
        "_score": -9549511700,
        "_source": {
          "name": "Apple iPhone 13, 128GB, Pink - Unlocked (Renewed)",
          "imageVector": [
            -0.5548,
            0.3177,
            0.4558,
            -0.5047,
            0.359
          ],
          "textVector": [
            -0.5313,
            0.5175,
            0.1438,
            -0.5471,
            -0.3605
          ]
        }
      }
    ]
  }
}

Notice that there are negative scores and duplicated results.
If I take the hybrid search out and perform a simple knn query, it works fine.

Appreciate the help

NickBlow · April 17, 2024, 3:42pm

I had the same result until I added a search pipeline. Afterwards it all seemed to work as intended.
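
For readers who want to try this, a minimal sketch of registering a named search pipeline follows. It only builds the request for PUT /_search/pipeline/<name> and does not send it. The pipeline name "hybrid-pipeline", the helper name, and the choice of min_max with arithmetic_mean are assumptions for illustration, not something stated in this reply.

```python
import json


def build_pipeline_request(name="hybrid-pipeline"):
    """Return (method, path, body) for creating a search pipeline.

    The pipeline name and this helper are hypothetical; the processor
    settings mirror the normalization-processor options quoted elsewhere
    in this thread.
    """
    body = {
        "description": "Normalize and combine hybrid sub-query scores",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {"technique": "arithmetic_mean"},
                }
            }
        ],
    }
    return "PUT", f"/_search/pipeline/{name}", body


method, path, body = build_pipeline_request()
print(method, path)
print(json.dumps(body, indent=2))
```

The printed method, path, and JSON body can be pasted into Dev Tools or sent with any HTTP client.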

dswitzer2 · May 3, 2024, 7:53pm

I’m seeing duplicate results using hybrid search as well.

I think there’s a problem with the hybrid scoring, because the same query returns different results inside a hybrid search than when run standalone.

dswitzer2 · May 3, 2024, 8:53pm

Looking more closely, I think our issues are the same. What I’m seeing is that the duplicate entries all have large negative scores (e.g. -9549512000, -4422440400, -9549512000).

I tried applying a min_score to filter out those values, but that has no effect.
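
A client-side check for these symptoms can be sketched as below. It flags a response whose hits contain a negative _score or a repeated (_index, _id) pair, which are the two signs reported in this thread. The function name is hypothetical, and the sample response is trimmed from the output in the first post.

```python
def looks_unnormalized(response):
    """Return True if hits show negative scores or repeated (_index, _id) pairs."""
    seen = set()
    for hit in response["hits"]["hits"]:
        key = (hit["_index"], hit["_id"])
        score = hit.get("_score")
        if score is not None and score < 0:
            return True
        if key in seen:
            return True
        seen.add(key)
    return False


# Trimmed copy of the hits reported in the first post.
sample = {"hits": {"hits": [
    {"_index": "knn-sample-index", "_id": "1", "_score": -9549511700},
    {"_index": "knn-sample-index", "_id": "1", "_score": -4422440400},
    {"_index": "knn-sample-index", "_id": "1", "_score": 1.073},
]}}
print(looks_unnormalized(sample))
```

A check like this only detects the problem; it does not fix it.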

dswitzer2 · May 6, 2024, 2:25pm

I think your issue is that you’re missing a normalization-processor search pipeline. Try adding a temporary search pipeline to your query and see if that resolves the issue:

{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "name": {
              "query": "Hi world"
            }
          }
        },
        {
          "knn": {
            "imageVector": {
              "vector": [1,1,1,1,1],
              "k": 3
            }
          }
        }
      ]
    }
  },
  "search_pipeline" : {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "min_max"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [0.3, 0.7]
            }
          },
          "ignore_failure": false
        }
      }
    ]
  }
}
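
To make the pipeline settings concrete, here is a simplified sketch of what min_max normalization followed by a weighted arithmetic_mean computes. It illustrates the arithmetic only and is not the plugin's implementation: the raw scores are made up, a document missing from one sub-query is assumed to contribute 0 for it, and equal minimum and maximum scores are assumed to normalize to 1.0.

```python
def min_max(scores):
    """Rescale a dict of doc_id -> raw score into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: a single distinct score normalizes to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(per_query_scores, weights):
    """Weighted arithmetic mean of normalized scores, one dict per sub-query."""
    normalized = [min_max(s) for s in per_query_scores]
    docs = set().union(*(n.keys() for n in normalized))
    total_weight = sum(weights)
    return {
        doc: sum(w * n.get(doc, 0.0) for w, n in zip(weights, normalized)) / total_weight
        for doc in docs
    }


# Hypothetical raw scores: sub-query 0 is the match query, sub-query 1 is k-NN.
match_scores = {"2": 3.0, "3": 1.0}
knn_scores = {"1": 1.0, "2": 3.0, "3": 2.0}
combined = combine([match_scores, knn_scores], weights=[0.3, 0.7])
for doc, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(doc, round(score, 3))
```

With the weights [0.3, 0.7] from the example above, the k-NN sub-query dominates the final ranking, and every combined score lands between 0 and 1.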

matt.heimer · July 2, 2024, 11:18pm

We saw the same issue. It turns out there are scenarios that keep the search pipeline from working correctly.

Using search_type=dfs_query_then_fetch causes the duplicate and negative-score issue.

Setting a default pipeline on an index:

PUT /my_index/_settings 
{
  "index.search.default_pipeline" : "my_pipeline"
}

This only works if you query the exact index name. Using a wildcard that matches only a single index, or even an alias, causes the issue again when you rely on index.search.default_pipeline.

Using /my_index/_search?search_pipeline=my_pipeline works, but it isn’t an option for us because we use the Java API, where search_pipeline is not a valid query parameter yet. Hopefully that changes soon.

Including search_pipeline as a sibling of the query in your request (as shown in @dswitzer2's last example) is the best option.
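
A minimal sketch of that approach, for anyone building the request body in code: the helper below adds a temporary search_pipeline next to query, so the request does not depend on the index's default pipeline. The helper name and default weights are assumptions; the query and processor settings are copied from the examples in this thread.

```python
import json


def with_inline_pipeline(query, weights=(0.3, 0.7)):
    """Return a search body with a temporary search_pipeline beside the query."""
    return {
        "query": query,
        "search_pipeline": {
            "phase_results_processors": [
                {
                    "normalization-processor": {
                        "normalization": {"technique": "min_max"},
                        "combination": {
                            "technique": "arithmetic_mean",
                            "parameters": {"weights": list(weights)},
                        },
                    }
                }
            ]
        },
    }


# The hybrid query from the first post.
hybrid_query = {
    "hybrid": {
        "queries": [
            {"match": {"name": {"query": "Hi world"}}},
            {"knn": {"imageVector": {"vector": [1, 1, 1, 1, 1], "k": 3}}},
        ]
    }
}
body = with_inline_pipeline(hybrid_query)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of a _search request against an index name, wildcard, or alias.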
