Hybrid Query Explanation: Only a single min_max_normalization block

Versions:

OpenSearch 3.4.0 (Docker)

Dashboard 3.4.0 (Docker)

Describe the issue:

When running a hybrid query composed of a query_string and a Lucene knn query (cosine similarity) in explain mode, I get the following structure:

{
  "_explanation": {
    "value": 1,
    "description": "arithmetic_mean, weights [0.3, 0.7] combination of:",
    "details": [
      {
        "value": 1,
        "description": "min_max normalization of:",
        "details": [
          {
            "value": 3.8993959426879883,
            "description": "combined score of:",
            "details": [
              {
                "value": 3.899396,
                "description": "weight(name:wind in 10513) [PerFieldSimilarity], result of:",
                "details": []
              },
              {
                "value": 0.78645265,
                "description": "within top 10 docs",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}


Note that there is a single “min_max normalization of” block, but I’d expect two: one for each sub-query score, which are then combined. This differs from the example in the docs:

{
  "_explanation": {
    "value": 0.9251075,
    "description": "arithmetic_mean combination of:",
    "details": [
      {
        "value": 1.0,
        "description": "min_max normalization of:",
        "details": []
      },
      {
        "value": 0.8503647,
        "description": "min_max normalization of:",
        "details": [
          {
            "value": 0.015177966,
            "description": "within top 5",
            "details": []
          }
        ]
      }
    ]
  }
}

The docs use a different hybrid query composed of a match and a neural query (embeddings are calculated by OpenSearch). In my case, the embeddings are calculated externally and are then provided with the query.

Does this mean that the scores of hybrid queries with a knn part are not calculated in the usual way? The usual way being (see also this post):

  • min-max normalization of each hybrid query part’s score (to the 0–1 range)
  • combination of these normalized scores into the overall score (weighted arithmetic mean: multiply each query part’s normalized score by its respective weight and sum up)

If the score is calculated differently for knn query parts (i.e. both scores are summed up first and then normalized), wouldn’t that mean the knn score is underrepresented compared to the text query part’s score, since cosine similarity in OpenSearch ranges from 0 to 1 even before normalization (see docs)? If so, is there a way to enforce a different behavior?
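To make the concern concrete, here is a minimal sketch in Python of both variants, using invented scores and the weights from my pipeline; none of the numbers are taken from the actual explain output:

def min_max(scores):
    """Min-max normalize a list of scores into the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

# Invented raw scores for three documents returned by both sub-queries.
bm25_scores   = [3.9, 1.2, 0.4]      # text branch (unbounded BM25 scores)
cosine_scores = [0.79, 0.71, 0.62]   # knn branch (already within [0, 1])
weights = [0.3, 0.7]                 # as configured in the pipeline above

# Variant A (documented behaviour): normalize each sub-query's scores first,
# then combine them with the weighted arithmetic mean.
norm_bm25   = min_max(bm25_scores)
norm_cosine = min_max(cosine_scores)
combined_a = [weights[0] * b + weights[1] * c
              for b, c in zip(norm_bm25, norm_cosine)]

# Variant B (the concern): sum the raw scores first, then normalize once.
summed     = [b + c for b, c in zip(bm25_scores, cosine_scores)]
combined_b = min_max(summed)

print(combined_a)  # knn branch contributes on the same 0-1 scale as the text branch
print(combined_b)  # ranking is dominated by the unbounded BM25 scores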

Configuration:

Mapping (knn part):

        "embedding": {
          "type": "knn_vector",
          "dimension": 256,
          "method": {
            "engine": "lucene",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {
              "ef_construction": 128,
              "m": 16,
              "encoder": {
                "name": "sq",
                "parameters": {
                  "confidence_interval": 0.9
                }
              }
            }
          }
        }

Normalization pipeline:
PUT /_search/pipeline/nlp-search-pipeline

{
  "description": "Post processor for hybrid search with custom weights",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.3, 0.7]
          }
        }
      }
    }
  ],
  "response_processors": [
    {
      "hybrid_score_explanation": {}
    }
  ]
}

Query:

GET myindices/_search?search_pipeline=nlp-search-pipeline&explain=true

{
  "fields": ["name"],
  "_source": {
    "excludes": ["*"]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "query_string": {
            "query": "wind",
            "default_field": "name",
            "_name": "text_branch"
          }
        },
        {
          "knn": {
            "embedding": {
              "vector": [...],
              "k": 10,
              "_name": "vector_branch"
            }
          }
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}

Relevant Logs or Screenshots:

Hi @tobe,

Thank you for the detailed report with the query, pipeline configuration, and explain output. We’ve been looking into this.

To answer your questions:

1. There is no special handling of knn queries in the hybrid query explain path. The hybrid query uses the same mechanism for all sub-queries: each sub-query’s Weight produces an Explanation via Lucene’s standard explain() method, and these are assembled into the hybrid explanation. The knn query’s explain goes through the same code path as the query_string explain; there is no differentiation or special-casing for knn.

2. Regarding knn scores and normalization: You’re correct that cosine similarity scores are already in the [0, 1] range. However, min-max normalization is still applied across the result set — it normalizes relative to the min and max scores within each sub-query’s result set, not the theoretical bounds. So even cosine similarity scores will be re-scaled based on the actual min/max observed in the results.
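As a quick illustration with invented cosine scores (not taken from your results), scores that are already within [0, 1] are still stretched to the full range based on the observed minimum and maximum within the sub-query’s result set:

scores = [0.79, 0.71, 0.62]          # cosine similarity, already within [0, 1]
lo, hi = min(scores), max(scores)
normalized = [(s - lo) / (hi - lo) for s in scores]
print(normalized)                    # [1.0, 0.529..., 0.0] -- stretched to the full 0-1 range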

3. We have identified an area in the code that may be causing the behavior you’re seeing. The explanation response processor has logic that filters explanation details for sub-queries where a document has a zero raw score (i.e., the document was not found by that particular sub-query). This is by design for cases where a document matches one sub-query but not another, but it may be producing confusing output when combined with the combination description that lists weights for all sub-queries.

We are actively investigating this further. To help us narrow down the root cause, could you help us with the following:

  1. Is the _explanation output in your post from a single document, or did you combine parts from different hits? Specifically, the “combined score of:” section (showing both weight(name:wind) and within top 10 docs) — is that from the same document as the arithmetic_mean, weights [0.3, 0.7] combination of: section that shows only one normalization block?

  2. Could you run the same query with explain=true but WITHOUT the hybrid_score_explanation response processor? This means using a search pipeline with only the normalization-processor (removing the response_processors section). This will show us the raw query-level explain output before the explanation processor transforms it.

  3. Do ALL hits in your response show only one normalization block, or do some hits show two? If some show two, could you share one of those as well?

  4. How many documents does your index have in total?

This information will help us determine whether this is a display issue in the explanation processor, or something deeper in how knn sub-query scores are being handled.

Thanks, Martin

Hi @martin.g,

Thanks for looking into this and for your answers. Regarding answer 3, I think the example I provided above shows a match for both sub-queries; otherwise, I figured the knn part would say “not in top k docs”.

Here is the further information you requested:

  1. The _explanation output as provided above is from one single hit.

  2. I re-created the search pipeline with only the normalization-processor (no response_processors section):

PUT /_search/pipeline/nlp-search-pipeline

{
  "description": "Post processor for hybrid search with custom weights",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.3, 0.7]
          }
        }
      }
    }
  ]
}

which returns

{
  "_explanation": {
    "value": 4.6858487,
    "description": "sum of:",
    "details": [
      {
        "value": 3.8993959426879883,
        "description": "combined score of:",
        "details": [
          {
            "value": 3.899396,
            "description": "weight(name:wind in 10513) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 3.899396,
                "description": "score(freq=1.0), computed as boost * idf * tf from:",
                "details": [
                  {
                    "value": 6.6268253,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                      {
                        "value": 18,
                        "description": "n, number of documents containing term",
                        "details": []
                      },
                      {
                        "value": 13968,
                        "description": "N, total number of documents with field",
                        "details": []
                      }
                    ]
                  },
                  {
                    "value": 0.58842593,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                      {
                        "value": 1,
                        "description": "freq, occurrences of term within document",
                        "details": []
                      },
                      {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                      },
                      {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                      },
                      {
                        "value": 3,
                        "description": "dl, length of field",
                        "details": []
                      },
                      {
                        "value": 6.759307,
                        "description": "avgdl, average length of field",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 0.78645265,
            "description": "within top 10 docs",
            "details": []
          }
        ]
      },
      {
        "value": 0,
        "description": "match on required clause, product of:",
        "details": [
          {
            "value": 0,
            "description": "# clause",
            "details": []
          },
          {
            "value": 1,
            "description": "FieldExistsQuery [field=_primary_term]",
            "details": []
          }
        ]
      }
    ]
  }
}

There seems to be an additional JSON object in the top-level details array.
Note that this is not the same hit as in the example above (doing this manually makes it hard to keep track, see below).

  3. For ten results, there is just one min_max normalization block per hit. I tried this with k = 100, and I count as many min_max blocks as hits (using the original search pipeline).

  4. I initially tried this on a single small index with 13’968 docs and then re-ran the same query over all indices (22 indices in total; sizes differ per index, as this corresponds to sources or providers in our ETL pipeline) with 4’895’598 docs in total, using the wildcard pattern *. I do not see any difference in the response’s structure.

This is what I can provide for now. I did this using the dev console; for a more systematic approach, I’d need to put the steps in a script so this could be run automatically.

Let me know if this helps or additional details are needed. Thank you very much.

PS: Actually, I observed this already last year, but then all of a sudden I got the structure I expected; see Hybrid Score Explain Output's Structure Diverges from Docs - #2 by tobe. I cannot explain this change in behaviour.

Usually, you only get two normalization blocks if the document is returned by both the keyword search and the k-NN search. If a document only appears in one of the result sets (e.g., it wasn’t in the top K for the vector search), the explanation often simplifies or omits the second block because there is nothing to combine.