High Latency in KNN queries in v3.1

Versions - OpenSearch v2.19 and v3.1 (AWS managed cluster)

Describe the issue:

We migrated from OpenSearch v2.19 to v3.1 and changed from single-shard-per-tenant to multi-shard-per-tenant. Earlier each query hit one shard; now it hits multiple shards (scatter-gather).
The query is a k-NN query combined with a has_parent join in the filter.

For the same query, shard 2 finishes in under 3s while shard 9 takes 15–18s.

We set index.knn.advanced.filtered_exact_search_threshold to -1 to disable exact search fallback, but the latency difference remains. Does setting this to -1 fully prevent exact search in all cases, or can exact search still run internally?

We saw issue #2249 mentioning NativeEngineKnnVectorQuery performing an exact search in create_weight for nested fields. In 3.1, is this expected even when filtered_exact_search_threshold is disabled and also for joins?

Could the 3.1 native engine changes (compared to the 2.19 plugin behavior) explain why one shard is very slow and the others are also comparatively slow, when previously the same queries took under 200 ms?

Configuration:

OpenSearch 3.1 on AWS managed cluster
Faiss engine for k-NN
Nested field with parent-child relationship
Multi-shard per tenant
index.knn.advanced.filtered_exact_search_threshold set to -1

Relevant Logs or Screenshots:

Query profile shows shard 2 completing in under 3 seconds while shard 9 takes 15–18 seconds, with most of the time spent in create_weight / NativeEngineKnnVectorQuery execution.

https://docs.opensearch.org/latest/vector-search/filter-search-knn/efficient-knn-filtering/
https://github.com/opensearch-project/k-NN/issues/2936

The query has the following form in both cases:

{
 "docvalue_fields": [
  "child_object_source_id"
 ],
 "query": {
  "knn": {
   "knn-field-name": {
    "filter": {
     "bool": {
      "filter": [
       {
        "has_parent": {
         "parent_type": "parent",
         "query": {
          "bool": {
           "filter": [
            {
             "bool": {
              "minimum_should_match": 1,
              "should": [
               {
                "term": {
                 "tenant_id": "tenant-id"
                }
               },
               ... some more term filters
              ]
             }
            },
            {
             "bool": {
              "filter": [
               {
                "bool": {
                 "minimum_should_match": 1,
                 "should": [
                  {
                   "term": {
                    "object_type": {
                     "value": "parent-object-type"
                    }
                   }
                  }
                 ]
                }
               }
              ]
             }
            }
           ]
          }
         }
        }
       },
       {
        "bool": {
         "filter": [
          {
           "bool": {
            "minimum_should_match": 1,
            "should": [
             {
              "term": {
               "object_type": {
                "value": "child-object-type"
               }
              }
             }
            ]
           }
          }
         ]
        }
       },
       {
        "term": {
         "child_source_object_type": {
          "value": "parent-object-type"
         }
        }
       },
       {
        "term": {
         "object_type": {
          "value": "child-object-type"
         }
        }
       }
      ]
     }
    },
    "k": 10,
    "rescore": {
     "oversample_factor": 3
    },
    "vector": [...]
   }
  }
 },
 "sort": []
}

Profile 1 - https://traff.co/9VVpExCV

Profile 2 - https://traff.co/CAANYZIk

The other links expired.

@Frey0-0 Looking at the code, it seems that setting filtered_exact_search_threshold: -1 only disables one of three paths to exact search.

Looking at KNNWeight.java:

Path 1: Threshold-based (what -1 disables).

private boolean isFilteredExactSearchPreferred(final int filterIdsCount) {
    if (filterWeight == null) return false;
    int filterThresholdValue = KNNSettings.getFilteredExactSearchThreshold(...);

    if (isFilterIdCountLessThanK(filterIdsCount)) return true;  // still active

    if (isExactSearchThresholdSettingSet(filterThresholdValue)) { ... }  // -1 disables this

    return isMaxDistCompGreaterThanEstimatedDistComp(filterIdsCount);   // still active
}

Path 2: Filter count ≤ K (always active, ignores threshold).

Setting threshold to -1 does not disable this. If the has_parent filter matches fewer docs than k, exact search fires unconditionally.

Path 3: ANN returns < K results (post-ANN fallback).

private boolean isExactSearchRequire(...) {
    if (isFilteredExactSearchRequireAfterANNSearch(filterIdsCount, annResultCount)) {
        if (KNNSettings.isKnnIndexFaissEfficientFilterExactSearchDisabled(...)) return false;
        return true;  // fallback still happens
    }
    ...  // other conditions elided
}

To fully block all fallback paths, you would need both:

PUT /my-index/_settings
{
  "index.knn.advanced.filtered_exact_search_threshold": -1,
  "index.knn.faiss.efficient_filter.disable_exact_search": true
}

Even then, Path 2 (filter count ≤ k) does not appear to be skipped, as it’s hardcoded.
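To make the three paths concrete, here is a minimal Python sketch of the decision logic (simplified and paraphrased from KNNWeight.java; the names, signatures, and defaults are illustrative, not the actual plugin code):

```python
# Sketch of the three paths to exact search described above.
# Simplified from KNNWeight.java; names and defaults are illustrative only.

def is_filtered_exact_search_preferred(filter_ids_count, k,
                                       threshold=-1,
                                       max_dist_comp_exceeds_estimate=False):
    """Pre-ANN decision: skip ANN and go straight to exact search?"""
    # Path 2: filter matches fewer docs than k -- always fires, ignores threshold.
    if filter_ids_count <= k:
        return True
    # Path 1: threshold-based -- disabled when the setting is -1.
    if threshold > 0 and filter_ids_count <= threshold:
        return True
    # Cost-estimate check: still active even with threshold = -1.
    return max_dist_comp_exceeds_estimate

def is_exact_search_required_after_ann(filter_ids_count, ann_result_count, k,
                                       efficient_filter_exact_search_disabled=False):
    """Path 3: post-ANN fallback when ANN returned fewer than k results."""
    if filter_ids_count > ann_result_count and ann_result_count < k:
        return not efficient_filter_exact_search_disabled
    return False

# With threshold=-1, Path 2 still triggers for small filter sets:
print(is_filtered_exact_search_preferred(filter_ids_count=5, k=10))        # True
# Large filter set, threshold disabled: ANN runs.
print(is_filtered_exact_search_preferred(filter_ids_count=600_000, k=10))  # False
# ANN under-delivered and the disable setting is off: fallback fires.
print(is_exact_search_required_after_ann(600_000, ann_result_count=4, k=10))  # True
```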

Regarding the nested-field exact search issue: it appears to still be present in v3.1.

The nested expansion step always runs exact search. When expandNestedDocs is true (which it is for any nested field query), NativeEngineKnnVectorQuery always performs an exact search pass after ANN to retrieve all sibling documents of the top-K winners. This has nothing to do with compression or rescore settings; it is unconditional for nested queries. This was true before v3.1 and remains true in v3.1.

In v3.1, rescore adds a second exact search on top of that. If your index uses compression (4x or higher), a rescore phase now runs between the ANN search and the nested expansion. This rescore phase also calls getAllSiblings() when a parentsFilter is present, which expands to sibling docs and exact-searches them too.

So in v3.1 the execution for a nested KNN query with compression is:

  1. ANN search with an upsampled K (3x–5x your actual K)
  2. Rescore: expand to all sibling docs → exact search (new in v3.1)
  3. Nested expansion: expand to all sibling docs → exact search again (was already there)

Therefore, setting "rescore": false in the query eliminates step 2, but step 3 always runs for nested fields. There is no setting that disables the nested-expansion exact search; it is a fundamental part of how parent-child KNN works, because the engine has no other way to score all the child documents belonging to each matched parent.
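The three-step pipeline above can be sketched as a rough cost model. This is purely illustrative: the per-step counts and the oversample factor are assumptions taken from this discussion, not plugin constants.

```python
def nested_knn_distance_computations(k, oversample_factor, avg_children_per_parent,
                                     rescore_enabled=True):
    """Rough count of exact distance computations after the ANN phase
    for a nested k-NN query in v3.1 (toy model, not real plugin code)."""
    comps = 0
    upsampled_k = k * oversample_factor
    if rescore_enabled:
        # Step 2: rescore expands the upsampled candidates to all their siblings.
        comps += upsampled_k * avg_children_per_parent
    # Step 3: nested expansion always runs on the top-k parents.
    comps += k * avg_children_per_parent
    return comps

# k=10, oversample 3x, 500 children per parent:
print(nested_knn_distance_computations(10, 3, 500))                          # 20000
# With "rescore": false, only the nested expansion (step 3) remains:
print(nested_knn_distance_computations(10, 3, 500, rescore_enabled=False))   # 5000
```

Under this model, disabling rescore removes the largest exact-search pass, but the step-3 cost still scales linearly with the number of children per matched parent.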

Q. Do the native engine changes between v2.19 and v3.1 explain the shard latency variance?

In v2.19, every shard did the same amount of work. ANN search is fast and consistent: it traverses the HNSW graph and returns K results regardless of how many total documents are in the shard.

In v3.1, the amount of work per shard now depends on how many documents match your filter. Because of the new rescore + nested expansion exact search, each shard has to compare your query vector against every document that passes the filter, not just K documents. If one shard has 500 matching docs and another has 50,000 matching docs, the second shard does 100× more work.

Your shards don’t have equal data distribution: some shards likely hold more documents for certain tenants or object types (given the has_parent + tenant filter in the query). In v2.19 this didn’t matter because ANN ignored it. In v3.1 it matters a lot because exact search is linear; the slow shards are simply the ones with more data matching your filter.

Hope this helps

Hey Anthony,

Thank you for the detailed response!

I had one more doubt regarding this:

Hey, can you clarify this statement?
This would only happen during the exact search scenario, right?
If exact search is happening all the time, then why are we even building the graph?
Could we always just do an exact search in the case of parent-child/nested queries?

One more point: after the filter step, both shards returned the same number of documents (approx. 600k); the only difference is the shard sizes themselves.

Also, if the "expand to sibling documents → exact search" step was already there, why wasn’t it time-consuming in v2.19? I didn’t understand this part exactly.
v3.1 just adds a rescore step which does the same thing again.

So in the ideal scenario, let’s say I get > K documents in Path 2 and Path 3; would this change affect latency in v3.1?

Hey @beerus_25 see the following replies based on my understanding:

“This would only happen during exact search scenario right?”

No. The nested sibling expansion and exact search runs unconditionally for any nested field query - it doesn’t matter whether the earlier ANN search triggered an exact search fallback or not. It’s a separate, always-on step specifically for nested docs. The code checks expandNestedDocs (a structural property of the query), not whether exact search happened.

“If exact search happens regardless, why build the HNSW graph?”

The HNSW graph is still doing the heavy lifting - it’s just doing a different job than you might think. Without it, you’d have to exact-search every single document in the index (potentially millions). What HNSW gives you is a small, tight candidate list - say 10-50 parent documents out of millions - and then you exact-search just their children. So HNSW isn’t avoiding exact search for nested queries, it’s dramatically shrinking the set of documents that exact search has to deal with.

“This would mean the time-consuming part already existed in v2.19 but it seems not.”

Correct - the retrieveAll exact search step existed in v2.19 too. The reason it wasn’t slow before is that in v2.19, HNSW returned exactly K candidates, so you’d expand the siblings of only K parents and exact-search that small set. In v3.1, the rescore step runs before retrieveAll with an upsampled K (3x–5x your actual K), so you’re expanding and exact-searching a much larger candidate set in the rescore phase. Then retrieveAll runs again on top of that. So v2.19 had one small exact search, v3.1 has two - and the first one is deliberately larger.

“Also v3.1 is adding a rescore step, which seems to be doing similar operations.”

Exactly right. The rescore step in v3.1 also calls getAllSiblings + exactSearch - it’s nearly identical to what retrieveAll does. So it would appear that for nested queries with compression enabled, you are running the sibling expansion + exact search twice per query.

“Latency would improve in scenarios where filtered results exceed K documents across all paths right?”

Partially. The filter-triggered exact search fallback (in KNNWeight) is indeed avoided when filtered results > K. With 600k matching docs and a typical K of 10, that fallback never triggers. So you’re not being hurt by that. What is hurting you is the rescore + retrieveAll path, which always runs for nested queries regardless of how many docs match the filter.

“Both shards returned ~600k matching documents after filtering, with differences in individual shard sizes.”

This is the key clue for why latency differs between shards. With 600k filtered docs on both shards, the filter-based exact search fallback is ruled out as the cause of variance. The difference is almost certainly the number of children per parent document across shards. When HNSW picks its K winners and the code expands to siblings, a shard where each parent has 500 children returns far more docs to exact-search than a shard where each parent has 50 children. The total filtered count being equal doesn’t mean the sibling density is equal - and sibling density is what drives the exact search cost.
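To make the sibling-density point concrete, here is a toy comparison under assumed numbers (K = 10 HNSW winners per shard, with different children-per-parent ratios; both shards can still have the same total filtered count):

```python
def sibling_expansion_cost(k_winners, children_per_parent):
    """Number of docs that get exact-searched after sibling expansion (toy model)."""
    return k_winners * children_per_parent

# Same filtered count on both shards, very different expansion cost:
print(sibling_expansion_cost(10, 50))    # 500 docs exact-searched
print(sibling_expansion_cost(10, 500))   # 5000 docs exact-searched
```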

Thanks again for the detailed response, it really helped us understand what is happening under the hood.

This might not be true for us, as in this specific use case there is a 1:1 mapping between parent and child, so we are likely to get the same sibling count.

But overall I got your point about exact search being slower due to upsampling, so until we are able to remove nested queries we can always be hit by these scenarios.

Also, is there any reference document, GitHub issue, or RFC explaining why the decision was made to make the retrieveAll phase deliberately larger? Is it to improve recall?

@beerus_25 I don’t believe there is an RFC; however, there are comments in the source code that explain the logic behind the changes.

It really boils down to a deliberate recall vs latency tradeoff: returning all children of matched parents (exact search on all siblings) gives correct, complete results. The alternative, Lucene’s DiversifyingChildren query, only returns the single best child per parent, which gives worse recall for use cases where you want all nested matches.