Search thread pool queue increasing rapidly after 3.1 upgrade — possible fetch-phase or timeout behaviour change

Versions: AWS managed OpenSearch cluster, version 3.1

Describe the issue: After upgrading our OpenSearch domain to v3.1, we started seeing increases in search thread pool queue depth (the ThreadpoolSearchQueue CloudWatch metric).

Before the upgrade (on 2.19), our search queue stayed flat; immediately after, the queue began climbing during periods of higher load, leading to 460 (client closed connection) errors and logs like:

```
Failed to execute phase [fetch], SearchTask was cancelled;
shardFailures { ... OpenSearchRejectedExecutionException: cancelled task with reason: channel closed }
```

All of the logs we can see relate to the fetch phase (not the query phase). Search latencies in CloudWatch looked normal.

The problem seems to be isolated to one large index: ~1.1 TB total, 54 shards (27 primaries / 1 replica). We are using the index for semantic search.

When we disable queries to that index, thread pool metrics and latency both return to normal.
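
For what it's worth, this is roughly how we are correlating the queueing nodes with the shards of that index (a minimal sketch using the opensearch-py client; the endpoint, credentials, and index name are placeholders, not our exact tooling):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth/index.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
)

# Per-node search thread pool stats: which nodes are queueing/rejecting?
print(client.cat.thread_pool(
    thread_pool_patterns="search",
    v=True,
    h="node_name,name,active,queue,rejected,completed",
))

# Shard placement for the large index, to correlate the queueing nodes
# with the nodes that actually hold its shards.
print(client.cat.shards(
    index="my-large-index",
    v=True,
    h="index,shard,prirep,node",
))
```

The point of this is just to confirm that the queueing sits on the data nodes holding that index's shards rather than on the coordinating nodes.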

Configuration:

Cluster config:

  1. search.concurrent_segment_search.mode: none
  2. search_backpressure.node_duress.heap_threshold: 0.85
  3. No explicit cpu_threshold or search.low_level_cancellation set, so both are at their defaults (see the settings check sketched after this list)
  4. 54 shards spread across 3 availability zones
  5. Average shard size ≈ 20 GB
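
To double-check that nothing is overriding these, we pulled the effective values (including defaults) roughly like this (again a sketch with opensearch-py and placeholder endpoint/auth; the setting names are the ones from the list above):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
)

# include_defaults=True also returns values we never set explicitly;
# flat_settings=True gives dotted keys matching the names above.
settings = client.cluster.get_settings(include_defaults=True, flat_settings=True)

for key in (
    "search.concurrent_segment_search.mode",
    "search_backpressure.node_duress.heap_threshold",
    "search_backpressure.node_duress.cpu_threshold",
    "search.low_level_cancellation",
):
    for scope in ("persistent", "transient", "defaults"):
        if key in settings.get(scope, {}):
            print(scope, key, settings[scope][key])
```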

Client config:

  1. Socket timeout: 5s
  2. Connection request timeout: 2s (a rough illustration of these timeouts is sketched after this list)
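
Presumably the 5s socket timeout is what makes the client close the connection while the fetch phase is still running, which would line up with the "cancelled task with reason: channel closed" log lines and the 460s. A rough Python illustration of the timeout settings (sketch only; our real client differs, and the connection request timeout has no direct opensearch-py equivalent):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth/index. "timeout" is the
# read/socket timeout in seconds, i.e. the analogue of our 5s socket timeout.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
    timeout=5,
)

# Temporarily raising the per-request timeout would show whether the
# cancellations and 460s disappear when the client simply waits longer.
client.search(
    index="my-large-index",
    body={"query": {"match_all": {}}},
    request_timeout=30,
)
```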

Questions:

  • Did OpenSearch 3.1 introduce tighter transport timeout or cancellation behavior?
  • Could the new Async Shard Batch Fetch (here) or any other changes explain higher queue utilization?
    • Although it appears that async shard batch fetch relates to shard allocation/cluster management rather than the search fetch phase.
  • Are there any other changes that’d explain the longer fetch times?

Any thoughts would be greatly appreciated. Thanks.

Hey @gvdp,

Given that this is a managed service, please reach out to AWS support to report the issue.

Leeroy.