Search thread pool queue increasing rapidly after 3.1 upgrade — possible fetch-phase or timeout behaviour change

Versions: AWS managed OpenSearch cluster, version 3.1

Describe the issue: After upgrading our OpenSearch domain to v3.1, we started seeing increases in search thread pool queue depth (the ThreadpoolSearchQueue CloudWatch metric).

Before the upgrade (on 2.19), our search queue stayed flat; immediately after, the queue began climbing during periods of higher load, leading to 460 (client closed connection) errors and logs like:

```
Failed to execute phase [fetch], SearchTask was cancelled;
shardFailures { ... OpenSearchRejectedExecutionException: cancelled task with reason: channel closed }
```

All of the logs we can see relate to the fetch phase (not the query phase). Search latencies in CloudWatch looked normal.

The problem seems to be isolated to one large index: ~1.1 TB total, 54 shards (27 primaries / 1 replica). We are using the index for semantic search.

When we disable queries to that index, thread pool metrics and latency both return to normal.
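
For what it's worth, this is roughly how we are correlating the queueing nodes with the shards of that index (a minimal sketch using the opensearch-py client; the endpoint, credentials, and index name are placeholders, not our exact tooling):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth/index.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
)

# Per-node search thread pool stats: which nodes are queueing/rejecting?
print(client.cat.thread_pool(
    thread_pool_patterns="search",
    v=True,
    h="node_name,name,active,queue,rejected,completed",
))

# Shard placement for the large index, to correlate the queueing nodes
# with the nodes that actually hold its shards.
print(client.cat.shards(
    index="my-large-index",
    v=True,
    h="index,shard,prirep,node",
))
```

The point of this is just to confirm that the queueing sits on the data nodes holding that index's shards rather than on the coordinating nodes.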

Configuration:

Cluster config:

  1. search.concurrent_segment_search.mode: none
  2. search_backpressure.node_duress.heap_threshold: 0.85
  3. No explicit cpu_threshold or search.low_level_cancellation set, so both are at their defaults (see the settings check sketched after this list)
  4. 54 shards spread across 3 availability zones
  5. Average shard size ≈ 20 GB
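
To double-check that nothing is overriding these, we pulled the effective values (including defaults) roughly like this (again a sketch with opensearch-py and placeholder endpoint/auth; the setting names are the ones from the list above):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
)

# include_defaults=True also returns values we never set explicitly;
# flat_settings=True gives dotted keys matching the names above.
settings = client.cluster.get_settings(include_defaults=True, flat_settings=True)

for key in (
    "search.concurrent_segment_search.mode",
    "search_backpressure.node_duress.heap_threshold",
    "search_backpressure.node_duress.cpu_threshold",
    "search.low_level_cancellation",
):
    for scope in ("persistent", "transient", "defaults"):
        if key in settings.get(scope, {}):
            print(scope, key, settings[scope][key])
```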

Client config:

  1. Socket timeout: 5s
  2. Connection request timeout: 2s (a rough illustration of these timeouts is sketched after this list)
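
Presumably the 5s socket timeout is what makes the client close the connection while the fetch phase is still running, which would line up with the "cancelled task with reason: channel closed" log lines and the 460s. A rough Python illustration of the timeout settings (sketch only; our real client differs, and the connection request timeout has no direct opensearch-py equivalent):

```python
from opensearchpy import OpenSearch

# Sketch only: placeholder endpoint/auth/index. "timeout" is the
# read/socket timeout in seconds, i.e. the analogue of our 5s socket timeout.
client = OpenSearch(
    hosts=[{"host": "my-domain.example.amazonaws.com", "port": 443}],
    use_ssl=True,
    http_auth=("user", "password"),
    timeout=5,
)

# Temporarily raising the per-request timeout would show whether the
# cancellations and 460s disappear when the client simply waits longer.
client.search(
    index="my-large-index",
    body={"query": {"match_all": {}}},
    request_timeout=30,
)
```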

Questions:

  • Did OpenSearch 3.1 introduce tighter transport timeout or cancellation behavior?
  • Could the new Async Shard Batch Fetch (here) or any other changes explain higher queue utilization?
    • Although it appears that async shard batch fetch relates to shard allocation/cluster management rather than the search fetch phase.
  • Are there any other changes that’d explain the longer fetch times?

Any thoughts would be greatly appreciated. Thanks.

Hey @gvdp,

Given that this is a managed service, please reach out to AWS support to report the issue.

Leeroy.