Versions (relevant - OpenSearch/Dashboard/Server OS/Browser)
2.19.3
Describe the issue:
Hello community,
I am experiencing a severe performance issue where creating a new index pattern in our standby/backup OpenSearch Dashboards causes the backup cluster to query the active/primary datacenter. This behavior generates massive resource consumption on the primary cluster.
Our Environment:
- Primary Cluster: 15 nodes, actively receiving real-time data writes.
- Standby/Backup Cluster: 5 nodes, syncing data from primary using Cross-Cluster Replication (CCR).
Our CCR remote configuration looks like this:
"remote": {
  "proxy-to-standby": {
    "mode": "proxy",
    "server_name": "XXX",
    "transport": {
      "compress": "true"
    },
    "proxy_address": "XXX:9300"
  }
}
The Problem
When a user attempts to create a new index pattern in OpenSearch Dashboards on the standby cluster, the system freezes. Investigating with the _tasks API, we found that this triggers a large number of search queries targeting all indices (indices[*]).
Because of the cross-cluster connection, these queries are propagated back to the primary cluster. Although the query source explicitly specifies "timeout":"30000ms", the timeout appears to be ignored and the tasks keep running (some for more than 27 minutes according to the _tasks output) until I killed them with: POST _tasks/_cancel?actions=*search*
action,task_id,parent_task_id,type,start_time,timestamp,running_time,ip,node
indices:data/read/search,PiKX...hbQ:766080651,-,transport,1781788220483,13:10:20,27.9m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766087466,-,transport,1781788246090,13:10:46,27.5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088424,-,transport,1781788250485,13:10:50,27.4m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088586,-,transport,1781788251528,13:10:51,27.4m,XXX.XXX.XXX.XXX,XXX-node
(Note: There are dozens of these identical tasks eating up all primary cluster resources).
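To make the listing above easier to triage, here is a minimal Python sketch that parses the comma-separated _tasks output and picks out search tasks that have run longer than a chosen limit. The function names (running_seconds, overdue_searches) and the set of unit suffixes accepted for the running_time column are my own assumptions for illustration, not part of any OpenSearch client.

```python
import csv
import io

# Rows copied from the _tasks output in this post (node IDs are truncated there).
CAT_TASKS = """\
action,task_id,parent_task_id,type,start_time,timestamp,running_time,ip,node
indices:data/read/search,PiKX...hbQ:766080651,-,transport,1781788220483,13:10:20,27.9m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766087466,-,transport,1781788246090,13:10:46,27.5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088424,-,transport,1781788250485,13:10:50,27.4m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088586,-,transport,1781788251528,13:10:51,27.4m,XXX.XXX.XXX.XXX,XXX-node
"""

# Assumed unit suffixes for running_time; longer suffixes come first
# so that "ms" and "micros" are matched before the bare "s".
_UNITS = [
    ("micros", 1e-6),
    ("nanos", 1e-9),
    ("ms", 1e-3),
    ("s", 1.0),
    ("m", 60.0),
    ("h", 3600.0),
    ("d", 86400.0),
]


def running_seconds(text):
    """Convert a value such as '27.9m' or '350ms' to seconds."""
    for suffix, factor in _UNITS:
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised running_time value: {text!r}")


def overdue_searches(cat_output, limit_seconds):
    """Return (task_id, seconds) for search tasks running longer than the limit."""
    rows = csv.DictReader(io.StringIO(cat_output))
    overdue = []
    for row in rows:
        if row["action"] != "indices:data/read/search":
            continue
        seconds = running_seconds(row["running_time"])
        if seconds > limit_seconds:
            overdue.append((row["task_id"], seconds))
    return overdue


if __name__ == "__main__":
    # 30 seconds mirrors the "timeout":"30000ms" in the query source.
    for task_id, seconds in overdue_searches(CAT_TASKS, 30.0):
        print(f"{task_id} has been running for {seconds:.0f}s")
```

With a list like this, individual tasks could be cancelled by ID with POST _tasks/<task_id>/_cancel rather than with the broader actions=*search* filter, which also cancels unrelated searches.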
Individual Task Details
Here is the JSON payload of one of these stuck tasks. You can see it targets indices[*] and has a 30000ms timeout, yet its running_time_in_nanos indicates it has been running way past its threshold without being cancelled:
{
  "completed": false,
  "task": {
    "node": "PiKX_fELQHusiepq5RLhbQ",
    "id": 766080651,
    "type": "transport",
    "action": "indices:data/read/search",
    "description": "indices[*], search_type[QUERY_THEN_FETCH], source[{\"size\":0,\"timeout\":\"30000ms\",\"track_total_hits\":2147483647,\"aggregations\":{\"indices\":{\"terms\":{\"field\":\"_index\",\"size\":200,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}}}}]",
    "start_time_in_millis": 1781788220483,
    "running_time_in_nanos": 1730040465446,
    "cancellable": true,
    "cancelled": false,
    "headers": {
      "X-Opaque-Id": "b24277ed-011f-4bc9-bd28-1a337f36c396"
    },
    "resource_stats": {
      "average": {
        "cpu_time_in_nanos": 52226,
        "memory_in_bytes": 12500
      },
      "total": {
        "cpu_time_in_nanos": 11907625,
        "memory_in_bytes": 2850152
      },
      "min": {
        "cpu_time_in_nanos": 1763,
        "memory_in_bytes": 360
      },
      "max": {
        "cpu_time_in_nanos": 1634635,
        "memory_in_bytes": 730488
      },
      "thread_info": {
        "thread_executions": 228,
        "active_threads": 0
      }
    }
  }
}
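To put numbers on the gap between the requested timeout and the actual runtime, here is a small Python sketch that reads the relevant fields from the task payload above. The summarize helper is hypothetical, and the description string is abridged to the part that carries the timeout; all numeric values are copied from the payload.

```python
import re

# Fields copied from the task payload in this post; the description is
# abridged and the remaining fields are omitted.
TASK = {
    "description": (
        'indices[*], search_type[QUERY_THEN_FETCH], '
        'source[{"size":0,"timeout":"30000ms","track_total_hits":2147483647}]'
    ),
    "running_time_in_nanos": 1730040465446,
    "resource_stats": {"total": {"cpu_time_in_nanos": 11907625}},
}


def summarize(task):
    """Compare the timeout in the query source with the task's wall-clock and CPU time."""
    match = re.search(r'"timeout":"(\d+)ms"', task["description"])
    timeout_s = int(match.group(1)) / 1000 if match else None
    running_s = task["running_time_in_nanos"] / 1e9
    cpu_ms = task["resource_stats"]["total"]["cpu_time_in_nanos"] / 1e6
    return {
        "timeout_s": timeout_s,
        "running_s": running_s,
        "cpu_ms": cpu_ms,
        # How many times longer than the timeout the task has been running.
        "overrun_factor": running_s / timeout_s if timeout_s else None,
    }


if __name__ == "__main__":
    for key, value in summarize(TASK).items():
        print(f"{key}: {value}")
```

For this payload the sketch reports roughly 1730 seconds of wall-clock time against a 30 second timeout, with about 12 ms of total CPU time recorded on the task itself. That may suggest this particular task spends most of its time waiting rather than computing, but that is only my reading of the numbers, not something the payload confirms.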
As a result, CPU utilization on our primary cluster master/nodes instantly spikes to 100% (Busy User) and stays maxed out for roughly 30 minutes, until the tasks finally clear or are cancelled manually.
Here is a metric screenshot showing the exact timeframe when the index pattern creation was initiated:
As you can see, the CPU gets completely saturated by user-space processes (Busy User maxing out at 99.9%) for a prolonged period.
Questions:
- Why does creating an index pattern in Dashboards trigger a remote indices[*] scan that breaks out of the local standby cluster context?
- Why is the 30000ms timeout defined in the query source completely ignored by the task coordinator/worker threads?
- Is there a known workaround (e.g., specific Dashboards settings or cluster settings) to prevent index pattern creation from querying remote/replicated clusters?
Any help or insights would be greatly appreciated. Thank you!





