Cross-Cluster Replication: Creating index pattern in standby cluster triggers un-killable, long-running search queries on primary cluster (indices[*])

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser)

2.19.3

Describe the issue:

Hello community,

I am experiencing a severe performance issue where creating a new index pattern in our standby/backup OpenSearch Dashboards causes the backup cluster to query the active/primary datacenter. This behavior generates massive resource consumption on the primary cluster.

Our Environment:

  • Primary Cluster: 15 nodes, actively receiving real-time data writes.

  • Standby/Backup Cluster: 5 nodes, syncing data from primary using Cross-Cluster Replication (CCR).

Our CCR remote configuration looks like this:

"remote": {
  "proxy-to-standby": {
    "mode": "proxy",
    "server_name": "XXX",
    "transport": {
      "compress": "true"
    },
    "proxy_address": "XXX:9300"
  }
}

The Problem

When a user attempts to create a new index pattern in OpenSearch Dashboards on the standby cluster, the system freezes. Investigating with the _tasks API, we found that it triggers a large number of search queries targeting all indices (indices[*]).

Because of the cross-cluster connection, these queries are propagated back to the primary cluster. Even though the query source explicitly specifies "timeout":"30000ms", the timeout is ignored and the tasks keep running (some ran for over 27 minutes according to the _tasks output) until I killed them with: POST _tasks/_cancel?actions=*search*

action,task_id,parent_task_id,type,start_time,timestamp,running_time,ip,node
indices:data/read/search,PiKX...hbQ:766080651,-,transport,1781788220483,13:10:20,27.9m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766087466,-,transport,1781788246090,13:10:46,27.5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088424,-,transport,1781788250485,13:10:50,27.4m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088586,-,transport,1781788251528,13:10:51,27.4m,XXX.XXX.XXX.XXX,XXX-node

(Note: There are dozens of these identical tasks eating up all primary cluster resources).
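
A listing like the one above can be filtered mechanically when there are dozens of rows. The following is a minimal sketch, assuming the listing is comma-separated with the header shown; the helper names running_seconds and long_running_searches are made up for illustration and are not part of any OpenSearch client.

```python
import csv
import io

# Rows copied from the task listing above (IP and node name are redacted in the post).
TASKS_CSV = """action,task_id,parent_task_id,type,start_time,timestamp,running_time,ip,node
indices:data/read/search,PiKX...hbQ:766080651,-,transport,1781788220483,13:10:20,27.9m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766087466,-,transport,1781788246090,13:10:46,27.5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088424,-,transport,1781788250485,13:10:50,27.4m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:766088586,-,transport,1781788251528,13:10:51,27.4m,XXX.XXX.XXX.XXX,XXX-node
"""

# "ms" is checked before "s" so that a millisecond value is not read as seconds.
UNIT_SECONDS = [("ms", 0.001), ("s", 1.0), ("m", 60.0), ("h", 3600.0), ("d", 86400.0)]


def running_seconds(value):
    """Convert a duration such as '27.9m' or '450ms' to seconds."""
    for suffix, factor in UNIT_SECONDS:
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised duration: {value!r}")


def long_running_searches(listing, min_seconds):
    """Return task IDs of search tasks that have run for at least min_seconds."""
    rows = csv.DictReader(io.StringIO(listing))
    return [
        row["task_id"]
        for row in rows
        if row["action"] == "indices:data/read/search"
        and running_seconds(row["running_time"]) >= min_seconds
    ]


# Every listed task has run far longer than the 30 s timeout in the query source.
print(long_running_searches(TASKS_CSV, 30.0))
```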

Individual Task Details

Here is the JSON payload of one of these stuck tasks. You can see it targets indices[*] and has a 30000ms timeout, yet its running_time_in_nanos indicates it has been running way past its threshold without being cancelled:

{
  "completed": false,
  "task": {
    "node": "PiKX_fELQHusiepq5RLhbQ",
    "id": 766080651,
    "type": "transport",
    "action": "indices:data/read/search",
    "description": "indices[*], search_type[QUERY_THEN_FETCH], source[{\"size\":0,\"timeout\":\"30000ms\",\"track_total_hits\":2147483647,\"aggregations\":{\"indices\":{\"terms\":{\"field\":\"_index\",\"size\":200,\"min_doc_count\":1,\"shard_min_doc_count\":0,\"show_term_doc_count_error\":false,\"order\":[{\"_count\":\"desc\"},{\"_key\":\"asc\"}]}}}}]",
    "start_time_in_millis": 1781788220483,
    "running_time_in_nanos": 1730040465446,
    "cancellable": true,
    "cancelled": false,
    "headers": {
      "X-Opaque-Id": "b24277ed-011f-4bc9-bd28-1a337f36c396"
    },
    "resource_stats": {
      "average": {
        "cpu_time_in_nanos": 52226,
        "memory_in_bytes": 12500
      },
      "total": {
        "cpu_time_in_nanos": 11907625,
        "memory_in_bytes": 2850152
      },
      "min": {
        "cpu_time_in_nanos": 1763,
        "memory_in_bytes": 360
      },
      "max": {
        "cpu_time_in_nanos": 1634635,
        "memory_in_bytes": 730488
      },
      "thread_info": {
        "thread_executions": 228,
        "active_threads": 0
      }
    }
  }
}
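
To put the numbers in that payload side by side, here is a small worked calculation. It only uses values copied from the JSON above; the variable names are illustrative.

```python
import re

# Values copied from the task JSON above. The description is shown here in its
# JSON-decoded form, i.e. without the backslash escapes.
DESCRIPTION_SNIPPET = '"size":0,"timeout":"30000ms","track_total_hits":2147483647'
RUNNING_TIME_IN_NANOS = 1730040465446
TOTAL_CPU_TIME_IN_NANOS = 11907625


def timeout_ms(description):
    """Extract the millisecond timeout from the task description, if present."""
    match = re.search(r'"timeout":"(\d+)ms"', description)
    return int(match.group(1)) if match else None


requested_s = timeout_ms(DESCRIPTION_SNIPPET) / 1000   # 30 s requested
running_s = RUNNING_TIME_IN_NANOS / 1e9                # about 1730 s so far
overrun_factor = running_s / requested_s               # wall time vs. requested timeout
cpu_s = TOTAL_CPU_TIME_IN_NANOS / 1e9                  # CPU attributed to this one task

print(f"running for {running_s / 60:.1f} min, {overrun_factor:.0f}x the requested timeout")
print(f"CPU attributed to this task: {cpu_s * 1000:.1f} ms")
```

The task had been running for roughly 58 times its requested timeout when this snapshot was taken. The resource_stats in the payload attribute only about 12 ms of CPU to this particular task, so the CPU load seen on the primary presumably comes from work accounted for elsewhere; that reading is an inference from the payload, not something confirmed in this thread.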

As a result, the CPU utilization on our primary cluster master/nodes instantly spikes to 100% (Busy User) and stays completely maxed out for roughly 30 minutes, until the tasks finally clear or are cancelled manually.

Here is a metric screenshot showing the exact timeframe when the index pattern creation was initiated:

As you can see, the CPU gets completely saturated by user-space processes (Busy User maxing out at 99.9%) for a prolonged period.

Questions:

  1. Why does creating an index pattern in Dashboards trigger a remote indices[*] scan that breaks out of the local standby cluster context?

  2. Why is the 30000ms timeout defined in the query source completely ignored by the task coordinator/worker threads?

  3. Is there a known workaround (e.g., specific Dashboards settings or cluster settings) to prevent index pattern creation from querying remote/replicated clusters?

Any help or insights would be greatly appreciated. Thank you!

@vnovotny98 I’m having a difficult time reproducing the issue you described.

Can you run the following on the follower cluster please and provide the output:

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.remote

Also, do you have data sources enabled? Can you go to “Dashboard Management” → “Data sources” and check whether anything is listed there?

Can you also confirm which index pattern is being used on the standby Dashboards when you create the index pattern?

Hi @Anthony

Output:

{
  "persistent": {
    "cluster": {
      "remote": {
        "proxy-to-xxx": {
          "mode": "proxy",
          "server_name": "xxx",
          "transport": {
            "compress": "true"
          },
          "proxy_address": "xxxx:9300"
        }
      }
    }
  },
  "defaults": {
    "cluster": {
      "remote": {
        "node": {
          "attr": ""
        },
        "initial_connect_timeout": "30s",
        "connect": "true",
        "connections_per_cluster": "3"
      }
    }
  }
}
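
For anyone comparing their own setup, the remote connections in a settings response like the one above can be listed with a few lines. This is a sketch only: remote_aliases is a hypothetical helper, and it assumes the response shape shown above, treating any entry under cluster.remote that carries a proxy_address or seeds value as a remote cluster alias.

```python
def remote_aliases(settings_response):
    """Return the remote cluster aliases configured in persistent/transient settings."""
    aliases = set()
    for scope in ("persistent", "transient"):
        remotes = settings_response.get(scope, {}).get("cluster", {}).get("remote", {})
        for name, value in remotes.items():
            # Skip non-alias keys; only connection entries carry an address.
            if isinstance(value, dict) and ("proxy_address" in value or "seeds" in value):
                aliases.add(name)
    return sorted(aliases)


# Trimmed copy of the output above (values are redacted in the post).
SETTINGS = {
    "persistent": {
        "cluster": {
            "remote": {
                "proxy-to-xxx": {
                    "mode": "proxy",
                    "server_name": "xxx",
                    "transport": {"compress": "true"},
                    "proxy_address": "xxxx:9300",
                }
            }
        }
    },
    "defaults": {"cluster": {"remote": {"node": {"attr": ""}, "connect": "true"}}},
}

print(remote_aliases(SETTINGS))
```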

Also, do you have data sources enabled. Can you go to “Dashboard Management” → “Data sources”, is there anything listed there?

No items found. No, we don’t use any other Data Sources.

I was the one who created the index patterns. I created about 12 new ones on the follower cluster, and after that I started to see problems on the Master cluster. I then tried adding index patterns two more times, and every time I clicked Create index pattern it affected our Master cluster.

@vnovotny98 do you see the issue when you click the “Create index pattern” button that just opens the wizard, or after you have entered all the details (pattern) and clicked the final “Create index pattern”? If so, what is the index pattern you are trying? Is it any?

Can you also confirm you don’t have anything configured in Index pattern placeholder in Dashboard Management → Advanced Settings?

@Anthony

I think it’s default.

My config is same for both clusters.

The problem starts right when I click Create index pattern. Right here:

Right now I can see the CPU on the Master cluster rise.

After killing these tasks with:
POST _tasks/_cancel?actions=*search*

it goes back to normal.

When I try to create a new index pattern on the Master cluster, there is no spike. That’s normal.

@vnovotny98 are you able to list the 12 index patterns you have created? Are any of them referencing the leader cluster?

Yes, I am able to list the indices and their mappings and settings with GET in Dev Tools.

I can see their settings:

    "settings": {
      "index": {
        "replication": {
          "type": "DOCUMENT"
        },
        "routing": {
          "allocation": {
            "require": {
              "temp": "hot"
            }
          }
        },
        "number_of_shards": "1",
        "translog": {
          "generation_threshold_size": "32mb"
        },
        "plugins": {
          "replication": {
            "follower": {
              "leader_index": "proxy-to-xxxx:monitoring-000176"
            }
          },
          "index_state_management": {
            "rollover_alias": "monitoring"
          }
        },
        "provided_name": "monitoring-000176",
        "creation_date": "1781543565360",
        "priority": "50",
        "number_of_replicas": "2",
        "uuid": "eql8hWc9S-2qz7HI68QryQ",
        "version": {
          "created": "136408127"
        }
      }
    }
  }
}

I am also able to search and read documents in Discover and Dashboards.

The only trouble comes up when creating an index pattern.

After testing this out locally I was able to find the below details:

The trigger: opening Dashboards Management → Index Patterns (which you pass through en route to “Create index pattern”) runs an automatic background check: “are there any remote clusters available for cross-cluster search?” Since CCR requires your standby to have a remote connection configured to your primary, that check reaches your primary - independent of what you type, your indexPattern:placeholder setting, or Data Sources (which you’ve confirmed you don’t use).

Why a single instance of this can run for 27 minutes: the check isn’t a lightweight “does this exist” lookup - it’s implemented as a full search with a document-count aggregation, with no filter at all. To answer it, OpenSearch effectively has to account for matching documents across every shard of every index on your entire primary cluster. On a cluster accumulating real data over time (a monitoring-* rollover family especially), that’s not “a slow query” - it’s closer to counting everything in the whole cluster, and the runtime scales with your total data volume, not with the complexity of the question. The actual question only needs a yes/no answer. I think implementing it as a full-cluster count is disproportionate, and that mismatch is the core issue.
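
To make the shape of that request concrete, the sketch below decodes the query source from the task description posted earlier in this thread and summarises it. The describe helper is illustrative only. The note on 2147483647 reflects my understanding that this value (2^31 - 1) is how exact hit counting is represented, so treat it as an assumption.

```python
import json

# Query source copied from the task description earlier in the thread (JSON-decoded).
SOURCE = (
    '{"size":0,"timeout":"30000ms","track_total_hits":2147483647,'
    '"aggregations":{"indices":{"terms":{"field":"_index","size":200,'
    '"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,'
    '"order":[{"_count":"desc"},{"_key":"asc"}]}}}}'
)


def describe(body):
    """Summarise the properties of the search body that matter for its cost."""
    return {
        # No "query" key means nothing narrows the set of matching documents.
        "has_query_filter": "query" in body,
        # size 0: no documents are returned, only counts and aggregations.
        "returns_hits": body.get("size", 10) > 0,
        # 2**31 - 1 is assumed here to mean "count every hit exactly".
        "exact_total_hits": body.get("track_total_hits") == 2**31 - 1,
        # Fields used by terms aggregations (here: one bucket per index name).
        "aggregates_on": [
            agg["terms"]["field"]
            for agg in body.get("aggregations", {}).values()
            if "terms" in agg
        ],
    }


print(describe(json.loads(SOURCE)))
```

In short: no filter, no returned documents, exact totals, and one bucket per index name, all against the indices[*] target shown in the task description.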

Why you saw several of these at once: looking at the task list you shared, the last three of the four tasks started within about five seconds of each other (13:10:46, 13:10:50 and 13:10:51), with the final two only about a second apart. That’s tighter than manual reloading - I’d guess an automatic retry (browser/Dashboards treating the first attempt as failed/slow and firing it again) rather than proxy mode itself.

On the timeout: the API behind this check doesn’t expose a timeout option at all, so there’s nothing to configure that would have bounded it. And even a client-side give-up wouldn’t have stopped the work already running on the cluster - which is why only directly cancelling the tasks (_tasks/_cancel) actually worked.
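
Since cancelling is the only thing that worked, a narrower alternative to a wildcard cancel is to pick out only the long-running search tasks and cancel those by ID. The sketch below assumes the usual shape of a GET _tasks?detailed response (nodes, each with a tasks map keyed by node_id:task_number) and the POST _tasks/<task_id>/_cancel endpoint. The cancel_paths helper and the two short sample tasks are hypothetical; only the first task's values come from this thread.

```python
def cancel_paths(tasks_response, min_running_seconds, action="indices:data/read/search"):
    """Return cancel endpoints for cancellable tasks of `action` older than the threshold."""
    paths = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if task.get("action") != action:
                continue
            if not task.get("cancellable", False):
                continue
            if task.get("running_time_in_nanos", 0) / 1e9 < min_running_seconds:
                continue
            paths.append(f"_tasks/{task_id}/_cancel")
    return sorted(paths)


NODE = "PiKX_fELQHusiepq5RLhbQ"
SAMPLE = {
    "nodes": {
        NODE: {
            "tasks": {
                # Taken from the stuck task posted earlier in the thread.
                f"{NODE}:766080651": {
                    "action": "indices:data/read/search",
                    "cancellable": True,
                    "running_time_in_nanos": 1730040465446,
                },
                # Hypothetical short-lived search that should be left alone.
                f"{NODE}:1": {
                    "action": "indices:data/read/search",
                    "cancellable": True,
                    "running_time_in_nanos": 2_000_000_000,
                },
                # Hypothetical non-search task.
                f"{NODE}:2": {
                    "action": "cluster:monitor/tasks/lists",
                    "cancellable": False,
                    "running_time_in_nanos": 5_000_000,
                },
            }
        }
    }
}

# Each returned path would be sent as a POST request to the cluster.
for path in cancel_paths(SAMPLE, min_running_seconds=60):
    print("POST", path)
```

Filtering by running time keeps ordinary user searches alive while still clearing the orphaned ones.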

One thing that would help confirm the retry piece: do you recall seeing a loading spinner, error toast, or anything suggesting a retry in the browser around 13:10 (based on your initial screenshot)?

Thank you for the detailed breakdown! This completely makes sense.

I just ran another test to observe the exact behavior in the browser while tracking the _tasks API on the primary cluster.

From the user perspective, there was no visible freeze or error toast this time. The only indicator in the UI was the loading spinner next to the menu (see screenshot below) while I was going through the steps:

[Screenshot: loading spinner next to the menu]

It took me about 2 minutes to search for the index pattern, select it, and configure the @timestamp sorting. From the browser’s point of view, everything finished successfully, the spinner stopped, and the index pattern was created.

The Catch: Tasks are left orphaned forever

Even though Dashboards moved on and the UI looks completely fine now, the search tasks on the primary cluster are still running and appear to be stuck indefinitely. They didn’t all fire at once; instead, they accumulated sequentially (roughly every 20–30 seconds) as I progressed through the creation wizard. Dashboards seemingly abandoned the previous requests once it got the data it needed, leaving the primary cluster to process these heavy full-cluster counts forever.

Here is the current task list from the primary cluster, captured minutes after the browser successfully finished the job:

action,task_id,parent_task_id,type,start_time,timestamp,running_time,ip,node
indices:data/read/search,PiKX...hbQ:799289488,-,transport,1781895665974,19:01:05,6.7m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799289519,-,transport,1781895666082,19:01:06,6.7m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799289530,-,transport,1781895666152,19:01:06,6.7m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799294224,-,transport,1781895685076,19:01:25,6.4m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799302961,-,transport,1781895715078,19:01:55,5.9m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799310071,-,transport,1781895736296,19:02:16,5.5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799317505,-,transport,1781895766299,19:02:46,5m,XXX.XXX.XXX.XXX,XXX-node
indices:data/read/search,PiKX...hbQ:799332315,-,transport,1781895826305,19:03:46,4m,XXX.XXX.XXX.XXX,XXX-node
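
The spacing between these tasks can be read directly from the start_time column (epoch milliseconds). A quick sketch, with the values copied from the listing above and gaps_ms as an illustrative helper name:

```python
# start_time values (epoch milliseconds) copied from the listing above.
START_TIMES_MS = [
    1781895665974,
    1781895666082,
    1781895666152,
    1781895685076,
    1781895715078,
    1781895736296,
    1781895766299,
    1781895826305,
]


def gaps_ms(start_times_ms):
    """Return the gap in milliseconds between consecutive task start times."""
    ordered = sorted(start_times_ms)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


for gap in gaps_ms(START_TIMES_MS):
    print(f"{gap / 1000:.1f} s")
```

The first three tasks start within about 0.2 seconds of each other, and the later ones follow at gaps of roughly 19, 30, 21, 30 and 60 seconds. Several gaps sit very close to exact multiples of 30 seconds, which may point to a periodic trigger, though nothing in this thread confirms that.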

Dashboards triggers multiple non-configurable full-cluster counts during the multi-step wizard, but it never cancels the previous transport tasks when the frontend lifecycle moves forward or when the user makes a new selection.

By the way, I am planning to migrate our clusters from OpenSearch 2.x to OpenSearch 3.x within the next few weeks. I will keep an eye on this behavior during and after the upgrade to see if the issue persists or if there are any core improvements in how Dashboards handles these remote cluster checks.

@vnovotny98 The upgrade should improve this, but I don’t think it will completely fix it. This PR, which was included in the 3.x versions, adds an enrichment process and, crucially, the step that decides whether to query *:* at all. However, that logic only runs inside the “Configure data source” step, which is part of the MDS UI and only renders when MDS is enabled.

I see, thanks for the clarification. However, in our architecture, we do not plan to use Multiple Data Sources (MDS). Our setup is a strict Cross-Cluster Replication (CCR) for disaster recovery.

The data is physically replicated and stored locally on the disks of both clusters. When we are in the standby cluster, we only want to manage and query its local data. We do not need or want Dashboards to look at the primary cluster as an external data source.

If this fix is strictly tied to the MDS UI and only renders when MDS is enabled, it means that standard CCR setups like ours will still suffer from this bug even after upgrading to version 3.x. The automatic background check should ideally respect whether a remote connection is just a CCR follower relationship, or at least respect the local index patterns without forcing a full cluster scan on the remote/leader cluster.