Parent circuit breaker tripping for the same node after upgrade to OpenSearch 2.16.0

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.16.0

Describe the issue:
We recently upgraded from 2.10.0 to 2.16.0, and now the parent circuit breaker trips very often for the opensearch-cluster-master-0 pod.
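
A minimal sketch for checking which node's parent breaker is under pressure, via the _nodes/stats/breaker API (the endpoint and credentials below are placeholders, not our real setup):

# Sketch: print the parent circuit breaker state for every node.
# BASE and AUTH are placeholders -- adjust for your cluster.
import requests

BASE = "https://localhost:9200"   # hypothetical endpoint
AUTH = ("admin", "admin")         # hypothetical credentials

resp = requests.get(f"{BASE}/_nodes/stats/breaker", auth=AUTH, verify=False)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    parent = node["breakers"]["parent"]
    print(
        node["name"],
        "limit:", parent["limit_size"],
        "estimated:", parent["estimated_size"],
        "tripped:", parent["tripped"],
    )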

Pod Name                        Primary Shards    Replica Shards
opensearch-cluster-master-0     43                370
opensearch-cluster-master-1     175               238
opensearch-cluster-master-10    170               241
opensearch-cluster-master-2     174               239
opensearch-cluster-master-3     159               252
opensearch-cluster-master-4     162               249
opensearch-cluster-master-5     180               231
opensearch-cluster-master-6     182               230
opensearch-cluster-master-7     193               218
opensearch-cluster-master-8     163               249
opensearch-cluster-master-9     176               235
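
A per-node breakdown like the table above can be derived from _cat/shards; a rough sketch, again with placeholder endpoint and credentials:

# Sketch: count STARTED primary/replica shards per node from _cat/shards.
from collections import Counter

import requests

BASE = "https://localhost:9200"   # hypothetical endpoint
AUTH = ("admin", "admin")         # hypothetical credentials

shards = requests.get(
    f"{BASE}/_cat/shards",
    params={"format": "json", "h": "node,prirep,state"},
    auth=AUTH,
    verify=False,
).json()

counts = Counter((s["node"], s["prirep"]) for s in shards if s["state"] == "STARTED")
for node in sorted({node for node, _ in counts}):
    print(node, "primaries:", counts[(node, "p")], "replicas:", counts[(node, "r")])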

JVM heap usage has definitely increased since the upgrade.
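
A minimal sketch for comparing heap usage per node via _cat/nodes (same placeholder endpoint and credentials):

# Sketch: print heap usage per node, highest first.
import requests

BASE = "https://localhost:9200"   # hypothetical endpoint
AUTH = ("admin", "admin")         # hypothetical credentials

nodes = requests.get(
    f"{BASE}/_cat/nodes",
    params={"format": "json", "h": "name,heap.percent,heap.max"},
    auth=AUTH,
    verify=False,
).json()

for n in sorted(nodes, key=lambda x: int(x["heap.percent"]), reverse=True):
    print(n["name"], f"{n['heap.percent']}% of {n['heap.max']}")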

Configuration:
We are using 12 pods, each with 8 CPUs, 64 GB RAM, and a 31 GB heap. The cluster is deployed on Kubernetes via the open-source OpenSearch Helm charts.

If it helps, we have a lot of small indices. Each index has 2 primary shards and 2 replicas, and each shard is ~500 KB.

clusterName: "opensearch-cluster"
nodeGroup: "master"
singleNode: false
masterService: "opensearch-cluster-master"
roles:
  - master
  - ingest
  - data
  - remote_cluster_client

replicas: 12

image:
  repository: "opensearchproject/opensearch"
  tag: "2.16.0"
  pullPolicy: "IfNotPresent"

opensearchHome: /usr/share/opensearch
config:
    # default max header size is 8kb. increase it to 256kb to get around too_long_http_header_exception
    http.max_header_size: 256KB

    cluster.allocator.existing_shards_allocator.batch_enabled: true
    cluster.routing.allocation.cluster_concurrent_rebalance: 10
    cluster.routing.allocation.enable: all
    cluster.routing.allocation.rebalance.primary.enable: true
    cluster.routing.allocation.shards_batch_gateway_allocator.replica_allocator_timeout: 60s
    cluster.routing.allocation.shards_batch_gateway_allocator.primary_allocator_timeout: 60s

    # Bind to all interfaces because we don't know what IP address Docker will assign to us.
    network.host: 0.0.0.0

Could you share the logs from the pods?

I think the size of each shard is very small relative to your cluster spec (a 31 GB heap for a data node).

What is the purpose of your cluster: search or logging?
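
It would also help to know the effective breaker configuration (for example indices.breaker.total.limit and indices.breaker.total.use_real_memory). A rough way to dump it, assuming a reachable endpoint and admin credentials:

# Sketch: print every circuit-breaker-related setting, including defaults.
import requests

BASE = "https://localhost:9200"   # hypothetical endpoint
AUTH = ("admin", "admin")         # hypothetical credentials

settings = requests.get(
    f"{BASE}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
    auth=AUTH,
    verify=False,
).json()

for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if key.startswith("indices.breaker"):
            print(f"{section}: {key} = {value}")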

We are using the cluster for monitoring and alerting, so we create a monitor and an alert on top of every index that is created.

@yeonghyeonKo does anything stand out for you in the logs?