Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.16.0
Describe the issue:
We recently upgraded from 2.10.0 to 2.16.0, and since then the parent circuit breaker trips very often on the opensearch-cluster-master-0 pod. The current shard distribution across the pods looks like this:
Pod Name                        Primary Shards   Replica Shards
opensearch-cluster-master-0           43              370
opensearch-cluster-master-1          175              238
opensearch-cluster-master-2          174              239
opensearch-cluster-master-3          159              252
opensearch-cluster-master-4          162              249
opensearch-cluster-master-5          180              231
opensearch-cluster-master-6          182              230
opensearch-cluster-master-7          193              218
opensearch-cluster-master-8          163              249
opensearch-cluster-master-9          176              235
opensearch-cluster-master-10         170              241
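A breakdown like this can be pulled from the cat shards API; a quick sketch, assuming the cluster is reachable on localhost:9200:

    # count primary (p) and replica (r) shards per node
    curl -s "http://localhost:9200/_cat/shards?h=node,prirep" | sort | uniq -c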
JVM heap usage has definitely increased since the upgrade.
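The breaker trips and the heap pressure are visible in the node stats; a minimal check, again assuming localhost:9200:

    # parent breaker limit, estimated size and trip count per node
    curl -s "http://localhost:9200/_nodes/stats/breaker?pretty"
    # current heap usage per node
    curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"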
Configuration:
We are running 12 pods, each with 8 CPUs, 64 GB RAM, and a 31 GB heap. The cluster is deployed on Kubernetes via the open-source OpenSearch Helm charts.
If it helps: we have a lot of small indices. Each index has 2 primary shards and 2 replicas, and each shard is roughly 500 KB.
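For illustration, each index is created roughly like this (the index name is a placeholder, and I'm reading "2 replicas" as number_of_replicas: 2):

    curl -X PUT "http://localhost:9200/example-small-index" \
      -H 'Content-Type: application/json' \
      -d '{"settings": {"number_of_shards": 2, "number_of_replicas": 2}}'

The relevant Helm values are below.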
clusterName: "opensearch-cluster"
nodeGroup: "master"
singleNode: false
masterService: "opensearch-cluster-master"
# node.roles=master,ingest,data,remote_cluster_client
roles:
  - master
  - ingest
  - data
  - remote_cluster_client
replicas: 12
image:
  repository: "opensearchproject/opensearch"
  tag: "2.16.0"
  pullPolicy: "IfNotPresent"
opensearchHome: /usr/share/opensearch
config:
  # default max header size is 8 KB; increase it to 256 KB to get around too_long_http_header_exception
  http.max_header_size: 256KB
  cluster.allocator.existing_shards_allocator.batch_enabled: true
  cluster.routing.allocation.cluster_concurrent_rebalance: 10
  cluster.routing.allocation.enable: all
  cluster.routing.allocation.rebalance.primary.enable: true
  cluster.routing.allocation.shards_batch_gateway_allocator.replica_allocator_timeout: 60s
  cluster.routing.allocation.shards_batch_gateway_allocator.primary_allocator_timeout: 60s
  # Bind to all interfaces because we don't know what IP address Docker will assign to us.
  network.host: 0.0.0.0
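To double-check that these opensearch.yml settings are actually in effect on the upgraded nodes, they can be read back from the cluster settings API (static settings from opensearch.yml should show up under "defaults"); a quick sketch, assuming localhost:9200:

    curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
      | grep -E 'cluster\.routing\.allocation|shards_batch_gateway_allocator|existing_shards_allocator'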