Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.16.0
Describe the issue:
We recently upgraded from 2.10.0 to 2.16.0, and since then the parent circuit breaker trips very often on the opensearch-cluster-master-0 pod. The current shard distribution across the pods looks like this:
Pod Name                        Primary Shards   Replica Shards
opensearch-cluster-master-0           43              370
opensearch-cluster-master-1          175              238
opensearch-cluster-master-2          174              239
opensearch-cluster-master-3          159              252
opensearch-cluster-master-4          162              249
opensearch-cluster-master-5          180              231
opensearch-cluster-master-6          182              230
opensearch-cluster-master-7          193              218
opensearch-cluster-master-8          163              249
opensearch-cluster-master-9          176              235
opensearch-cluster-master-10         170              241
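A breakdown like this can be pulled from the cat shards API; a quick sketch, assuming the cluster is reachable on localhost:9200:

    # count primary (p) and replica (r) shards per node
    curl -s "http://localhost:9200/_cat/shards?h=node,prirep" | sort | uniq -c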
JVM heap usage has definitely increased since the upgrade.
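The breaker trips and the heap pressure are visible in the node stats; a minimal check, again assuming localhost:9200:

    # parent breaker limit, estimated size and trip count per node
    curl -s "http://localhost:9200/_nodes/stats/breaker?pretty"
    # current heap usage per node
    curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"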
Configuration:
We are running 12 pods, each with 8 CPUs, 64 GB RAM, and a 31 GB heap. The cluster is deployed on Kubernetes via the open-source OpenSearch Helm charts.
If it helps: we have a lot of small indices. Each index has 2 primary shards and 2 replicas, and each shard is roughly 500 KB.
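For illustration, each index is created roughly like this (the index name is a placeholder, and I'm reading "2 replicas" as number_of_replicas: 2):

    curl -X PUT "http://localhost:9200/example-small-index" \
      -H 'Content-Type: application/json' \
      -d '{"settings": {"number_of_shards": 2, "number_of_replicas": 2}}'

The relevant Helm values are below.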
clusterName: "opensearch-cluster"
nodeGroup: "master"
singleNode: false
masterService: "opensearch-cluster-master"
# node.roles=master,ingest,data,remote_cluster_client
roles:
  - master
  - ingest
  - data
  - remote_cluster_client
replicas: 12
image:
  repository: "opensearchproject/opensearch"
  tag: "2.16.0"
  pullPolicy: "IfNotPresent"
opensearchHome: /usr/share/opensearch
config:
  # default max header size is 8 KB; increase it to 256 KB to get around too_long_http_header_exception
  http.max_header_size: 256KB
  cluster.allocator.existing_shards_allocator.batch_enabled: true
  cluster.routing.allocation.cluster_concurrent_rebalance: 10
  cluster.routing.allocation.enable: all
  cluster.routing.allocation.rebalance.primary.enable: true
  cluster.routing.allocation.shards_batch_gateway_allocator.replica_allocator_timeout: 60s
  cluster.routing.allocation.shards_batch_gateway_allocator.primary_allocator_timeout: 60s
  # Bind to all interfaces because we don't know what IP address Docker will assign to us.
  network.host: 0.0.0.0
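To double-check that these opensearch.yml settings are actually in effect on the upgraded nodes, they can be read back from the cluster settings API (static settings from opensearch.yml should show up under "defaults"); a quick sketch, assuming localhost:9200:

    curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
      | grep -E 'cluster\.routing\.allocation|shards_batch_gateway_allocator|existing_shards_allocator'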