Circuit breaker "parent" tripped

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.17

Describe the issue:
The circuit breaker trips during queries, and the error message says that the “parent” circuit breaker was tripped. According to the documentation, the breaker limits should not be modified; instead, the root cause should be found.
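For reference, the limit of 972.7mb in the error is exactly 95% of a 1gb heap, which matches the default parent breaker limit (indices.breaker.total.limit) when real-memory tracking is enabled, so the breaker appears to be reporting actual heap usage. This is how I have been checking heap per node (filter_path only trims the response):

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent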

When looking at search_backpressure, the cancellation count is 0 and the mode is “monitor_only”. How does this relate to the circuit breaker?

We hit this issue every 3 months or so. Is it related to heap not being reclaimed properly? What can I check or configure to tune the system and avoid this situation?
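To see whether old-generation collections are actually reclaiming memory, I have been sampling GC stats between incidents (the collector names depend on which GC the JVM is running):

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc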

So far our workaround is to restart the cluster. Doing that every 3 months is not too problematic, but it does point to a larger problem where heap is gradually lost over time. Are there other settings to reclaim heap more aggressively?
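As a lighter mitigation than a full restart, I assume the cache clear API would at least release fielddata and request cache memory, though it would not address a genuine leak:

POST _cache/clear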

This error can also be seen in Data Prepper:

2024-12-10T11:51:33,196 [raw-pipeline-processor-worker-9-thread-1] ERROR org.opensearch.dataprepper.plugins.processor.oteltracegroup.OTelTraceGroupProcessor - Search request for traceGroup failed for traceIds: due to OpenSearch exception [type=circuit_breaking_exception, reason=[parent] Data too large, data for [<http_request>] would be [1037622004/989.5mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1037621232/989.5mb], new bytes reserved: [772/772b], usages [request=0/0b, fielddata=7539/7.3kb, in_flight_requests=10484692/9.9mb]]
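For completeness, these are the breaker-related settings I looked at; we are running with the defaults and have not raised any limits (include_defaults shows the values that were never set explicitly):

GET _cluster/settings?include_defaults=true&filter_path=defaults.indices.breaker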

Configuration:

GET _nodes/stats/breaker

{
  "_nodes": {
    "total": 2,
    "successful": 1,
    "failed": 1,
    "failures": [
      {
        "type": "failed_node_exception",
        "reason": "Failed node [LGr7ErNcQM2h44_Y_4WewQ]",
        "node_id": "LGr7ErNcQM2h44_Y_4WewQ",
        "caused_by": {
          "type": "circuit_breaking_exception",
          "reason": "[parent] Data too large, data for [cluster:monitor/nodes/stats[n]] would be [1031312560/983.5mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1031312432/983.5mb], new bytes reserved: [128/128b], usages [request=0/0b, fielddata=14209/13.8kb, in_flight_requests=128/128b]",
          "bytes_wanted": 1031312560,
          "bytes_limit": 1020054732,
          "durability": "PERMANENT"
        }
      }
    ]
  },
  "cluster_name": "opensearch-cluster",
  "nodes": {
    "MvQWodTNQLG2xnzsFuU9qQ": {
      "timestamp": 1733830114047,
      "name": "opensearch-node1",
      "transport_address": "192.168.64.4:9300",
      "host": "192.168.64.4",
      "ip": "192.168.64.4:9300",
      "roles": [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "breakers": {
        "request": {
          "limit_size_in_bytes": 644245094,
          "limit_size": "614.3mb",
          "estimated_size_in_bytes": 0,
          "estimated_size": "0b",
          "overhead": 1,
          "tripped": 0
        },
        "fielddata": {
          "limit_size_in_bytes": 429496729,
          "limit_size": "409.5mb",
          "estimated_size_in_bytes": 7320,
          "estimated_size": "7.1kb",
          "overhead": 1.03,
          "tripped": 0
        },
        "in_flight_requests": {
          "limit_size_in_bytes": 1073741824,
          "limit_size": "1gb",
          "estimated_size_in_bytes": 5241960,
          "estimated_size": "4.9mb",
          "overhead": 2,
          "tripped": 0
        },
        "parent": {
          "limit_size_in_bytes": 1020054732,
          "limit_size": "972.7mb",
          "estimated_size_in_bytes": 1037126984,
          "estimated_size": "989mb",
          "overhead": 1,
          "tripped": 83594
        }
      }
    }
  }
}
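Since the full breaker output is noisy, I have been polling just the parent breaker to watch estimated_size_in_bytes creep toward the limit between incidents:

GET _nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent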

GET _nodes/stats/search_backpressure

{
  "_nodes": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "cluster_name": "opensearch-cluster",
  "nodes": {
    "LGr7ErNcQM2h44_Y_4WewQ": {
      "timestamp": 1733830444618,
      "name": "opensearch-node2",
      "transport_address": "192.168.64.3:9300",
      "host": "192.168.64.3",
      "ip": "192.168.64.3:9300",
      "roles": [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "search_backpressure": {
        "search_task": {
          "resource_tracker_stats": {
            "cpu_usage_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            },
            "heap_usage_tracker": {
              "cancellation_count": 0,
              "current_max_bytes": 0,
              "current_avg_bytes": 0,
              "rolling_avg_bytes": 0
            },
            "elapsed_time_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            }
          },
          "cancellation_stats": {
            "cancellation_count": 0,
            "cancellation_limit_reached_count": 0
          }
        },
        "search_shard_task": {
          "resource_tracker_stats": {
            "cpu_usage_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            },
            "heap_usage_tracker": {
              "cancellation_count": 0,
              "current_max_bytes": 0,
              "current_avg_bytes": 0,
              "rolling_avg_bytes": 854
            },
            "elapsed_time_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            }
          },
          "cancellation_stats": {
            "cancellation_count": 0,
            "cancellation_limit_reached_count": 0
          }
        },
        "mode": "monitor_only"
      }
    },
    "MvQWodTNQLG2xnzsFuU9qQ": {
      "timestamp": 1733830444618,
      "name": "opensearch-node1",
      "transport_address": "192.168.64.4:9300",
      "host": "192.168.64.4",
      "ip": "192.168.64.4:9300",
      "roles": [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "search_backpressure": {
        "search_task": {
          "resource_tracker_stats": {
            "elapsed_time_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            },
            "heap_usage_tracker": {
              "cancellation_count": 0,
              "current_max_bytes": 0,
              "current_avg_bytes": 0,
              "rolling_avg_bytes": 790112
            },
            "cpu_usage_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            }
          },
          "cancellation_stats": {
            "cancellation_count": 0,
            "cancellation_limit_reached_count": 0
          }
        },
        "search_shard_task": {
          "resource_tracker_stats": {
            "elapsed_time_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            },
            "heap_usage_tracker": {
              "cancellation_count": 0,
              "current_max_bytes": 0,
              "current_avg_bytes": 0,
              "rolling_avg_bytes": 35527
            },
            "cpu_usage_tracker": {
              "cancellation_count": 0,
              "current_max_millis": 0,
              "current_avg_millis": 0
            }
          },
          "cancellation_stats": {
            "cancellation_count": 0,
            "cancellation_limit_reached_count": 0
          }
        },
        "mode": "monitor_only"
      }
    }
  }
}
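If switching from monitoring to actual cancellation would help here, my understanding from the search backpressure docs is that the mode is a dynamic cluster setting (we have not tried this yet):

PUT _cluster/settings
{
  "persistent": {
    "search_backpressure.mode": "enforced"
  }
}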

Relevant Logs or Screenshots:

I have the heap dumps from the nodes but cannot tell if what I’m seeing is normal or not.

167 instances of "org.opensearch.index.IndexService", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xc01003a8" occupy 640,351,832 (73.04%) bytes.

Biggest instances:
• org.opensearch.index.IndexService @ 0xeb014d50 - 35,630,168 (4.06%) bytes
• org.opensearch.index.IndexService @ 0xe0cc4448 - 11,868,944 (1.35%) bytes
• org.opensearch.index.IndexService @ 0xe43b0648 - 11,717,264 (1.34%) bytes
• org.opensearch.index.IndexService @ 0xd98f4808 - 11,197,040 (1.28%) bytes
• org.opensearch.index.IndexService @ 0xec0c49d0 - 11,147,056 (1.27%) bytes
• org.opensearch.index.IndexService @ 0xce8b0250 - 11,113,088 (1.27%) bytes
• org.opensearch.index.IndexService @ 0xd4f04550 - 11,052,912 (1.26%) bytes
• org.opensearch.index.IndexService @ 0xd092ba60 - 11,006,416 (1.26%) bytes
..
Most of these instances are referenced from one instance of "java.util.concurrent.RunnableScheduledFuture[]", loaded by "<system class loader>", which occupies 7,616 (0.00%) bytes. The instance is referenced by "java.lang.Thread @ 0xc2304180 opensearch[opensearch-node1][scheduler][T#1]", loaded by "<system class loader>".

The thread java.lang.Thread @ 0xc2304180 opensearch[opensearch-node1][scheduler][T#1] keeps local variables with total size 16,760 (0.00%) bytes.


Thread "java.lang.Thread @ 0xc2304180 opensearch[opensearch-node1][scheduler][T#1]" has a local variable or reference to "java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue @ 0xc2304210", which is on the shortest path to "java.util.concurrent.RunnableScheduledFuture[913] @ 0xe1126618".

Significant stack frames and local variables
• java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take()Ljava/util/concurrent/RunnableScheduledFuture; (ScheduledThreadPoolExecutor.java:1182)
  ◦ java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue @ 0xc2304210 retains 7,648 (0.00%) bytes
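As far as I understand, each IndexService instance corresponds to an index with shards allocated on the node, so I cross-checked the 167 instances against what the cluster reports:

GET _cat/indices?v&h=index,pri,rep,store.size
GET _cat/shards?v&h=index,shard,prirep,node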

I have the dumps open in Eclipse MAT so if there are questions about something specific I can take a look and report.