Recovering the OpenSearch cluster when "Data too large" error occurs

Nilushan · July 3, 2024, 5:41am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11.1
Operator 2.6.0

Describe the issue:
We use the OpenSearch operator to manage the OpenSearch cluster. Fluent Bit publishes logs to OpenSearch. We recently encountered an issue in OpenSearch where we saw the "Data too large" error:

{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [610096106/581.8mb], which is larger than the limit of [597688320/570mb], real usage: [610090416/581.8mb], new bytes reserved: [5690/5.5kb], usages [request=0/0b, fielddata=39267/38.3kb, in_flight_requests=124478/121.5kb]","bytes_wanted":610096106,"bytes_limit":597688320,"durability":"TRANSIENT"}}}

The way to fix this is to increase the heap size. We can do that by increasing the memory requests allocated to the data node pods.
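
To make the numbers in the error concrete, here is a minimal sketch that parses the message above and works out how far over the limit the request was. It assumes the parent breaker is at what I believe is the default of 95% of the JVM heap (indices.breaker.total.limit with real-memory accounting enabled); if that setting was changed on the cluster, the implied heap size below does not apply. The function and variable names are made up for illustration.

```python
import re

# Error text copied from the post, shortened to the parts that are parsed.
reason = (
    "[parent] Data too large, data for [indices:data/write/bulk[s]] "
    "would be [610096106/581.8mb], which is larger than the limit of "
    "[597688320/570mb], real usage: [610090416/581.8mb], "
    "new bytes reserved: [5690/5.5kb]"
)

def parse_breaker(text):
    """Pull the byte counts out of a circuit_breaking_exception reason."""
    def grab(label):
        return int(re.search(label + r" \[(\d+)/", text).group(1))
    return {
        "would_be": grab("would be"),
        "limit": grab("limit of"),
        "real_usage": grab("real usage:"),
        "reserved": grab("new bytes reserved:"),
    }

# ASSUMPTION: parent breaker limit is 95% of the heap (not stated in the post).
ASSUMED_LIMIT_PERCENT = 95
MIB = 1024 * 1024

info = parse_breaker(reason)
overshoot_bytes = info["would_be"] - info["limit"]
implied_heap_mib = info["limit"] * 100 // ASSUMED_LIMIT_PERCENT // MIB

print("real usage + reserved:", info["real_usage"] + info["reserved"])
print("over the limit by (MiB):", round(overshoot_bytes / MIB, 1))
print("implied heap (MiB):", implied_heap_mib)
```

Under that assumption the data nodes are running with roughly a 600 MiB heap, and almost all of the usage is the heap's real usage rather than the 5.5kb bulk request itself, which is consistent with the heap simply being too small for the workload.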

But when this issue occurs, the OpenSearch cluster is in a yellow state. While the cluster is yellow, the operator does not apply changes (such as increased memory requests) or perform rolling restarts; see "Rolling restart only possible when cluster is green" (Issue #643, opensearch-project/opensearch-k8s-operator on GitHub).

What would be a recommended way to recover the OpenSearch cluster in a situation like this?
If we increase the memory requests in the OpenSearch data node StatefulSet and manually delete one pod at a time, the new memory requests will get applied. If the persistent volume associated with that pod gets deleted and a new one comes up, would OpenSearch replicate data from the other data nodes to the new data node? Or could this cause data loss?
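
If the pods are deleted manually one at a time, one way to reduce the risk is to check the cluster health between deletions and only move on once recovery has settled. The sketch below is a hedged illustration, not something the operator does: it takes the JSON returned by GET _cluster/health (already parsed into a dict) and decides whether it looks safe to delete the next pod. The function name, the expected node count of 6 (3 master + 3 data, from the configuration below), and the idea of comparing against a baseline of unassigned shards are assumptions. Note that a shard can only be rebuilt on a fresh volume if another copy exists on a different node, so an index with zero replicas would lose the shards that lived on the deleted volume.

```python
def safe_to_delete_next_pod(health, expected_nodes, baseline_unassigned):
    """Return True when the cluster looks settled after the previous restart.

    health: dict parsed from the GET _cluster/health response.
    expected_nodes: total node count once the restarted pod has rejoined.
    baseline_unassigned: unassigned shard count seen before starting, since
        the cluster in this scenario is already yellow.
    """
    return (
        health.get("status") in ("green", "yellow")  # red means a primary is missing
        and health.get("number_of_nodes") == expected_nodes
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
        and health.get("unassigned_shards", 0) <= baseline_unassigned
    )

# Hypothetical health responses, for illustration only.
still_recovering = {
    "status": "yellow", "number_of_nodes": 6,
    "initializing_shards": 4, "relocating_shards": 0, "unassigned_shards": 9,
}
settled = {
    "status": "yellow", "number_of_nodes": 6,
    "initializing_shards": 0, "relocating_shards": 0, "unassigned_shards": 3,
}

print(safe_to_delete_next_pod(still_recovering, 6, 3))
print(safe_to_delete_next_pod(settled, 6, 3))
```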

Configuration:
3 master nodes
3 data nodes

Relevant Logs or Screenshots:

Mantas · July 3, 2024, 11:16am

Hi @Nilushan,

Have you considered increasing breaker limits?

indices.breaker.total.limit
indices.breaker.request.limit

See more details here: Circuit breaker settings - OpenSearch Documentation
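
For reference, a minimal sketch of how these two settings could be changed through the cluster settings API (PUT _cluster/settings), assuming they are dynamically updatable in your version. The base URL and the percentage values are placeholders, not recommendations, and the helper name is made up. The code only builds the request; authentication and TLS setup are left out, and nothing is sent.

```python
import json
import urllib.request

def breaker_settings_request(base_url, total_limit, request_limit):
    """Build (but do not send) a PUT _cluster/settings request."""
    body = {
        "persistent": {
            "indices.breaker.total.limit": total_limit,
            "indices.breaker.request.limit": request_limit,
        }
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Placeholder endpoint and values, for illustration only.
req = breaker_settings_request("https://localhost:9200", "98%", "70%")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a call to urllib.request.urlopen(req) with whatever credentials and certificates the cluster requires.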

Best,
mj

Nilushan · July 5, 2024, 4:55am

Hey @Mantas,
Thank you for looking into this.
From what I understand, circuit breakers are used to prevent nodes from running out of memory. If I increase indices.breaker.total.limit and indices.breaker.request.limit, we would only be delaying the point at which the circuit breaker gets triggered, right? When the input data size increases, wouldn't the circuit breaker get triggered again and lead to the same error?
