Recovering the OpenSearch cluster when a "Data too large" error occurs

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11.1
Operator 2.6.0

Describe the issue:
We use the OpenSearch operator to manage our OpenSearch cluster, and Fluent Bit publishes logs to it. We recently ran into the following "Data too large" circuit-breaker error in OpenSearch:

{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [610096106/581.8mb], which is larger than the limit of [597688320/570mb], real usage: [610090416/581.8mb], new bytes reserved: [5690/5.5kb], usages [request=0/0b, fielddata=39267/38.3kb, in_flight_requests=124478/121.5kb]","bytes_wanted":610096106,"bytes_limit":597688320,"durability":"TRANSIENT"}}}

The standard fix is to increase the heap size, which we can do by increasing the memory requests allocated to the data node pods. Notably, the 597688320-byte (570mb) limit in the error is exactly 95% (the default indices.breaker.total.limit) of a 600MB heap, so the data nodes are currently running with a fairly small heap.
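In the operator's OpenSearchCluster resource, the change we intend would look roughly like the sketch below. Pool name and sizes are illustrative, and, as far as we understand, the operator derives the JVM heap from the memory request unless a jvm string is set on the node pool:

spec:
  nodePools:
    - component: data        # illustrative pool name
      replicas: 3
      roles:
        - data
      resources:
        requests:
          memory: "4Gi"      # raised so the derived heap clears the ~582mb the bulk writes needed
          cpu: "1"
        limits:
          memory: "4Gi"
      jvm: "-Xms2g -Xmx2g"   # optional: pin the heap explicitly instead of relying on the default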

But when this issue occurs, the OpenSearch cluster is in a yellow state, and while the cluster is yellow the operator does not apply changes (such as increased memory requests) or perform rolling restarts; see Rolling restart only possible when cluster is green · Issue #643 · opensearch-project/opensearch-k8s-operator · GitHub.

What would be the recommended way to recover the OpenSearch cluster in a situation like this?
If we increase the memory requests on the OpenSearch data node StatefulSet directly and manually delete one pod at a time, the new memory requests will be applied. If the persistent volume associated with a pod is deleted and a new one is created, will OpenSearch replicate data from the other data nodes to the new data node? Or could this approach cause data loss?
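Concretely, the manual procedure we have in mind looks roughly like this (namespace, resource names, endpoint, and credentials are placeholders):

# 1. Confirm every index has at least one replica, so a single data node
#    can be rebuilt from its peers:
curl -s -k -u admin:admin "https://localhost:9200/_cat/indices?v&h=index,rep,health"

# 2. Bump the memory request on the data StatefulSet directly (the operator
#    may revert this when it next reconciles, and depending on the
#    StatefulSet's updateStrategy this may itself roll the pods):
kubectl -n opensearch patch statefulset my-cluster-data --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "4Gi"}
]'

# 3. Delete one pod at a time, then wait for the cluster to report green
#    before touching the next one:
kubectl -n opensearch delete pod my-cluster-data-0
curl -s -k -u admin:admin "https://localhost:9200/_cluster/health?wait_for_status=green&timeout=15m"

# 4. Watch shards being copied back onto the replacement node:
curl -s -k -u admin:admin "https://localhost:9200/_cat/recovery?v&active_only=true"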

Configuration:
3 master nodes
3 data nodes

Relevant Logs or Screenshots:

Hi @Nilushan,

Have you considered increasing breaker limits?

indices.breaker.total.limit
indices.breaker.request.limit

See more details here: Circuit breaker settings - OpenSearch Documentation
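For example, both settings can be raised dynamically through the cluster settings API, with no restart needed. The endpoint, credentials, and percentages below are only illustrative, and note that transient settings reset on a full cluster restart:

curl -X PUT -k -u admin:admin "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.breaker.total.limit": "98%",
    "indices.breaker.request.limit": "70%"
  }
}'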

Best,
mj

Hey @Mantas,
Thank you for looking into this.
From what I understand, circuit breakers are meant to prevent nodes from running out of memory. If I increase indices.breaker.total.limit and indices.breaker.request.limit, we would only be delaying when the circuit breaker gets triggered, right? When the input data volume grows, wouldn't the circuit breaker trip again and lead to the same error?
