Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11.1
Operator 2.6.0
Describe the issue:
We use the OpenSearch operator to manage our OpenSearch cluster, and Fluent Bit publishes logs to it. We recently hit a "Data too large" error in OpenSearch:
{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [610096106/581.8mb], which is larger than the limit of [597688320/570mb], real usage: [610090416/581.8mb], new bytes reserved: [5690/5.5kb], usages [request=0/0b, fielddata=39267/38.3kb, in_flight_requests=124478/121.5kb]","bytes_wanted":610096106,"bytes_limit":597688320,"durability":"TRANSIENT"}}}
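For reference, the figures in this error can be decoded with a little arithmetic. The sketch below assumes the parent circuit breaker is at its default limit of 95% of the JVM heap (`indices.breaker.total.limit`); if that setting has been changed on the cluster, the implied heap size will differ. The function names are only illustrative.

```python
# Values copied from the circuit_breaking_exception above.
BYTES_WANTED = 610_096_106   # "bytes_wanted"
BYTES_LIMIT = 597_688_320    # "bytes_limit"

# Assumption: the parent breaker limit is the default 95% of the JVM heap.
PARENT_BREAKER_PERCENT = 95

MIB = 1024 * 1024


def implied_heap_mib(limit_bytes, percent=PARENT_BREAKER_PERCENT):
    """Heap size in MiB implied by a parent-breaker limit."""
    return limit_bytes * 100 / percent / MIB


def overshoot_mib(wanted_bytes, limit_bytes):
    """How far the request would have pushed usage past the limit, in MiB."""
    return (wanted_bytes - limit_bytes) / MIB


heap_mib = implied_heap_mib(BYTES_LIMIT)
over_mib = overshoot_mib(BYTES_WANTED, BYTES_LIMIT)

print(f"implied JVM heap: {heap_mib:.1f} MiB")
print(f"overshoot past the limit: {over_mib:.1f} MiB")
```

Under that assumption the data nodes are running with roughly a 600 MiB heap. This only decodes the message; because the breaker tracks real heap usage, it does not by itself say how large the heap needs to be.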
The way to fix this is to increase the heap size. We can do that by increasing the memory requests allocated to the data node pods.
But when this issue occurs, the OpenSearch cluster is in a yellow state, and while the cluster is yellow the operator does not apply changes (such as increased memory requests) or perform rolling restarts. See "Rolling restart only possible when cluster is green", Issue #643 in opensearch-project/opensearch-k8s-operator on GitHub.
What would be a recommended way to recover the OpenSearch cluster in a situation like this?
If we increase the memory requests in the OpenSearch data node StatefulSet and manually delete one pod at a time, the new memory requests will be applied. If the persistent volume associated with a pod gets deleted and a new one comes up, would OpenSearch replicate data from the other data nodes to the new data node, or could this cause data loss?
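One check that might help frame the data-loss question is whether a given data node holds the only started copy of any shard. The sketch below is not an operator feature or a recommended procedure. It assumes the rows come from `GET _cat/shards?format=json`, which returns `index`, `shard`, `prirep`, `state`, and `node` fields. The index and node names in the sample are hypothetical.

```python
from collections import defaultdict


def shards_only_on(cat_shards, node):
    """Return sorted (index, shard) pairs whose only STARTED copy is on `node`.

    `cat_shards` is a list of dicts shaped like the rows returned by
    GET _cat/shards?format=json.
    """
    started_on = defaultdict(set)
    for row in cat_shards:
        # Unassigned or initializing copies cannot serve as a recovery source.
        if row.get("state") == "STARTED":
            started_on[(row["index"], row["shard"])].add(row["node"])
    return sorted(key for key, nodes in started_on.items() if nodes == {node})


# Hypothetical sample: shard 0 has an unassigned replica (a yellow cluster),
# shard 1 has started copies on two different nodes.
sample = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "data-0"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "data-1"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "data-0"},
]

at_risk = shards_only_on(sample, "data-0")
print(at_risk)
```

In this sample, `logs` shard 0 has no started copy outside `data-0`, so that shard's data would exist only on that node's volume; shard 1 has a second started copy on `data-1`.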
Configuration:
3 master nodes
3 data nodes
Relevant Logs or Screenshots: