High heap size on data nodes

Hi, I’m getting this error in my OpenSearch cluster.
I can’t find any docs on it, any ideas?
[monitor_only mode] cancelling task [103529040] due to high resource consumption [heap usage exceeded [397.8mb >= 513.3kb]]

Hey @taltsafrir

This might help; it’s under the “important settings” section of the docs.

You can adjust this in the OpenSearch jvm.options file:

root@ansible:/opt/logstash-8.6.1/config# cat /etc/opensearch/jvm.options
## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://opensearch.org/docs/opensearch/install/important-settings/
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms4g
-Xmx4g
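If it helps, here’s one way to double-check what heap the nodes actually picked up after a restart (a sketch, assuming the cluster is reachable on localhost:9200 without auth; add credentials if the security plugin is enabled):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max'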

And how can I change this parameter?

Hey @taltsafrir

I’m not sure what I’m looking at. What kind of installation is this?

I ran GET /_nodes/stats/breaker

When the Parent Circuit Breaker trips, it’s usually one of two things:

  • Too little heap for what you need. In this case, you’ll have to either add more heap (by increasing heap size or adding more nodes, assuming data can rebalance) or reduce your heap usage. A quick way to check where you stand is sketched after this list.
  • The garbage collector doesn’t clean up in time. In this case you can tweak GC to kick in more often or spend more CPU on collecting.
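As a quick first check (a sketch, assuming the cluster is reachable on localhost:9200 without auth), you can see how close the parent breaker is to its limit and how full the heap is on each node:

curl -s 'http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent'
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'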

For the first case, it depends on what’s taking heap. Is it more “static” memory, such as fielddata, caches, or the indexing buffer? Or is it expensive queries that suddenly increase your heap usage? You can usually figure that out with good OpenSearch monitoring, which plots all these metrics over time so you can see (and alert on) any spikes.
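For example (a sketch, same localhost:9200 assumption as above), you can get a rough per-node breakdown of the “static” consumers, plus fielddata usage per field:

curl -s 'http://localhost:9200/_nodes/stats/indices/fielddata,query_cache,request_cache,segments?human'
curl -s 'http://localhost:9200/_cat/fielddata?v'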

If it’s “static” memory, you may be able to reduce consumption by shrinking e.g. cache sizes. If it’s the queries (worst case, a heap dump taken in times of trouble will tell), you may be able to restructure them. But sometimes you just need more heap/nodes.
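For instance, cache limits can be capped in opensearch.yml (these are static settings, so they require a restart; the percentages below are only illustrative starting points, not recommendations):

indices.fielddata.cache.size: 20%
indices.queries.cache.size: 5%
indices.requests.cache.size: 1%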

But it may not be “live” memory at all; it may be a GC problem. I’ve seen this in two scenarios:

  1. In very low-usage clusters with the default configuration. G1 GC is adaptive, and it may kick in very late in order to be efficient, tripping the Parent Circuit Breaker. You can tell that’s the case if the JVM GC logs show relatively few collections and they don’t take long.

In this case you’ll want to turn off the “adaptiveness” via -XX:-G1UseAdaptiveIHOP and limit the old gen and the young gen, at least before GC kicks in. Something like -XX:InitiatingHeapOccupancyPercent=30 -XX:G1MaxNewSizePercent=45 -XX:+UnlockExperimentalVMOptions would add up to 75%, leaving enough headroom for GC to kick in before the default 95% Parent Circuit Breaker limit (see the jvm.options sketch after this list).

  2. In very high-usage clusters (e.g. lots of concurrent queries), GC may not be able to keep up with allocations. You can see that best in the JVM metrics: heap keeps going up until you hit long GC times (typically full GCs), then heap goes down again. That often happens because the default GCTimeRatio is high, allowing only a few percent of the CPU to be used for GC. You can fix that with something like -XX:GCTimeRatio=3 or a similarly low value, as sketched below.
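Here’s a rough jvm.options sketch covering both scenarios; pick one, and treat the values as starting points to tune against your own GC logs rather than drop-in recommendations:

## Scenario 1: low-usage cluster where adaptive G1 kicks in too late
-XX:+UnlockExperimentalVMOptions
-XX:-G1UseAdaptiveIHOP
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1MaxNewSizePercent=45

## Scenario 2: high-usage cluster where GC can't keep up with allocations
## (use instead of the Scenario 1 flags above)
#-XX:GCTimeRatio=3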

All of the -XX... options above go in jvm.options and require restarting OpenSearch. You’ll find more GC tuning tips in this blog post: How to Tune Java Garbage Collection - Sematext. If you’re using a recent version of OpenSearch and Java (e.g. Java 17), then ZGC should pretty much solve all your GC issues :slight_smile:
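If you do go the ZGC route (production ZGC needs Java 15+; on Java 17 no experimental unlock is required), a minimal jvm.options change would look something like the line below. Just make sure to drop any G1-specific flags already in the file (e.g. -XX:+UseG1GC and the Scenario 1 flags above) first:

-XX:+UseZGC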
