Data nodes run out of heap space when taking snapshots

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11.1
Operator 2.6.0

Describe the issue:
We have an OpenSearch cluster that we use for log storage and querying. It had the following resources allocated to each data node:

requests:
    cpu: "500m"
    memory: "1400Mi"
limits:
    cpu: "800m"
    memory: "1700Mi"

So these data nodes had a heap of 700Mi, since the heap is set to 50% of the requested memory by default.
When I tried to take a snapshot of an index (which had a 1.1GB primary shard and 2 replica shards), a data node ran out of heap memory and its pod restarted, causing that node to leave the cluster. As a result, the snapshot did not complete properly.
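
For reference, the heap can also be pinned explicitly per node pool instead of relying on that 50% default. Below is a minimal sketch of the data node pool in the OpenSearchCluster spec, assuming the operator's nodePools jvm field (I am writing the field names from memory, so please check them against the operator docs; the values are just the ones above).

nodePools:
  - component: data
    replicas: 3
    # without an explicit jvm setting the operator sizes the heap at 50% of requests.memory (700Mi here)
    jvm: -Xms700M -Xmx700M
    resources:
      requests:
        cpu: "500m"
        memory: "1400Mi"
      limits:
        cpu: "800m"
        memory: "1700Mi"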

When I increased the resources to the following (which results in a heap size of 1Gi), I was able to take the snapshot.

requests:
    cpu: "500m"
    memory: "2000Mi"
limits:
    cpu: "800m"
    memory: "2100Mi"

But when I tried to take a snapshot that included more indices, using the above resources, a data node ran out of Java heap space again.

The data nodes can ingest logs and serve queries with a given heap size, but the heap runs out of memory when I try to take a snapshot. If I increase the heap to a level that makes snapshots possible, the nodes end up running with more memory than log ingestion and querying require, which wastes resources.

Therefore,

  1. Does the snapshot feature have a minimum required heap size?
  2. Is there a specific “snapshot” node type in OpenSearch, similar to the data and master node types? If so, I could have one dedicated node for snapshots instead of increasing the memory on all data nodes.

Configuration:
3 data nodes
3 master nodes
Using an Azure storage account as the snapshot repository
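
For completeness, this layout and the Azure repository plugin map onto the operator spec roughly as follows (a sketch only; the pluginsList and roles field names are how I remember them from the operator docs, so please double-check):

general:
  version: 2.11.1
  # the repository-azure plugin must be installed on the nodes for the Azure repository to work
  pluginsList:
    - repository-azure
nodePools:
  - component: masters
    replicas: 3
    roles:
      - "cluster_manager"
  - component: data
    replicas: 3
    roles:
      - "data"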

Relevant Logs or Screenshots:

Hi @Nilushan,

Have you checked the circuit breaker settings? Circuit breaker settings - OpenSearch Documentation

Best,
mj

Hi @Mantas,
Thanks. Let me check these settings.

Hi @Mantas ,
I reduced the parent circuit breaker limit (indices.breaker.total.limit) from 95% to 70%. Even after doing that, when I tried to take a snapshot of many indices, a data node ran out of heap memory and its pod restarted, resulting in the same node_shutdown error.
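
For reference, the change amounts to the following; a sketch assuming the operator's general.additionalConfig field, which as far as I understand passes settings through to opensearch.yml (the same setting can also be changed at runtime via the _cluster/settings API):

general:
  additionalConfig:
    # parent circuit breaker, lowered from the 95% default
    indices.breaker.total.limit: "70%"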
Please see the logs below

[2024-07-05T06:13:21,807][INFO ][o.o.j.s.JobSweeper       ] [opensearch-data-0] Running full sweep
[2024-07-05T06:13:22,520][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-data-0] Detected cluster change event for destination migration
[2024-07-05T06:13:25,039][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [844187136]
[2024-07-05T06:13:25,059][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [844187136], after [714688000], allocations [186], duration [20]
[2024-07-05T06:13:30,133][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1004094976]
[2024-07-05T06:13:30,146][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [1004094976], after [903587328], allocations [33], duration [13]
[2024-07-05T06:13:35,159][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [898535424]
[2024-07-05T06:13:35,175][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [898535424], after [662671376], allocations [136], duration [16]
[2024-07-05T06:13:43,346][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [839589376]
[2024-07-05T06:13:43,364][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [839589376], after [814423552], allocations [191], duration [18]
[2024-07-05T06:13:46,713][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67511] overhead, spent [985ms] collecting in the last [1.2s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to data/java_pid30.hprof ...
[2024-07-05T06:13:48,764][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67512] overhead, spent [1.9s] collecting in the last [2s]
[2024-07-05T06:13:48,764][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1015699088]
Heap dump file created [1188532983 bytes in 7.458 secs]
[2024-07-05T06:13:56,221][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did not bring memory usage down, before [1015699088], after [1016929696], allocations [1], duration [7457]
[2024-07-05T06:13:57,069][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-data-0] fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323) ~[?:?]
	at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635) ~[?:?]
	at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646) ~[?:?]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215) ~[?:?]
	at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180) ~[?:?]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:137) ~[?:?]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:129) ~[?:?]
	at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335) ~[?:?]
	at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364) ~[?:?]
	at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866) ~[?:?]
	at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821) ~[?:?]
	at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531) ~[?:?]
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125) ~[?:?]
	at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223) ~[?:?]
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
	at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128) ~[?:?]
	at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201) ~[?:?]
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335) ~[?:?]

By the looks of it, you will need to assign more memory.

You might be interested in the link below for best practices when calculating resources:

Sizing Amazon OpenSearch Service domains - Amazon OpenSearch Service

best,
mj

Hi @Mantas ,
This might be an issue with the Azure snapshot repository plugin. The reason I think so is that, with the same amount of memory, I can take snapshots to AWS S3, but when I take the same snapshot to an Azure storage account, the OpenSearch data node crashes.

I opened a GitHub issue for this as well - [BUG] Heap space goes out of memory and the node crashes when taking snapshots · Issue #14666 · opensearch-project/OpenSearch · GitHub
