Hi @Mantas,
I reduced the parent circuit breaker limit (indices.breaker.total.limit) from 95% to 70%. Even after that change, when I tried to take a snapshot containing many indices, a data node ran out of heap memory and its pod restarted, resulting in the same node_shutdown error.
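For reference, the change described above can be sketched as the request body for `PUT _cluster/settings`. This is only an illustration: the helper name `breaker_settings_body` is hypothetical, and applying the setting as `persistent` rather than `transient` is an assumption, since the post does not say which was used.

```python
import json


def breaker_settings_body(limit_percent: int) -> str:
    """Build a JSON body for PUT _cluster/settings (hypothetical helper).

    Assumption: the limit is applied as a persistent cluster setting.
    """
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in the range 1..100")
    body = {
        "persistent": {
            # Parent circuit breaker limit, as a percentage of the JVM heap.
            "indices.breaker.total.limit": f"{limit_percent}%"
        }
    }
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    # Print the body that would lower the parent breaker to 70%.
    print(breaker_settings_body(70))
```

The printed JSON would be sent to the cluster with any HTTP client; the code only builds the body and does not contact a cluster.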
Please see the logs below:
[2024-07-05T06:13:21,807][INFO ][o.o.j.s.JobSweeper ] [opensearch-data-0] Running full sweep
[2024-07-05T06:13:22,520][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-data-0] Detected cluster change event for destination migration
[2024-07-05T06:13:25,039][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [844187136]
[2024-07-05T06:13:25,059][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [844187136], after [714688000], allocations [186], duration [20]
[2024-07-05T06:13:30,133][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1004094976]
[2024-07-05T06:13:30,146][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [1004094976], after [903587328], allocations [33], duration [13]
[2024-07-05T06:13:35,159][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [898535424]
[2024-07-05T06:13:35,175][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [898535424], after [662671376], allocations [136], duration [16]
[2024-07-05T06:13:43,346][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [839589376]
[2024-07-05T06:13:43,364][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [839589376], after [814423552], allocations [191], duration [18]
[2024-07-05T06:13:46,713][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67511] overhead, spent [985ms] collecting in the last [1.2s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to data/java_pid30.hprof ...
[2024-07-05T06:13:48,764][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67512] overhead, spent [1.9s] collecting in the last [2s]
[2024-07-05T06:13:48,764][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1015699088]
Heap dump file created [1188532983 bytes in 7.458 secs]
[2024-07-05T06:13:56,221][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did not bring memory usage down, before [1015699088], after [1016929696], allocations [1], duration [7457]
[2024-07-05T06:13:57,069][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-data-0] fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635) ~[?:?]
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215) ~[?:?]
at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:137) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:129) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335) ~[?:?]
at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364) ~[?:?]
at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866) ~[?:?]
at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821) ~[?:?]
at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531) ~[?:?]
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125) ~[?:?]
at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325) ~[?:?]
at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434) ~[?:?]
at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223) ~[?:?]
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128) ~[?:?]
at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201) ~[?:?]
at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335) ~[?:?]
fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323)
at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635)
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:137)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335)
at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364)
at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866)
at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821)
at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125)
at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893)
at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325)
at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434)
at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122)
at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122)
at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128)
at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201)
at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335)
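The heap figures in the HierarchyCircuitBreakerService lines above are raw byte counts, which are hard to compare by eye. A small sketch like the following converts each G1GC result into MiB. The function name `gc_results` and the `SAMPLE` string are illustrative; the sample reuses three of the log messages above with the timestamp and logger prefix trimmed.

```python
import re

# Matches both "GC did bring ..." and "GC did not bring ..." result lines.
GC_RESULT = re.compile(
    r"GC did (?:not )?bring memory usage down, "
    r"before \[(\d+)\], after \[(\d+)\]"
)


def gc_results(log_text: str) -> list[tuple[int, int]]:
    """Return (before_bytes, after_bytes) for every G1GC result line."""
    return [(int(b), int(a)) for b, a in GC_RESULT.findall(log_text)]


# Three messages copied from the log above (prefixes trimmed).
SAMPLE = """\
GC did bring memory usage down, before [844187136], after [714688000], allocations [186], duration [20]
GC did bring memory usage down, before [839589376], after [814423552], allocations [191], duration [18]
GC did not bring memory usage down, before [1015699088], after [1016929696], allocations [1], duration [7457]
"""

if __name__ == "__main__":
    mib = 2**20
    for before, after in gc_results(SAMPLE):
        freed = (before - after) / mib
        print(f"{before / mib:.0f} MiB -> {after / mib:.0f} MiB (freed {freed:.0f} MiB)")
```

Run against the full log, this makes it easier to see that the last collection before the fatal error freed no memory at all.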