Node timeout/crash, will not resync, has to be killed

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Debian 12.11
OpenSearch: 2.19.2
Dashboard: 2.19.2
Browser: not relevant

Describe the issue:

Hello guys!

TLDR
One of my nodes keeps crashing without any apparent reason - the process is still running but not performing correctly. It is always the same node. The Java process is then stuck, and I have to remove the OpenSearch data and kill the process with signal 9 in order to resync the cluster.

I have a three-node cluster (40G RAM, 8 vCores, fast SAN storage) with the security plugin enabled - the nodes form a cluster. One node keeps having timeouts, and at some point the node is removed from the cluster and is not restored automatically. To me it looks like the Java process is stuck, because “systemctl stop opensearch” no longer works.

node001 crashed around 02 am with a timeout of 60 seconds

Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [node001][10.100.6.33:9300][internal:index/shard/recovery/finalize] request_id [311189] timed out after [60036ms]

I have already checked that there is no packet loss.
Moreover, I am monitoring /_cluster/health on all nodes, and on node001 I see quite a few connection timeouts roughly every 20 minutes, each lasting no longer than 1 minute, up until 02 am (after that the API is gone).
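
For the health monitoring described above, it may help to record how long each probe takes rather than only whether it failed, so the periodic timeouts can be lined up against the node logs. The following is only a minimal sketch: the function names are hypothetical, the CA path is whatever your setup uses, and authentication (basic auth or a client certificate, which the security plugin will require) is deliberately left out and would need to be added.

```python
import json
import ssl
import time
import urllib.request
from typing import Callable, Optional, Tuple


def fetch_health(url: str, ca_file: str, timeout_s: float) -> dict:
    """GET a /_cluster/health URL over TLS and return the parsed JSON body.

    Assumption: no authentication is sent here; add it for a secured cluster.
    """
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(url, timeout=timeout_s, context=context) as resp:
        return json.load(resp)


def timed_probe(fetch: Callable[[], dict]) -> Tuple[float, Optional[dict], Optional[str]]:
    """Run one probe and return (elapsed seconds, body or None, error text or None)."""
    start = time.monotonic()
    try:
        body = fetch()
        return time.monotonic() - start, body, None
    except Exception as exc:  # timeouts, resets, TLS errors: all are worth recording
        return time.monotonic() - start, None, f"{type(exc).__name__}: {exc}"


def format_probe(node: str, elapsed_s: float, body: Optional[dict], error: Optional[str]) -> str:
    """Render one probe result as a single log line."""
    if error is not None:
        return f"{node} FAIL after {elapsed_s:.1f}s ({error})"
    return (
        f"{node} ok in {elapsed_s:.1f}s "
        f"status={body.get('status')} nodes={body.get('number_of_nodes')}"
    )
```

A caller would wrap `fetch_health` in a lambda per node, for example `timed_probe(lambda: fetch_health("https://10.100.6.33:9200/_cluster/health", "tls/ca.crt", 10.0))`, and write the formatted line together with a timestamp on every run.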

Could someone please give me a hint on how to at least debug this behaviour? I have changed the log level to debug by setting rootLogger.level = debug in log4j2.properties, in the hope of getting more insight. Right now we are blind :(
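
Independently of the debug log level, the INFO output already contains heap figures in the HierarchyCircuitBreakerService lines, and pulling them out over a longer period might show whether heap pressure builds up before the node hangs. This is a rough sketch only: the regular expression is modelled on the log lines in this post, the names `GcEvent` and `gc_events` are made up for illustration, and treating the bracketed values as bytes is an assumption.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Matches lines such as:
# [ts][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [node001] GC did bring
# memory usage down, before [N], after [M], ...
GC_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[\w+\s*\]\[[^\]]*HierarchyCircuitBreakerService\]\s*"
    r"\[(?P<node>[^\]]+)\] GC did bring memory usage down, "
    r"before \[(?P<before>\d+)\], after \[(?P<after>\d+)\]"
)


class GcEvent(NamedTuple):
    timestamp: str
    node: str
    before_bytes: int  # assumed to be bytes
    after_bytes: int   # assumed to be bytes

    @property
    def before_gib(self) -> float:
        return self.before_bytes / 2**30

    @property
    def freed_bytes(self) -> int:
        return self.before_bytes - self.after_bytes


def gc_events(lines: Iterable[str]) -> Iterator[GcEvent]:
    """Yield one GcEvent for every circuit-breaker GC line found in the input."""
    for line in lines:
        match = GC_LINE.search(line)
        if match:
            yield GcEvent(
                match["ts"], match["node"], int(match["before"]), int(match["after"])
            )
```

Feeding it an open log file from path.logs (the exact file name depends on the installation) yields one event per triggered GC, which can then be compared against the times of the health-check timeouts.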

Thank you very much,
Matt

Configuration:

opensearch.yml (identical for all three nodes except for certificates and node-names)

cluster.name: cluster-stage01
node.name: node001
path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
node.roles: [ 'cluster_manager', 'data', 'ingest' ]

#heap size should be about half of the available system’s memory
bootstrap.memory_lock: true

network.host: 0.0.0.0

#what ip address is published
#this makes sure that 127.0.1.1 is never used
network.publish_host: 10.100.6.33

http.port: 9200

#make sure my hostname does not resolve to 127.0.0.1 - use IPs instead
discovery.seed_hosts: ['10.100.6.33', '10.100.6.34', '10.100.6.35']
cluster.initial_cluster_manager_nodes: ['node001', 'node002', 'node003']

#tls
#all certificates here must be self signed by ca

#http certificates (rest api) - http.port (9200)
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: tls/node001_full.crt
plugins.security.ssl.http.pemkey_filepath: tls/node001.key
plugins.security.ssl.http.pemtrustedcas_filepath: tls/ca.crt
#transport certificates (node to node communication) - port 9300
plugins.security.ssl.transport.pemcert_filepath: tls/node001_full.crt
plugins.security.ssl.transport.pemkey_filepath: tls/node001.key
plugins.security.ssl.transport.pemtrustedcas_filepath: tls/ca.crt

#do not use demo certificates
plugins.security.allow_unsafe_democertificates: false

#securityindex must be populated manually by securityadmin.sh
plugins.security.allow_default_init_securityindex: false

#the allowed DN in all transport certificates of all other node certificates
plugins.security.nodes_dn:
  - "CN=node001"
  - "CN=node002"
  - "CN=node003"

#the allowed admin DN - the admin certificate is needed to initialize the security plugin index
plugins.security.authcz.admin_dn:
  - "CN=admin.cluster-stage01"

Relevant Logs or Screenshots:

node001 (crashed around 02:00 am )

[2025-06-13T01:36:31,157][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node001] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T01:38:16,361][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T01:43:16,361][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T01:48:16,371][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T01:53:16,384][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T01:53:28,174][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [node001] attempting to trigger G1GC due to high heap usage [14083663184]
[2025-06-13T01:53:28,242][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [node001] GC did bring memory usage down, before [14083663184], after [2279736720], allocations [74], duration [68]
[2025-06-13T01:58:16,397][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T01:58:34,676][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node001] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:00:03,613][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:00:03,696][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:00:03,750][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:00:03,841][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:00:03,889][INFO ][o.o.p.PluginsService ] [node001] PluginService:onIndexModule index:[top_queries-2025.06.13-11037/aMgkC6MTSa6ORyJQAmwsiQ]
[2025-06-13T02:00:03,916][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:00:03,950][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:01:04,034][WARN ][o.o.i.c.IndicesClusterStateService] [node001] [top_queries-2025.06.13-11037][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [top_queries-2025.06.13-11037][0]: Recovery failed from {node002}{yaz2vkedQzqSFrz8guMxDg}{Mldrd-FMRQWaWl8kTqUN1A}{10.100.6.34}{10.100.6.34:9300}{dim}{shard_indexing_pressure_enabled=true} into {node001}{5Wnp8XISRlatTW3TWXjdeQ}{5YY7TRWiS92soWZyqSe6Vg}{10.100.6.33}{10.100.6.33:9300}{dim}{shard_indexing_pressure_enabled=true} ([top_queries-2025.06.13-11037][0]: Recovery failed from {node002}{yaz2vkedQzqSFrz8guMxDg}{Mldrd-FMRQWaWl8kTqUN1A}{10.100.6.34}{10.100.6.34:9300}{dim}{shard_indexing_pressure_enabled=true} into {node001}{5Wnp8XISRlatTW3TWXjdeQ}{5YY7TRWiS92soWZyqSe6Vg}{10.100.6.33}{10.100.6.33:9300}{dim}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:764) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:690) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:429) [opensearch-security-2.19.2.0.jar:2.19.2.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1527) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:454) [opensearch-2.19.2.jar:2.19.2]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955) [opensearch-2.19.2.jar:2.19.2]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [top_queries-2025.06.13-11037][0]: Recovery failed from {node002}{yaz2vkedQzqSFrz8guMxDg}{Mldrd-FMRQWaWl8kTqUN1A}{10.100.6.34}{10.100.6.34:9300}{dim}{shard_indexing_pressure_enabled=true} into {node001}{5Wnp8XISRlatTW3TWXjdeQ}{5YY7TRWiS92soWZyqSe6Vg}{10.100.6.33}{10.100.6.33:9300}{dim}{shard_indexing_pressure_enabled=true}
… 9 more
Caused by: org.opensearch.transport.RemoteTransportException: [node002][10.100.6.34:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [node001][10.100.6.33:9300][internal:index/shard/recovery/finalize] request_id [311189] timed out after [60036ms]
at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1421) ~[opensearch-2.19.2.jar:2.19.2]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955) ~[opensearch-2.19.2.jar:2.19.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]
[2025-06-13T02:01:04,077][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:03:16,397][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:05:00,155][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:08:16,398][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:13:16,398][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:18:16,399][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:23:16,400][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:28:16,400][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:33:16,401][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:38:16,401][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:43:16,402][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:48:16,402][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:50:48,488][INFO ][o.o.p.PluginsService ] [node001] PluginService:onIndexModule index:[security-auditlog-2025.06.13/QOWSCcndTjmp9C8rH91EjA]
[2025-06-13T02:50:48,509][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:50:48,749][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:50:48,801][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:51:17,515][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:51:48,940][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:51:48,973][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:52:49,134][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:52:49,204][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node001] Cancelling the migration process.
[2025-06-13T02:53:16,403][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T02:55:14,133][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node001] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:58:16,404][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:03:16,435][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:08:16,451][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:12:12,837][INFO ][o.o.t.t.CronTransportAction] [node001] Start running hourly cron.
[2025-06-13T03:13:16,484][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:13:17,092][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node001] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T03:18:16,484][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:23:16,523][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:28:16,557][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:33:16,571][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep
[2025-06-13T03:38:16,592][INFO ][o.o.j.s.JobSweeper ] [node001] Running full sweep

node002 (same time window, when node001 crashed around 02:00 am)

[2025-06-13T01:49:06,919][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T01:50:31,462][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T01:54:54,665][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T01:55:31,462][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T01:55:57,209][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T01:59:42,387][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:00:03,530][INFO ][o.o.p.PluginsService ] [node002] PluginService:onIndexModule index:[top_queries-2025.06.13-11037/aMgkC6MTSa6ORyJQAmwsiQ]
[2025-06-13T02:00:03,569][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:03,697][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:03,750][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:03,845][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:03,861][INFO ][o.o.i.r.RecoverySourceHandler] [node002] [top_queries-2025.06.13-11037][0][recover to node003] finalizing recovery took [8ms]
[2025-06-13T02:00:03,887][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:03,954][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:00:31,463][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:01:04,076][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:02:20,017][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:03:50,089][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:04:22,614][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [node002] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.19.2.jar:2.19.2]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.118.Final.jar:4.1.118.Final]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
[2025-06-13T02:05:00,136][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:05:31,464][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:10:31,464][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:12:12,836][INFO ][o.o.t.t.CronTransportAction] [node002] Start running hourly cron.
[2025-06-13T02:12:12,837][INFO ][o.o.a.t.ADTaskManager ] [node002] Start to maintain running historical tasks
[2025-06-13T02:15:31,465][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:20:31,465][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:25:31,466][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:30:31,466][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:35:31,467][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:40:31,467][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:45:31,468][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:50:31,468][INFO ][o.o.j.s.JobSweeper ] [node002] Running full sweep
[2025-06-13T02:50:48,491][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:50:48,751][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:50:48,803][INFO ][o.o.p.PluginsService ] [node002] PluginService:onIndexModule index:[security-auditlog-2025.06.13/QOWSCcndTjmp9C8rH91EjA]
[2025-06-13T02:50:48,820][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.
[2025-06-13T02:51:17,516][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node002] Cancelling the migration process.

Hey @tttt

What I noticed in your logs was the following.

Exception during establishing a SSL connection: java.net.SocketException: Connection reset

These errors typically occur when:

  • A client initiates an SSL handshake but closes it prematurely.

  • A misconfigured HTTP client erroneously connects to the SSL port using plain HTTP.

  • Network interruptions or timeouts.

Failed to allocate directory watch: Too many open files

"Too Many Open Files" or Inotify Limit Warnings

Check current inotify limits:

sysctl fs.inotify
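
The sysctl output only shows the limits, not how many inotify instances are in use. Here is a rough sketch for counting them, assuming a Linux /proc filesystem. The function name is made up for illustration. Run it as root, otherwise only your own processes are visible. Note that fs.inotify.max_user_instances applies per user, while this count covers every process you can read, so treat it as an upper bound:

```shell
#!/bin/sh
# Hypothetical helper: count inotify instances visible to the current user.
# Every inotify instance appears as a file descriptor whose symlink
# target is "anon_inode:inotify".
count_inotify_instances() {
  count=0
  for fd in /proc/[0-9]*/fd/*; do
    # Descriptors may vanish or be unreadable; skip those quietly.
    target=$(readlink "$fd" 2>/dev/null) || continue
    if [ "$target" = "anon_inode:inotify" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

count_inotify_instances
```

If the number is close to max_user_instances, the inotify limit is a plausible suspect. If it is far below, the limit is probably not the problem.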

I would also monitor your node (i.e. metrics and logs) to see what happens around 2 AM. That may give a clue about what is going on.
Is a background task (cron job, backup, security scan) by chance running at 2:00 AM?
Did you check that ulimit -n (open files) is >= 65536 for the OpenSearch process?
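
To compare the limit with actual usage for the running process (not the shell you are logged in to), something like this should work. It is only a sketch, assuming Linux /proc and that the service is named opensearch. The function name fd_usage is made up. Reading /proc/PID/fd of another user's process needs root:

```shell
#!/bin/sh
# Hypothetical helper: report the open-files limit and the current
# number of open descriptors for one process.
# Usage: fd_usage PID
fd_usage() {
  pid="$1"
  # In /proc/PID/limits the "Max open files" row holds the soft limit
  # in field 4 and the hard limit in field 5.
  soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
  hard=$(awk '/^Max open files/ {print $5}' "/proc/$pid/limits")
  # Each entry in /proc/PID/fd is one open descriptor.
  used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
  echo "pid=$pid open_fds=$used soft_limit=$soft hard_limit=$hard"
}
```

For the OpenSearch service you could call it as `fd_usage "$(systemctl show -p MainPID --value opensearch)"`. Running it every few minutes before 2 AM would show whether open_fds climbs towards the limit before the node hangs.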

Hello Gsmitt!

First of all, thank you very much for your reply!

limit of files
Could you please tell me where you found line [1]? It does not occur in the log files I attached, and it does not come up in any log file on the servers.

Nevertheless here are my settings:

#sysctl fs.inotify
fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 218535

The systemd unit was not changed:

#grep -i limit /usr/lib/systemd/system/opensearch.service
LimitNOFILE=65535
LimitNPROC=4096
LimitAS=infinity
LimitFSIZE=infinity

but LimitMEMLOCK was set to infinity with a drop-in override of the unit:

#cat /etc/systemd/system/opensearch.service.d/50-opensearch-memlock.conf
[Service]
LimitMEMLOCK=infinity

I rechecked the limits of the running process via /proc/PID/limits (same output on all nodes):

ssl connection error
Regarding the SSL connection error: I tested for packet loss by running mtr for nearly a week, and the result was 0.0% packet loss. I therefore think that the application closes the handshake prematurely. I see these log entries on the other nodes as well, without any problem there.

suspects
The cluster had the same problem just yesterday, but on a different node. I suspect either OpenSearch Dashboards or the security plugin. I am thinking about moving Dashboards to its own VM, although I find that highly unlikely, and I am thinking about disabling the security plugin completely.

The strange thing is that the broken node will not answer any REST API call until I kill the process by hand and restart the service. To me this really looks like an internal OpenSearch bug. Do you know of any other parameter that enables debugging output? It is really strange that there is no apparent reason for this behaviour, just the timeout message.

Do you have any other pointers?

Thanks again,
Matt

[1]

Failed to allocate directory watch: Too many open files