Cluster was running fine, then nodes randomly started losing connection

Hi, I need urgent help, if anyone can hear me. I am running an OpenSearch cluster, version 3.0. The cluster had been running for a week without issues, with Logstash sending logs from an OpenShift cluster to the OpenSearch cluster.

It was always running fine without errors. Then yesterday, out of nowhere, my nodes started losing their connection to the cluster continuously.

I see this in the logs:

[2025-06-17T15:57:40,843][WARN ][i.n.c.AbstractChannelHandlerContext] [opensearch-master-02] An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: SSLHandshakeException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: BadPaddingException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)];' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception: io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

I don’t know how to resolve this; I didn’t change anything in the configuration. I don’t know if these logs are related to an OOM condition or something else.

My cluster is composed of 9 data nodes (64 GB RAM and 12 TB storage each) and 3 master nodes (16 GB RAM).

The cluster was at around 25% allocation.

Please help if you can. This is urgent.

@abarocco Do you know if the certificates were auto-rotated perhaps?
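If you’re not sure, you can check the validity dates on the node certificates directly, something like this (the certificate path is just an example; the real path is in your opensearch.yml):

# Print the validity window of a node certificate
# (path is a placeholder; check opensearch.yml for the actual one)
openssl x509 -in /etc/opensearch/node.pem -noout -dates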

Also, you should check the OpenSearch garbage collection logs for any long GC pauses or memory pressure.
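For example, something like this should surface any long pauses (the GC log path is just an example, adjust it to your installation; the 1-second threshold is arbitrary):

# List GC pauses longer than 1 second from the JVM GC log
grep -E "Pause (Young|Full)" /var/log/opensearch/gc.log | awk '{ ms=$NF; sub(/ms$/, "", ms); if (ms+0 > 1000) print }'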

Also, verify memory pressure on the individual nodes with tools like “dmesg” or “free”.
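For example:

# Look for OOM killer activity and check current memory headroom
dmesg -T | grep -iE "out of memory|oom-killer"
free -h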

Lastly, how much heap is assigned to the nodes?
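You can read this off the cluster itself, for example (host, port, and credentials below are placeholders):

# Per-node heap usage and limits via the _cat API
# (host, port, and admin:admin credentials are placeholders)
curl -sk -u admin:admin "https://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max"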

Hello @Anthony, thanks for the reply.
I don’t have auto-rotated certificates.

I checked the GC logs and this is what I have:

[2025-06-18T12:32:10.759+0000][3530396][gc,start    ] GC(522) Pause Young (Normal) (G1 Evacuation Pause)
[2025-06-18T12:32:10.759+0000][3530396][gc,task     ] GC(522) Using 4 workers of 4 for evacuation
[2025-06-18T12:32:10.759+0000][3530396][gc,age      ] GC(522) Desired survivor size 322961408 bytes, new threshold 15 (max threshold 15)
[2025-06-18T12:32:10.770+0000][3530396][gc,phases   ] GC(522)   Pre Evacuate Collection Set: 0.4ms
[2025-06-18T12:32:10.770+0000][3530396][gc,phases   ] GC(522)   Merge Heap Roots: 0.3ms
[2025-06-18T12:32:10.770+0000][3530396][gc,phases   ] GC(522)   Evacuate Collection Set: 7.7ms
[2025-06-18T12:32:10.770+0000][3530396][gc,phases   ] GC(522)   Post Evacuate Collection Set: 1.8ms
[2025-06-18T12:32:10.770+0000][3530396][gc,phases   ] GC(522)   Other: 0.1ms
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) Age table with threshold 15 (max threshold 15)
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   1:    1144104 bytes,    1144104 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   2:    1095040 bytes,    2239144 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   3:    1057280 bytes,    3296424 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   4:    1049120 bytes,    4345544 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   5:    1233008 bytes,    5578552 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   6:      40880 bytes,    5619432 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   7:    4154280 bytes,    9773712 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   8:      20040 bytes,    9793752 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age   9:     142992 bytes,    9936744 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  10:     297472 bytes,   10234216 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  11:     657784 bytes,   10892000 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  12:     654416 bytes,   11546416 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  13:     737984 bytes,   12284400 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  14:     638928 bytes,   12923328 total
[2025-06-18T12:32:10.770+0000][3530396][gc,age      ] GC(522) - age  15:     735808 bytes,   13659136 total
[2025-06-18T12:32:10.770+0000][3530396][gc,heap     ] GC(522) Eden regions: 1224->0(1224)
[2025-06-18T12:32:10.770+0000][3530396][gc,heap     ] GC(522) Survivor regions: 4->4(154)
[2025-06-18T12:32:10.770+0000][3530396][gc,heap     ] GC(522) Old regions: 190->191
[2025-06-18T12:32:10.770+0000][3530396][gc,heap     ] GC(522) Humongous regions: 2->2
[2025-06-18T12:32:10.770+0000][3530396][gc,metaspace] GC(522) Metaspace: 170682K(174336K)->170682K(174336K) NonClass: 149891K(151744K)->149891K(151744K) Class: 20791K(22592K)->20791K(22592K)
[2025-06-18T12:32:10.770+0000][3530396][gc          ] GC(522) Pause Young (Normal) (G1 Evacuation Pause) 5677M->784M(8192M) 10.667ms
[2025-06-18T12:32:10.770+0000][3530396][gc,cpu      ] GC(522) User=0.04s Sys=0.00s Real=0.01s
[2025-06-18T12:32:10.770+0000][3530396][safepoint   ] Safepoint "G1CollectForAllocation", Time since last: 5547621622 ns, Reaching safepoint: 11074 ns, Cleanup: 6335 ns, At safepoint: 10751440 ns, Total: 10768849 ns

On the data nodes I set a 32 GB heap size (64 GB total per VM), and on the master nodes 8 GB (16 GB total).
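This is set in config/jvm.options on each node, e.g. on the data nodes:

# config/jvm.options on the data nodes (masters use -Xms8g/-Xmx8g)
-Xms32g
-Xmx32g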

This appears to be in order. If there were indeed no configuration changes, this could be caused by clock drift; I would recommend making sure the nodes’ clocks are synchronized.
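For example, with chrony:

# Check NTP synchronisation status and current offset on each node
chronyc tracking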

Lastly, of course, this could be the result of packets being dropped or truncated on the network, leaving the nodes unable to properly decrypt them.
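If you suspect that, a packet capture on the transport port can show truncated or reset TLS records (the interface and port below are assumptions; 9300 is the default transport port):

# Capture inter-node transport traffic for later inspection
sudo tcpdump -i any -w transport.pcap port 9300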

OK, thank you.

@Anthony So you suggest checking whether there is a network problem, or something else?
There is a specific team in my company that works on the network. What details can I provide so they can run the best possible network check?

I also checked whether the nodes are synchronized with each other.
This is the “chronyc tracking” output from 2 nodes:

VM 1

Reference ID    : 0AB0011A (VMGCLALTE1088.syssede.systest.spi.com)
Stratum         : 3
Ref time (UTC)  : Wed Jun 18 13:09:39 2025
System time     : 0.000001429 seconds slow of NTP time
Last offset     : +0.000037964 seconds
RMS offset      : 0.000025977 seconds
Frequency       : 1.282 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.007 ppm
Root delay      : 0.008408082 seconds
Root dispersion : 0.001250133 seconds
Update interval : 1040.4 seconds
Leap status     : Normal

VM 2

Reference ID    : 0AB0011A (VMGCLALTE1088.syssede.systest.spi.com)
Stratum         : 3
Ref time (UTC)  : Wed Jun 18 12:59:12 2025
System time     : 0.000012626 seconds slow of NTP time
Last offset     : -0.000029539 seconds
RMS offset      : 0.000023536 seconds
Frequency       : 87.303 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.007 ppm
Root delay      : 0.007700196 seconds
Root dispersion : 0.001777516 seconds
Update interval : 1024.3 seconds
Leap status     : Normal

Based on the above, the nodes appear to be synchronised. I would recommend reaching out to the networking team and asking them to confirm whether any changes have been made to the following:

Any SSL/TLS interception or DPI
Firewall/NAT timeouts or TCP session resets
MTU (Maximum Transmission Unit) changes (unlikely)

If there is anything running between the nodes, it would be good to bypass it and see if the connections recover; this will help you narrow down the issue.

Also ask if they can confirm whether they are seeing packet loss, retransmits, or routing issues between the nodes.
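A few commands you can run yourself to gather evidence for them (the host name is a placeholder; the 1472-byte ping assumes a standard 1500-byte MTU):

# TCP retransmission counters on a node
netstat -s | grep -i retrans

# Per-connection retransmit details for the transport connections
ss -ti '( sport = :9300 or dport = :9300 )'

# Path MTU check: 1472 bytes of payload + 28 bytes of headers = 1500
ping -M do -s 1472 other-node

# Per-hop loss and routing report (if mtr is installed)
mtr -rw other-node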