SSL Exception Connection Reset Error on Master Nodes

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
bitnami/opensearch
2.15.0-debian-12-r4

Describe the issue:
We have configured an OpenSearch cluster with 3 master nodes and 6 data nodes.
The data is being indexed normally, and everything appears to be functioning without issues.
However, we encounter SSL Exception errors about once or twice an hour.
The cluster status shows that everything is healthy.
The problematic logs appear only on the three master nodes.

What could be causing the masters to experience connection resets?

Configuration:

opensearch:
  enabled: true
 
  extraConfig:
    plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
    plugins.security.ssl.http.enabled: true
    plugins.security.allow_default_init_securityindex: true
 
  security:
    enabled: true
    adminPassword: "1234"
    logstashPassword: "1234"
    tls:
      restEncryption: true
      autoGenerated: true
      verificationMode: "none"
 
  service:
    type: NodePort
    ports:
      restAPI: 9200
      transport: 9300
    nodePorts:
      restAPI: 31031
      transport: 31032
 
  master:
    replicaCount: 3
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 100Gi
 
 
  data:
    replicaCount: 6
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 4700Gi
    extraRoles:
      - "ingest"
  
  coordinating:
    replicaCount: 0
 
  ingest:
    enabled: false
    replicaCount: 0
 
  dashboards:
    enabled: true # :: dashboard
    service:
      type: NodePort
      nodePorts:
        http: 31030
    password: "1234"
    persistence:
      enabled: true
      size: 50Gi

Relevant Logs or Screenshots:

<_cat/nodes>

10.42.235.16  50 36 1 4.25 4.47 4.86 m  cluster_manager * common-opensearch-master-2
10.42.190.31  58 85 2 5.19 5.10 5.34 di data,ingest     - common-opensearch-data-0
10.42.137.212 37 35 1 4.94 3.43 2.59 m  cluster_manager - common-opensearch-master-0
10.42.137.253 72 93 2 4.94 3.43 2.59 di data,ingest     - common-opensearch-data-3
10.42.235.7   66 97 4 4.25 4.47 4.86 di data,ingest     - common-opensearch-data-4
10.42.134.219 35 35 0 1.31 1.95 2.39 m  cluster_manager - common-opensearch-master-1
10.42.134.214 38 91 0 1.31 1.95 2.39 di data,ingest     - common-opensearch-data-2
10.42.118.235 16 79 1 4.58 2.55 2.27 di data,ingest     - common-opensearch-data-1
10.42.69.152  26 95 1 5.06 4.52 4.12 di data,ingest     - common-opensearch-data-5

<_cluster/health>

{
  "cluster_name": "open",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "number_of_data_nodes": 6,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 154,
  "active_shards": 323,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Does your cluster use an ingress for connecting users to the cluster itself through port 80 or 443?

e.g. https://[cluster_name].[domain]

We access it only internally within the Kubernetes cluster via a NodePort service.
For example, using:
https://common-opensearch:9200
or
https://common-opensearch.{namespace}:9200

I checked the logs of the services, and the same SSL exception / connection reset error is occurring.
Could the cause of the problem be the client connecting to the master, rather than something within the cluster itself?

Looks like I’ve found an existing issue reporting the same bug.

Yes, this is purely on the HTTP client/server communication side. We are struggling to reproduce the issue; I am wondering if you have an opportunity to try the previous release (2.14.x) and see if the issue happens there as well. Thank you

I have tested as follows. An error occurs regardless of the version… :smiling_face_with_tear:

2.16.0-debian-12-r0 → ERROR
2.14.0-debian-12-r2 → ERROR
2.13.0-debian-12-r3 → ERROR

For your reference, the exporter obtains stats from OpenSearch via HTTP requests for monitoring purposes.
I suspect that the issue may be arising from there.

It seems that the error does not occur when the service we use (prometheuscommunity/elasticsearch-exporter:v1.8.0) is stopped.
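
If anyone wants to reproduce that check, scaling the exporter Deployment down to zero replicas for a while is enough. A minimal patch-file sketch, assuming the exporter runs as a Deployment named elasticsearch-exporter (the name is a placeholder):

# Hypothetical patch file: temporarily scale the exporter Deployment to zero with
# `kubectl patch deployment elasticsearch-exporter --patch-file pause.yaml`,
# then patch replicas back to 1 once the observation window is over.
spec:
  replicas: 0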

For your reference, the error only occurs on the three master nodes in my case.

Thank you @kkoki , the presence of elasticsearch-exporter could give a hint. Do you mind if I update [1] with your comments (or if you could do that yourself, it would be much appreciated)? Thank you.

[1] [BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 · Issue #4718 · opensearch-project/security · GitHub

Yes, I don’t mind. Thank you

Additionally, here is our exporter configuration.
There are error logs in OpenSearch, but it seems that the data is being collected properly.
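
Roughly, it just points the exporter at the cluster’s HTTPS endpoint and skips certificate verification, since the chart’s TLS certificates are auto-generated. A minimal sketch along those lines (the Deployment name, credentials, and flags below are placeholders, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-exporter        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch-exporter
  template:
    metadata:
      labels:
        app: elasticsearch-exporter
    spec:
      containers:
        - name: exporter
          image: prometheuscommunity/elasticsearch-exporter:v1.8.0
          args:
            # scrape the OpenSearch REST API over HTTPS inside the cluster
            - --es.uri=https://admin:1234@common-opensearch:9200
            # the chart auto-generates a self-signed certificate, so verification is skipped
            - --es.ssl-skip-verify
            - --es.all
          ports:
            - containerPort: 9114     # default exporter metrics port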


@kkoki @reta

I am observing similar issues in my OpenSearch 3.0 cluster. Master nodes that were healthy are randomly going into CrashLoopBackOff.

The error I am getting is:

[2025-07-02T08:20:38,891][WARN ][o.o.t.TcpTransport ] [opensearch-master-2] exception caught on transport layer [Netty4TcpChannel{localAddress=/IP:9300, remoteAddress=/IP:32804}], closing connection java.lang.IllegalStateException: transport not ready yet to handle incoming requests

I have the elasticsearch exporter running, as the OpenSearch metrics exporter plugin does not support 3.0 yet.

I am struggling to find the root cause of the issue. Is the elasticsearch exporter the culprit?

@shs_tech Do you see any pattern with this warning? Did you try disabling the elasticsearch exporter for a short time? How often do those warnings appear?

@pablo Whenever I tried without the elasticsearch exporter, the issue didn’t happen. What I observed is that, out of 5 master nodes, 3 were getting OOMKilled and then going into CrashLoopBackOff with the “transport not ready” message.

I’ve increased the memory limit of the master pods for testing. After that, the issue didn’t happen. The environment is under observation.
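
For reference, with the values layout of the bitnami chart shown earlier in this thread, that kind of bump looks roughly like the following (the numbers are placeholders, not my exact values; the JVM heap should stay well below the container memory limit):

master:
  replicaCount: 5
  heapSize: 8g                 # keep heap at roughly half the memory limit or less
  resources:
    requests:
      cpu: "4000m"
      memory: "20Gi"
    limits:
      cpu: "4000m"
      memory: "20Gi"           # raised limit to avoid OOMKilled master pods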

@shs_tech Is this the same issue as in your other thread?

@pablo Yes… I raised it as a different topic because I specifically hit it on OpenSearch 3.0.0.