SSL Exception Connection Reset Error on Master Nodes

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
bitnami/opensearch
2.15.0-debian-12-r4

Describe the issue:
We have configured an OpenSearch cluster with 3 master nodes and 6 data nodes.
The data is being indexed normally, and everything appears to be functioning without issues.
However, we encounter SSL Exception errors about once or twice an hour.
The cluster status shows that everything is healthy.
The problematic logs appear only on the three master nodes.

What could be causing the masters to experience connection resets?

Configuration:

opensearch:
  enabled: true
 
  extraConfig:
    plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
    plugins.security.ssl.http.enabled: true
    plugins.security.allow_default_init_securityindex: true
 
  security:
    enabled: true
    adminPassword: "1234"
    logstashPassword: "1234"
    tls:
      restEncryption: true
      autoGenerated: true
      verificationMode: "none"
 
  service:
    type: NodePort
    ports:
      restAPI: 9200
      transport: 9300
    nodePorts:
      restAPI: 31031
      transport: 31032
 
  master:
    replicaCount: 3
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 100Gi
 
 
  data:
    replicaCount: 6
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 4700Gi
    extraRoles:
      - "ingest"
  
  coordinating:
    replicaCount: 0
 
  ingest:
    enabled: false
    replicaCount: 0
 
  dashboards:
    enabled: true # :: dashboard
    service:
      type: NodePort
      nodePorts:
        http: 31030
    password: "1234"
    persistence:
      enabled: true
      size: 50Gi

Relevant Logs or Screenshots:

<_cat/nodes>

10.42.235.16  50 36 1 4.25 4.47 4.86 m  cluster_manager * common-opensearch-master-2
10.42.190.31  58 85 2 5.19 5.10 5.34 di data,ingest     - common-opensearch-data-0
10.42.137.212 37 35 1 4.94 3.43 2.59 m  cluster_manager - common-opensearch-master-0
10.42.137.253 72 93 2 4.94 3.43 2.59 di data,ingest     - common-opensearch-data-3
10.42.235.7   66 97 4 4.25 4.47 4.86 di data,ingest     - common-opensearch-data-4
10.42.134.219 35 35 0 1.31 1.95 2.39 m  cluster_manager - common-opensearch-master-1
10.42.134.214 38 91 0 1.31 1.95 2.39 di data,ingest     - common-opensearch-data-2
10.42.118.235 16 79 1 4.58 2.55 2.27 di data,ingest     - common-opensearch-data-1
10.42.69.152  26 95 1 5.06 4.52 4.12 di data,ingest     - common-opensearch-data-5

<_cluster/health>

{
  "cluster_name": "open",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "number_of_data_nodes": 6,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 154,
  "active_shards": 323,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Does your cluster use an ingress for connecting users to the cluster itself through port 80 or 443?

e.g. https://[cluster_name].[domain]

We access it only internally within the Kubernetes cluster via a NodePort service.
For example, using:
https://common-opensearch:9200
or
https://common-opensearch.{namespace}:9200

I checked the logs of the services, and the same SSL exception / connection reset error is occurring.
Could the cause of the problem be the client connecting to the master, rather than something within the cluster itself?

Looks like I’ve found an existing issue reporting the same bug.

Yes, this is purely on the HTTP client/server communication side. We are struggling to reproduce the issue; I am wondering if you have an opportunity to try the previous release (2.14.x) and see if the issue happens there as well. Thank you

I have tested as follows. An error occurs regardless of the version… :smiling_face_with_tear:

2.16.0-debian-12-r0 → ERROR
2.14.0-debian-12-r2 → ERROR
2.13.0-debian-12-r3 → ERROR

For your reference, the exporter obtains stats from OpenSearch via HTTP requests for monitoring purposes.
I suspect that the issue may be arising from there.

It seems that the error does not occur when the service we use (prometheuscommunity/elasticsearch-exporter:v1.8.0) is stopped.
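
If anyone wants to reproduce that check, scaling the exporter Deployment down to zero replicas for a while is enough. A minimal patch-file sketch, assuming the exporter runs as a Deployment named elasticsearch-exporter (the name is a placeholder):

# Hypothetical patch file: temporarily scale the exporter Deployment to zero with
# `kubectl patch deployment elasticsearch-exporter --patch-file pause.yaml`,
# then patch replicas back to 1 once the observation window is over.
spec:
  replicas: 0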

For your reference, the error only occurs on the three master nodes in my case.

Thank you @kkoki , the presence of elasticsearch-exporter could give a hint. Do you mind if I update [1] with your comments (or if you could do that yourself, it would be much appreciated)? Thank you.

[1] [BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 · Issue #4718 · opensearch-project/security · GitHub

Yes, I don’t mind. Thank you

Additionally, here is our exporter configuration.
There are error logs in OpenSearch, but it seems that the data is being collected properly.
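
Roughly, it just points the exporter at the cluster’s HTTPS endpoint and skips certificate verification, since the chart’s TLS certificates are auto-generated. A minimal sketch along those lines (the Deployment name, credentials, and flags below are placeholders, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-exporter        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch-exporter
  template:
    metadata:
      labels:
        app: elasticsearch-exporter
    spec:
      containers:
        - name: exporter
          image: prometheuscommunity/elasticsearch-exporter:v1.8.0
          args:
            # scrape the OpenSearch REST API over HTTPS inside the cluster
            - --es.uri=https://admin:1234@common-opensearch:9200
            # the chart auto-generates a self-signed certificate, so verification is skipped
            - --es.ssl-skip-verify
            - --es.all
          ports:
            - containerPort: 9114     # default exporter metrics port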


@kkoki @reta

I am observing similar issues in my OpenSearch 3.0 cluster. Master nodes that were healthy are randomly going into CrashLoopBackOff.

The error I am getting is:

[2025-07-02T08:20:38,891][WARN ][o.o.t.TcpTransport ] [opensearch-master-2] exception caught on transport layer [Netty4TcpChannel{localAddress=/IP:9300, remoteAddress=/IP:32804}], closing connection java.lang.IllegalStateException: transport not ready yet to handle incoming requests

I have the elasticsearch exporter running, as the OpenSearch metrics exporter plugin does not support 3.0 yet.

I am struggling to find the root cause of the issue. Is the elasticsearch exporter the culprit?

@shs_tech Do you see any pattern with this warning? Did you try disabling the elasticsearch exporter for a short time? How often do those warnings appear?

@pablo Whenever I tried without the elasticsearch exporter, the issue didn’t happen. What I observed is that, out of 5 master nodes, 3 were getting OOMKilled and then going into CrashLoopBackOff with the “transport not ready” message.

I’ve increased the memory limit of the master pods for testing. After that, the issue didn’t happen. The environment is under observation.
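
For reference, with the values layout of the bitnami chart shown earlier in this thread, that kind of bump looks roughly like the following (the numbers are placeholders, not my exact values; the JVM heap should stay well below the container memory limit):

master:
  replicaCount: 5
  heapSize: 8g                 # keep heap at roughly half the memory limit or less
  resources:
    requests:
      cpu: "4000m"
      memory: "20Gi"
    limits:
      cpu: "4000m"
      memory: "20Gi"           # raised limit to avoid OOMKilled master pods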

@shs_tech Is this the same issue as in your other thread?

@pablo Yes… I raised it as a different topic because I specifically hit it on OpenSearch 3.0.0.