SSL Exception Connection Reset Error on Master Nodes

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
bitnami/opensearch
2.15.0-debian-12-r4

Describe the issue:
We have configured an OpenSearch cluster with 3 master nodes and 6 data nodes.
The data is being indexed normally, and everything appears to be functioning without issues.
However, we encounter SSL Exception errors about once or twice an hour.
The cluster status shows that everything is healthy.
The error logs appear only on the three master nodes.

What could be causing the masters to experience connection resets?

Configuration:

opensearch:
  enabled: true
 
  extraConfig:
    plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
    plugins.security.ssl.http.enabled: true
    plugins.security.allow_default_init_securityindex: true
 
  security:
    enabled: true
    adminPassword: "1234"
    logstashPassword: "1234"
    tls:
      restEncryption: true
      autoGenerated: true
      verificationMode: "none"
 
  service:
    type: NodePort
    ports:
      restAPI: 9200
      transport: 9300
    nodePorts:
      restAPI: 31031
      transport: 31032
 
  master:
    replicaCount: 3
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 100Gi
 
 
  data:
    replicaCount: 6
    resources:
      limits:
        cpu: "4000m"
        memory: "16Gi"
      requests:
        cpu: "4000m"
        memory: "16Gi"
    heapSize: 5120m
    persistence:
      size: 4700Gi
    extraRoles:
      - "ingest"
  
  coordinating:
    replicaCount: 0
 
  ingest:
    enabled: false
    replicaCount: 0
 
  dashboards:
    enabled: true # :: dashboard
    service:
      type: NodePort
      nodePorts:
        http: 31030
    password: "1234"
    persistence:
      enabled: true
      size: 50Gi

Relevant Logs or Screenshots:

<_cat/nodes>

10.42.235.16  50 36 1 4.25 4.47 4.86 m  cluster_manager * common-opensearch-master-2
10.42.190.31  58 85 2 5.19 5.10 5.34 di data,ingest     - common-opensearch-data-0
10.42.137.212 37 35 1 4.94 3.43 2.59 m  cluster_manager - common-opensearch-master-0
10.42.137.253 72 93 2 4.94 3.43 2.59 di data,ingest     - common-opensearch-data-3
10.42.235.7   66 97 4 4.25 4.47 4.86 di data,ingest     - common-opensearch-data-4
10.42.134.219 35 35 0 1.31 1.95 2.39 m  cluster_manager - common-opensearch-master-1
10.42.134.214 38 91 0 1.31 1.95 2.39 di data,ingest     - common-opensearch-data-2
10.42.118.235 16 79 1 4.58 2.55 2.27 di data,ingest     - common-opensearch-data-1
10.42.69.152  26 95 1 5.06 4.52 4.12 di data,ingest     - common-opensearch-data-5
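
To pick out the cluster manager nodes from output like the above without reading it by eye, a small parser can help. This is only a sketch: the column names are assumed from the default `_cat/nodes` layout (no header row was included in the output), and `parse_cat_nodes` is a hypothetical helper, not part of any OpenSearch client.

```python
# Assumed column order, inferred from the _cat/nodes output shown above.
COLUMNS = [
    "ip", "heap_percent", "ram_percent", "cpu",
    "load_1m", "load_5m", "load_15m",
    "role_abbrev", "roles", "elected", "name",
]


def parse_cat_nodes(text):
    """Turn whitespace-separated _cat/nodes rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Three rows copied from the output above.
SAMPLE = """
10.42.235.16  50 36 1 4.25 4.47 4.86 m  cluster_manager * common-opensearch-master-2
10.42.190.31  58 85 2 5.19 5.10 5.34 di data,ingest     - common-opensearch-data-0
10.42.137.212 37 35 1 4.94 3.43 2.59 m  cluster_manager - common-opensearch-master-0
"""

nodes = parse_cat_nodes(SAMPLE)
# Nodes that carry the cluster_manager role.
managers = [n["name"] for n in nodes if "cluster_manager" in n["roles"].split(",")]
# The node currently elected as cluster manager is marked with "*".
elected = [n["name"] for n in nodes if n["elected"] == "*"]
print("cluster managers:", managers)
print("elected:", elected)
```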

<_cluster/health>

{
  "cluster_name": "open",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "number_of_data_nodes": 6,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 154,
  "active_shards": 323,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
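
The health response above can also be checked programmatically, for example to confirm that the cluster stays green while the SSL exceptions are being logged. The following is a minimal sketch; `health_problems` is a hypothetical helper, and the fields it checks are simply the ones present in the response above.

```python
import json


def health_problems(health):
    """Return a list of problems found in a _cluster/health response dict."""
    problems = []
    if health.get("status") != "green":
        problems.append("status is %s" % health.get("status"))
    if health.get("timed_out"):
        problems.append("health request timed out")
    # Counters that are expected to be zero on a settled cluster.
    for key in ("unassigned_shards", "initializing_shards",
                "relocating_shards", "number_of_pending_tasks"):
        if health.get(key, 0) != 0:
            problems.append("%s = %s" % (key, health[key]))
    return problems


# Subset of the response shown above.
RESPONSE = """
{
  "cluster_name": "open",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "number_of_data_nodes": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "number_of_pending_tasks": 0
}
"""

print(health_problems(json.loads(RESPONSE)))
```

An empty list means none of the checked fields indicate a problem, which matches what the cluster reports here despite the errors in the master logs.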

Does your cluster use an ingress for user connections to the cluster on port 80 or 443?

For example: https://[cluster_name].[domain]

We access it only internally within the Kubernetes cluster via a NodePort service.
For example, using:
https://common-opensearch:9200
or
https://common-opensearch.{namespace}:9200

I checked the logs of the services, and the same SSL exception (connection reset) error appears there.
Does that mean the cause lies with a client connecting to the masters, rather than within the cluster itself?

It looks like I've found an existing bug report for the same issue.

Yes, this is purely on the HTTP client/server communication side. We are struggling to reproduce the issue. Would you have an opportunity to try the previous release (2.14.x) and see whether the issue happens there as well? Thank you.

I tested the following versions, and the error occurs regardless of the version… :smiling_face_with_tear:

2.16.0-debian-12-r0 → ERROR
2.14.0-debian-12-r2 → ERROR
2.13.0-debian-12-r3 → ERROR

For your reference, the exporter is obtaining stat information via HTTP requests to OpenSearch for monitoring purposes.
I suspect that the issue may be arising from there.

It seems that the error does not occur when the exporter service we use (prometheuscommunity/elasticsearch-exporter:v1.8.0) is stopped.

For your reference, the error only occurs on the three master nodes in my case.
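
For context on what a "connection reset" looks like from the server side, here is a small, self-contained sketch using plain TCP sockets on localhost. It is an illustration only: it does not involve TLS or OpenSearch, `provoke_reset` is a hypothetical name, and nothing in this thread establishes that the exporter closes its connections this way. It only shows that a client which closes abruptly (sending a TCP RST rather than a normal FIN) makes the server's next read fail with a reset.

```python
import socket
import struct
import time


def provoke_reset():
    """Describe what the server side sees after an abrupt client close."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = server.accept()
    conn.settimeout(2.0)

    # SO_LINGER enabled with a zero timeout makes close() send a TCP RST
    # instead of the normal FIN handshake. The "ii" layout is the one used
    # on Linux and macOS; other platforms may need a different struct.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()
    time.sleep(0.1)  # give the RST time to arrive

    try:
        data = conn.recv(1024)
        return "clean close" if data == b"" else "data"
    except ConnectionResetError:
        return "connection reset"
    finally:
        conn.close()
        server.close()


if __name__ == "__main__":
    print(provoke_reset())
```

On a TLS-enabled port, a reset like this would reach the server while it is reading from the encrypted connection, which would be consistent with the errors described here, but that link is an assumption rather than something confirmed in this thread.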

Thank you @kkoki, the presence of elasticsearch-exporter could give us a hint. Do you mind if I update [1] with your comments? (If you could do that yourself, it would be much appreciated.) Thank you.

[1] [BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 · Issue #4718 · opensearch-project/security · GitHub

Yes, I don't mind. Thank you.

Additionally, here is our exporter configuration.
There are error logs in OpenSearch, but it seems that the data is being collected properly.
