Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.3.0 and OpenSearch Dashboards 2.3.0
Describe the issue:
One of our data nodes is constantly leaving and rejoining the cluster.
In the log of this server I can see that applying the cluster state and checking the health of the filesystem takes more than 5 seconds.
The values range from 10 to 40 seconds for this node.
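If I understand it correctly, the 5 second threshold comes from the filesystem health check, so I assume these are the relevant opensearch.yml settings (setting names as I found them, values only as an example; we have not changed them):
monitor.fs.health.enabled: true
monitor.fs.health.refresh_interval: 120s
monitor.fs.health.slow_path_logging_threshold: 5s
Would raising these thresholds be a valid workaround, or would that only hide the underlying problem?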
The OS of the server is AlmaLinux 8.6 with the latest updates installed, and for the filesystem we use ZFS on Linux with gzip-5 compression and encryption.
We also checked the filesystem with the ZFS tools and could not find any problems.
The node is also not even at half the load it can handle in terms of CPU and RAM usage.
Currently the node is at a load of about 12 with 32 CPU cores.
The configured JVM heap is 31 GB and the server is not at its RAM limit with this setting.
We increased the log level to debug, but it did not log any more information about why the node is that slow.
Even after a fresh installation or a reboot the problem persists.
Our full cluster layout is 3 manager nodes, 3 coordinating nodes, 27 hot nodes, 4 warm nodes and 2 cold nodes.
We currently store 444 indices with 4949 shards and about 90 TB of data.
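If I calculate correctly, that works out to roughly 150 shards per data node (4949 shards spread across the 33 hot/warm/cold nodes), in case that helps to judge the sizing.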
Configuration:
Our cluster configuration is almost entirely the default settings, except for the disk watermarks.
For every node we specified these values in opensearch.yml, apart from some node-specific parameters:
cluster.name: opensearch-cluster
bootstrap.memory_lock: true
node.name: ${HOSTNAME}
http.max_content_length: 500mb
indices.query.bool.max_clause_count: 70000
thread_pool.write.queue_size: 10000
thread_pool.search.queue_size: 10000
gateway.auto_import_dangling_indices: true
thread_pool.search.max_queue_size: 10000
thread_pool.search.min_queue_size: 10000
node.roles: [ data ]
node.attr.temp: warm
node.attr.zone: rz
path.data: /data
path.logs: /var/log/opensearch
network.host: 0.0.0.0
http.port: 9211
transport.port: 9311
cluster.initial_master_nodes: ["10.0.0.2:9311", "10.0.0.3:9311", "10.0.0.4:9311"]
discovery.seed_hosts: ["master1:9311", "master2:9311", "master3:9311"]
cluster.publish.timeout: 90000ms
cluster.follower_lag.timeout: 200000ms
plugins.security.ssl.transport.pemcert_filepath: /etc/opensearch/certs/node.pem
plugins.security.ssl.transport.pemkey_filepath: /etc/opensearch/certs/node-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /etc/opensearch/certs/ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: /etc/opensearch/certs/node.pem
plugins.security.ssl.http.pemkey_filepath: /etc/opensearch/certs/node-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /etc/opensearch/certs/ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
plugins.security.nodes_dn:
- 'CN=X,CN=X,CN=X,CN=srvar0258,CN=X,E=X,CN=X,OU=X,O=X,L=X,S=X,C=X'
- 'CN=X,CN=X,CN=X,CN=srvar0258,CN=X,E=X,CN=X,OU=X,O=X,L=X,S=X,C=X'
- 'CN=X,OU=X,O=X,E=X'
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]
node.max_local_storage_nodes: 3
plugins.security.ssl.http.clientauth_mode: "OPTIONAL"
Relevant Logs or Screenshots:
On the problematic server I can see this in the OpenSearch log:
And this is the log from one of the manager nodes:
I hope this is enough information about the issue.
First of all, I would like to know how to further debug this issue.
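I was thinking about looking at GET _nodes/hot_threads and GET _cluster/pending_tasks while the node is dropping out, but I am not sure whether that is the right direction or whether there is something ZFS-specific (e.g. the compression/encryption overhead) we should check first.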
Secondly, I am interested in whether this concept works with our hardware, or whether these are too many shards and indices for the number of nodes.
And lastly, I would like to know if there are better concepts we should work towards, such as moving old data to a separate OpenSearch cluster.
Many thanks in advance.