Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.3.0 and OpenSearch Dashboards 2.3.0
Describe the issue:
One of our data nodes is constantly leaving and rejoining the cluster.
In the log of this server I can see that applying the cluster state and checking the health of the filesystem takes more than 5 seconds.
The values range from 10 to 40 seconds for this node.
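If I understand it correctly, the 5 second threshold comes from the filesystem health check, so I assume these are the relevant opensearch.yml settings (setting names as I found them, values only as an example; we have not changed them):
monitor.fs.health.enabled: true
monitor.fs.health.refresh_interval: 120s
monitor.fs.health.slow_path_logging_threshold: 5s
Would raising these thresholds be a valid workaround, or would that only hide the underlying problem?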
The OS of the server is AlmaLinux 8.6 with the latest updates installed, and for the filesystem we use ZFS on Linux with gzip-5 compression and encryption.
We also checked the filesystem with the ZFS tools and could not find any problems.
The node is also not even at half the load it can handle in terms of CPU and RAM usage.
Currently the node is at a load of about 12 with 32 CPU cores.
The configured JVM heap is 31 GB and the server is not at its RAM limit with this setting.
We increased the log level to debug, but it did not log any more information about why the node is that slow.
Even after a fresh installation or a reboot the problem persists.
Our full cluster layout is 3 manager nodes, 3 coordinating nodes, 27 hot nodes, 4 warm nodes and 2 cold nodes.
We currently store 444 indices with 4949 shards and about 90 TB of data.
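If I calculate correctly, that works out to roughly 150 shards per data node (4949 shards spread across the 33 hot/warm/cold nodes), in case that helps to judge the sizing.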
Configuration:
Our cluster configuration is almost entirely the default settings, except for the disk watermarks.
For every node we specified these values in opensearch.yml, apart from some node-specific parameters:
cluster.name: opensearch-cluster
bootstrap.memory_lock: true
node.name: ${HOSTNAME}
http.max_content_length: 500mb
indices.query.bool.max_clause_count: 70000
thread_pool.write.queue_size: 10000
thread_pool.search.queue_size: 10000
gateway.auto_import_dangling_indices: true
thread_pool.search.max_queue_size: 10000
thread_pool.search.min_queue_size: 10000
node.roles: [ data ]
node.attr.temp: warm
node.attr.zone: rz
path.data: /data
path.logs: /var/log/opensearch
network.host: 0.0.0.0
http.port: 9211
transport.port: 9311
cluster.initial_master_nodes: ["10.0.0.2:9311", "10.0.0.3:9311", "10.0.0.4:9311"]
discovery.seed_hosts: ["master1:9311", "master2:9311", "master3:9311"]
cluster.publish.timeout: 90000ms
cluster.follower_lag.timeout: 200000ms
plugins.security.ssl.transport.pemcert_filepath: /etc/opensearch/certs/node.pem
plugins.security.ssl.transport.pemkey_filepath: /etc/opensearch/certs/node-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /etc/opensearch/certs/ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: /etc/opensearch/certs/node.pem
plugins.security.ssl.http.pemkey_filepath: /etc/opensearch/certs/node-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /etc/opensearch/certs/ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
plugins.security.nodes_dn:
- 'CN=X,CN=X,CN=X,CN=srvar0258,CN=X,E=X,CN=X,OU=X,O=X,L=X,S=X,C=X'
- 'CN=X,CN=X,CN=X,CN=srvar0258,CN=X,E=X,CN=X,OU=X,O=X,L=X,S=X,C=X'
- 'CN=X,OU=X,O=X,E=X'
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
- 'CN=X,OU=X,O=X,L=X,ST=X,C=X'
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]
node.max_local_storage_nodes: 3
plugins.security.ssl.http.clientauth_mode: "OPTIONAL"
Relevant Logs or Screenshots:
On the problematic server I can see this in the OpenSearch log:
And this is the log from one of the manager nodes:
I hope this is enough information about the issue.
First of all, I would like to know how to further debug this issue.
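I was thinking about looking at GET _nodes/hot_threads and GET _cluster/pending_tasks while the node is dropping out, but I am not sure whether that is the right direction or whether there is something ZFS-specific (e.g. the compression/encryption overhead) we should check first.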
Secondly, I am interested in whether this concept works with our hardware, or whether these are too many shards and indices for the number of nodes.
And lastly, I would like to know if there are better concepts we should work towards, such as moving old data to a separate OpenSearch cluster.
Many thanks in advance.