Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.9.0
Describe the issue:
When one data node of my cluster drops:
- load on the remaining data nodes jumps from ~4 to 16+
- as the loads climb, the data nodes appear to disconnect from the cluster, then reconnect
- this quickly results in all shards being unassigned and an unstable, nonfunctional cluster
- no ingestion takes place, and OpenSearch Dashboards connections fail
Any advice on what might be causing the jump in load on the data nodes, and how to prevent the cluster from destabilizing, would be appreciated.
~Jaimie
Configuration:
opensearch.yml
---
# OpenSearch v2.4+
# /etc/opensearch/opensearch.yml
################################################################################
# OpenSearch Cluster Architecture
# ==============================================================================
# Cluster Name: ENGLOG-OPENSRCH
# OpenSearch Dashboards + OpenSearch Coordinating
# node010 has address 10.67.7.30
# node020 has address 10.67.7.40
# OpenSearch Dashboards + OpenSearch Cluster Manager + Remote Client
# node011 has address 10.67.7.31
# node021 has address 10.67.7.41
# OpenSearch Dashboards + OpenSearch Cluster Manager + Ingestion
# node012 has address 10.67.7.32
# node013 has address 10.67.7.33
# node022 has address 10.67.7.42
# node023 has address 10.67.7.43
# OpenSearch Data
# node014 has address 10.67.7.34
# node015 has address 10.67.7.35
# node016 has address 10.67.7.36
# node017 has address 10.67.7.37
# node018 has address 10.67.7.38
# node019 has address 10.67.7.39
# node024 has address 10.67.7.44
# node025 has address 10.67.7.45
# node026 has address 10.67.7.46
# node027 has address 10.67.7.47
# node028 has address 10.67.7.48
# node029 has address 10.67.7.49
# Config for node011
cluster.name: ENGLOG-OPENSRCH
node.name: node011
network.host: 10.67.7.31
http.port: 9200
transport.tcp.port: 9300
node.master: true
node.data: false
node.ingest: false
node.remote_cluster_client: true
cluster.initial_master_nodes:
- 10.67.7.31
- 10.67.7.32
- 10.67.7.33
- 10.67.7.41
- 10.67.7.42
- 10.67.7.43
discovery.seed_hosts:
- 10.67.7.31
- 10.67.7.32
- 10.67.7.33
- 10.67.7.41
- 10.67.7.42
- 10.67.7.43
compatibility.override_main_response_version: true
action.auto_create_index: true
cluster.max_shards_per_node: "3000"
path.data: /data/opensearch
path.logs: /var/log/opensearch
path.repo: ["/mnt/logsnapshot"]
plugins.security.disabled: false
plugins.security.allow_default_init_securityindex: true
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.http.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem
plugins.security.ssl.http.clientauth_mode: OPTIONAL
plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.resolve_hostname: false
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.transport.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.transport.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem
plugins.security.ssl.transport.truststore_type: pkcs12
plugins.security.ssl.transport.truststore_filepath: /etc/opensearch/certs/englog-ts.pkcs12
plugins.security.ssl.transport.truststore_password: NOTTHIS
plugins.security.authcz.admin_dn:
- CN=admin,GIVENNAME=central-logging-admin
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.nodes_dn:
- 'CN=rdcenglog*'
plugins.security.restapi.roles_enabled:
- "all_access"
- "security_rest_api_access"
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
- ".opendistro-alerting-config"
- ".opendistro-alerting-alert*"
- ".opendistro-anomaly-results*"
- ".opendistro-anomaly-detector*"
- ".opendistro-anomaly-checkpoints"
- ".opendistro-anomaly-detection-state"
- ".opendistro-reports-*"
- ".opendistro-notifications-*"
- ".opendistro-notebooks"
- ".opensearch-observability"
- ".opendistro-asynchronous-search-response*"
- ".replication-metadata-store"
Cluster settings (persistent/transient, from the cluster settings API):
{
"persistent": {
"cluster.info.update.interval": "1m",
"cluster.routing.allocation.disk.watermark.flood_stage": "300gb",
"cluster.routing.allocation.disk.watermark.high": "400gb",
"cluster.routing.allocation.disk.watermark.low": "500gb",
"cluster.routing.allocation.enable": "all",
"cluster.routing.allocation.node_concurrent_recoveries": "16",
"cluster.routing.allocation.node_initial_primaries_recoveries": "16",
"cluster.routing.allocation.node_initial_replicas_recoveries": "16",
"cluster.routing.use_adaptive_replica_selection": "true",
"indices.recovery.max_bytes_per_sec": "200GB",
"plugins.index_state_management.metadata_migration.status": "1",
"plugins.index_state_management.template_migration.control": "-1"
},
"transient": {}
Relevant Logs or Screenshots:
Status just after node025 is dropped from the cluster.
Coordinating (C) nodes have 2 CPUs and 32 GB of memory.
Ingest and remote-client manager (I, R) nodes have 4 CPUs and 64 GB of memory.
Data (D) nodes have 6 CPUs and 64 GB of memory.
timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks active_shards_percent
22:12:14 ENGLOG-OPENSRCH red 19 11 3383 2202 0 146 26264 404 11.4%
name master node.role heap.percent ram.percent cpu load_1m load_15m
node010 - - 6 92 6 0.00 0.01
node011 * mr 16 95 9 0.53 0.45
node012 - im 48 94 12 0.68 1.88
node013 - im 16 92 12 0.64 1.57
node014 - d 37 99 17 31.01 19.88
node015 - d 41 99 23 12.37 8.58
node016 - d 23 99 24 23.36 19.14
node017 - d 14 99 15 42.05 24.79
node018 - d 22 99 28 34.61 18.00
node019 - d 37 99 31 12.93 14.63
node020 - - 25 92 6 0.00 0.00
node021 - mr 49 94 3 0.07 0.05
node022 - im 19 93 12 0.76 1.60
node023 - im 39 92 12 0.81 1.52
node024 - d 37 99 21 24.13 15.71
node026 - d 23 99 27 8.41 9.72
node027 - d 39 99 25 12.46 13.77
node028 - d 42 99 16 8.35 8.93
node029 - d 46 99 34 10.57 11.94
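(The two tables above are _cat/health and _cat/nodes output, captured with something like the commands below; the host and credentials are placeholders.)

curl -sk -u admin:<password> "https://10.67.7.30:9200/_cat/health?v"
curl -sk -u admin:<password> "https://10.67.7.30:9200/_cat/nodes?v&h=name,master,node.role,heap.percent,ram.percent,cpu,load_1m,load_15m&s=name"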
Here are some logs from the point when node025 dropped:
[2024-02-14T16:46:21,798][INFO ][o.o.c.s.ClusterApplierService] [node011] removed {{node025}{kNqSpeeTT82xsaaHaGifyg}{G0LT9EFoR4q3cJnZWfYyWQ}{10.67.7.45}{10.67.7.45:9300}{d}{shard_indexing_pressure_enabled=true}}, term: 97, version: 255842, reason: Publication{term=97, version=255842}
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Cluster node changed, node removed: true, node added: false
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing ] [node011] Node removed: [kNqSpeeTT82xsaaHaGifyg]
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing ] [node011] Remove data node from AD version hash ring: kNqSpeeTT82xsaaHaGifyg
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Hash ring build result: true
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing ] [node011] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 4
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing ] [node011] Build AD version hash ring successfully
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Detected cluster change event for destination migration
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Reset destination migration process.
[2024-02-14T16:46:21,950][INFO ][o.o.c.r.DelayedAllocationService] [node011] scheduling reroute for delayed shards in [28.1s] (2495 delayed shards)
[2024-02-14T16:46:21,950][WARN ][o.o.c.c.C.CoordinatorPublication] [node011] after [30.2s] publication of cluster state version [255842] is still waiting for {rdcenglog44}{7XjWOFBKSrCIgyg5L2HvxQ}{Z9PTkOe4SVu8Hy496RCBlA}{10.67.7.44}{10.67.7.44:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node018}{CY6iIsLoQ5SnEm4Jr_7pFw}{KrKv0QjYS5yqhrjmYPPlBw}{10.67.7.38}{10.67.7.38:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node014}{Ty_PIQciS6C39L18lshPBQ}{-HOORkjoQx28xDxgIPWG8Q}{10.67.7.34}{10.67.7.34:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node017}{jCYIhazARPGUJgfd5LZAFA}{GgWVZ1mcSP2Pkwtwmnkxcw}{10.67.7.37}{10.67.7.37:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node015}{cxLB2bG9RG20QXac-2E3NQ}{F6Xl6YUnTu-zYoh_KsOSWQ}{10.67.7.35}{10.67.7.35:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node029}{JT2hCftTSLywoHJcDQZ8Ow}{TAJgl5wZRaqJkQs5aAddAA}{10.67.7.49}{10.67.7.49:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node016}{4sVp7JQjQkeCMpgQosk3sg}{4Pf08q4oTgOTFh4gs6S6rQ}{10.67.7.36}{10.67.7.36:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node019}{G697cU55RYGcUv396kUoGQ}{Gh_U5CLRQ-2QBa1KOKODUg}{10.67.7.39}{10.67.7.39:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT]
… followed by lots of entries like the following, as the loads on the remaining nodes climb and they drop away:
[2024-02-14T16:48:28,544][WARN ][o.o.g.G.InternalReplicaShardAllocator] [node011] [nxlog-pf-olms-2023.11.28][0]: failed to list shard for shard_store on node [CY6iIsLoQ5SnEm4Jr_7pFw]
org.opensearch.action.FailedNodeException: Failed node [CY6iIsLoQ5SnEm4Jr_7pFw]
at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.transport.TransportService$6.handleException(TransportService.java:884) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:379) [opensearch-security-2.9.0.0.jar:2.9.0.0]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1504) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.transport.TransportService$9.run(TransportService.java:1356) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.9.0.jar:2.9.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.NodeDisconnectedException: [node018][10.67.7.38:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected
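If it helps with diagnosis, the reason a given shard stays unassigned can be pulled with the allocation explain API; a minimal sketch (index and shard taken from the warning above, host and credentials are placeholders):

curl -sk -u admin:<password> "https://10.67.7.30:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"index": "nxlog-pf-olms-2023.11.28", "shard": 0, "primary": false}'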