Dropping 1 node of cluster results in unstable cluster and all shards being unassigned

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.9

Describe the issue:

When one data node of my cluster drops:

  • load on the remaining data nodes jumps from ~4 to 16+
  • as the load climbs, the data nodes appear to disconnect from the cluster, then reconnect
  • this quickly leaves all shards unassigned and the cluster unstable and nonfunctional: no ingestion takes place, and OpenSearch Dashboards connections fail

Any advice on what might be causing the jump in load on the nodes, and on how to prevent the cluster from destabilizing, would be appreciated.

~Jaimie

Configuration:

opensearch.yml
---

# OpenSearch v2.4+
# /etc/opensearch/opensearch.yml
################################################################################

# OpenSearch Cluster Architecture
# ==============================================================================
# Cluster Name: ENGLOG-OPENSRCH
# OpenDashboard + Opensearch Coordinating
  # node010 has address 10.67.7.30
  # node020 has address 10.67.7.40
# OpenDashboard + Opensearch Manager + Remote Client
  # node011 has address 10.67.7.31
  # node021 has address 10.67.7.41
# OpenDashboard + Opensearch Manager + Ingestion
  # node012 has address 10.67.7.32
  # node013 has address 10.67.7.33
  # node022 has address 10.67.7.42
  # node023 has address 10.67.7.43
# Opensearch Data
  # node014 has address 10.67.7.34
  # node015 has address 10.67.7.35
  # node016 has address 10.67.7.36
  # node017 has address 10.67.7.37
  # node018 has address 10.67.7.38
  # node019 has address 10.67.7.39
  # node024 has address 10.67.7.44
  # node025 has address 10.67.7.45
  # node026 has address 10.67.7.46
  # node027 has address 10.67.7.47
  # node028 has address 10.67.7.48
  # node029 has address 10.67.7.49

# Config for node011

cluster.name: ENGLOG-OPENSRCH
node.name: node011
network.host: 10.67.7.31

http.port: 9200

transport.tcp.port: 9300

node.master: true
node.data:   false
node.ingest: false
node.remote_cluster_client: true

cluster.initial_master_nodes:
  - 10.67.7.31
  - 10.67.7.32
  - 10.67.7.33
  - 10.67.7.41
  - 10.67.7.42
  - 10.67.7.43

discovery.seed_hosts:
  - 10.67.7.31
  - 10.67.7.32
  - 10.67.7.33
  - 10.67.7.41
  - 10.67.7.42
  - 10.67.7.43


compatibility.override_main_response_version: true

action.auto_create_index: true
cluster.max_shards_per_node: "3000"

path.data: /data/opensearch
path.logs: /var/log/opensearch
path.repo: ["/mnt/logsnapshot"]


plugins.security.disabled: false
plugins.security.allow_default_init_securityindex: true


plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.http.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem

plugins.security.ssl.http.clientauth_mode: OPTIONAL

plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.resolve_hostname: false
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.transport.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.transport.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem

plugins.security.ssl.transport.truststore_type: pkcs12
plugins.security.ssl.transport.truststore_filepath: /etc/opensearch/certs/englog-ts.pkcs12
plugins.security.ssl.transport.truststore_password: NOTTHIS

plugins.security.authcz.admin_dn:
  - CN=admin,GIVENNAME=central-logging-admin

plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.enable_snapshot_restore_privilege: true

plugins.security.nodes_dn:
  - 'CN=rdcenglog*'

plugins.security.restapi.roles_enabled:
  - "all_access"
  - "security_rest_api_access"

plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
  - ".opendistro-alerting-config"
  - ".opendistro-alerting-alert*"
  - ".opendistro-anomaly-results*"
  - ".opendistro-anomaly-detector*"
  - ".opendistro-anomaly-checkpoints"
  - ".opendistro-anomaly-detection-state"
  - ".opendistro-reports-*"
  - ".opendistro-notifications-*"
  - ".opendistro-notebooks"
  - ".opensearch-observability"
  - ".opendistro-asynchronous-search-response*"
  - ".replication-metadata-store"


{
  "persistent": {
    "cluster.info.update.interval": "1m",
    "cluster.routing.allocation.disk.watermark.flood_stage": "300gb",
    "cluster.routing.allocation.disk.watermark.high": "400gb",
    "cluster.routing.allocation.disk.watermark.low": "500gb",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.node_concurrent_recoveries": "16",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "16",
    "cluster.routing.allocation.node_initial_replicas_recoveries": "16",
    "cluster.routing.use_adaptive_replica_selection": "true",
    "indices.recovery.max_bytes_per_sec": "200GB",
    "plugins.index_state_management.metadata_migration.status": "1",
    "plugins.index_state_management.template_migration.control": "-1"
  },
  "transient": {}
}
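
To put the recovery-related values above in context, here is a minimal sketch that compares them with the stock defaults. The defaults used here (2, 4 and 40mb) are an assumption based on the commonly documented values and should be checked against the 2.9 documentation; the helper name `to_number` is made up for this example. Whether these values contribute to the load spike is not established here.

```python
# Values copied from the persistent cluster settings above.
current = {
    "cluster.routing.allocation.node_concurrent_recoveries": "16",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "16",
    "indices.recovery.max_bytes_per_sec": "200GB",
}

# ASSUMPTION: stock defaults as commonly documented; verify for your version.
assumed_defaults = {
    "cluster.routing.allocation.node_concurrent_recoveries": "2",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "4",
    "indices.recovery.max_bytes_per_sec": "40mb",
}

UNIT_TO_MB = {"mb": 1, "gb": 1024}


def to_number(value: str) -> float:
    """Parse a plain number or a size such as '40mb' / '200GB' (result in MB)."""
    text = value.strip().lower()
    for unit, factor in UNIT_TO_MB.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    return float(text)


# How many times larger each configured value is than the assumed default.
ratios = {
    name: to_number(current[name]) / to_number(assumed_defaults[name])
    for name in current
}

for name, ratio in ratios.items():
    print(f"{name}: {current[name]} is {ratio:g}x the assumed default "
          f"{assumed_defaults[name]}")
```

If the assumed defaults are right, the two concurrency settings are 8x and 4x the defaults, and the recovery bandwidth cap is effectively unlimited.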

Relevant Logs or Screenshots:

Status just after node025 is dropped from cluster
C nodes have 2 CPUs and 32 GB memory
I,R nodes have 4 CPUs and 64 GB memory
D nodes have 6 CPUs and 64 GB memory

timestamp cluster         status node.total node.data shards  pri relo init unassign pending_tasks active_shards_percent
22:12:14  ENGLOG-OPENSRCH red            19        11   3383 2202    0  146    26264           404                 11.4%
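
As a sanity check on the health line above, the arithmetic can be sketched as follows. The shard counts are copied from that line; the data node counts (12 configured, 11 after node025 drops) come from the architecture comments in opensearch.yml. The per-node figures are plain averages, an assumption that shards are spread evenly.

```python
# Figures copied from the cluster health line above (relo is 0 there).
active, initializing, unassigned = 3383, 146, 26264
total_shards = active + initializing + unassigned

# Share of shards that are active, as reported in active_shards_percent.
active_percent = 100 * active / total_shards

# Average shards per data node, assuming an even spread.
per_node_before = total_shards / 12  # all 12 data nodes present
per_node_after = total_shards / 11   # node025 gone

print(f"total shards:            {total_shards}")
print(f"active shards percent:   {active_percent:.1f}%")
print(f"avg per node (12 nodes): {per_node_before:.0f}")
print(f"avg per node (11 nodes): {per_node_after:.0f}")
```

The computed percentage matches the reported 11.4%. Both averages sit below the cluster.max_shards_per_node value of 3000 set in opensearch.yml, and the 12-node average is in the same range as the 2495 delayed shards reported in the log further down.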


name        master node.role heap.percent ram.percent cpu load_1m load_15m
node010 -      -                    6          92   6    0.00     0.01
node011 *      mr                  16          95   9    0.53     0.45
node012 -      im                  48          94  12    0.68     1.88
node013 -      im                  16          92  12    0.64     1.57
node014 -      d                   37          99  17   31.01    19.88
node015 -      d                   41          99  23   12.37     8.58
node016 -      d                   23          99  24   23.36    19.14
node017 -      d                   14          99  15   42.05    24.79
node018 -      d                   22          99  28   34.61    18.00
node019 -      d                   37          99  31   12.93    14.63
node020 -      -                   25          92   6    0.00     0.00
node021 -      mr                  49          94   3    0.07     0.05
node022 -      im                  19          93  12    0.76     1.60
node023 -      im                  39          92  12    0.81     1.52
node024 -      d                   37          99  21   24.13    15.71
node026 -      d                   23          99  27    8.41     9.72
node027 -      d                   39          99  25   12.46    13.77
node028 -      d                   42          99  16    8.35     8.93
node029 -      d                   46          99  34   10.57    11.94
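
To make the load figures easier to compare, here is a minimal sketch that reads the table above and reports the 1-minute load per core for each data node. The rows are copied from the table; the 6-core figure is taken from the "D nodes" line above, and the function name `load_per_core` is made up for this example.

```python
# Rows copied from the node table above:
# name master node.role heap.percent ram.percent cpu load_1m load_15m
CAT_NODES = """\
node010 -      -                    6          92   6    0.00     0.01
node011 *      mr                  16          95   9    0.53     0.45
node012 -      im                  48          94  12    0.68     1.88
node013 -      im                  16          92  12    0.64     1.57
node014 -      d                   37          99  17   31.01    19.88
node015 -      d                   41          99  23   12.37     8.58
node016 -      d                   23          99  24   23.36    19.14
node017 -      d                   14          99  15   42.05    24.79
node018 -      d                   22          99  28   34.61    18.00
node019 -      d                   37          99  31   12.93    14.63
node020 -      -                   25          92   6    0.00     0.00
node021 -      mr                  49          94   3    0.07     0.05
node022 -      im                  19          93  12    0.76     1.60
node023 -      im                  39          92  12    0.81     1.52
node024 -      d                   37          99  21   24.13    15.71
node026 -      d                   23          99  27    8.41     9.72
node027 -      d                   39          99  25   12.46    13.77
node028 -      d                   42          99  16    8.35     8.93
node029 -      d                   46          99  34   10.57    11.94
"""

DATA_NODE_CORES = 6  # from the hardware notes above


def load_per_core(table: str, cores: int) -> dict:
    """Return {node: load_1m / cores} for data nodes above 1.0 per core."""
    result = {}
    for line in table.strip().splitlines():
        name, _master, role, _heap, _ram, _cpu, load_1m, _load_15m = line.split()
        if role == "d":
            ratio = float(load_1m) / cores
            if ratio > 1.0:
                result[name] = round(ratio, 2)
    return result


hot = load_per_core(CAT_NODES, DATA_NODE_CORES)
for name, ratio in sorted(hot.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.2f} load per core")
```

Every one of the 11 remaining data nodes is above 1.0 load per core, while the reported CPU percentage stays between 15 and 34. High load with low CPU often points to threads waiting on I/O, though the table alone does not confirm that.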

Here are some logs from the point node025 dropped

[2024-02-14T16:46:21,798][INFO ][o.o.c.s.ClusterApplierService] [node011] removed {{node025}{kNqSpeeTT82xsaaHaGifyg}{G0LT9EFoR4q3cJnZWfYyWQ}{10.67.7.45}{10.67.7.45:9300}{d}{shard_indexing_pressure_enabled=true}}, term: 97, version: 255842, reason: Publication{term=97, version=255842}
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Cluster node changed, node removed: true, node added: false
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing         ] [node011] Node removed: [kNqSpeeTT82xsaaHaGifyg]
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing         ] [node011] Remove data node from AD version hash ring: kNqSpeeTT82xsaaHaGifyg
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Hash ring build result: true
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing         ] [node011] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 4
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing         ] [node011] Build AD version hash ring successfully
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Detected cluster change event for destination migration
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Reset destination migration process.
[2024-02-14T16:46:21,950][INFO ][o.o.c.r.DelayedAllocationService] [node011] scheduling reroute for delayed shards in [28.1s] (2495 delayed shards)
[2024-02-14T16:46:21,950][WARN ][o.o.c.c.C.CoordinatorPublication] [node011] after [30.2s] publication of cluster state version [255842] is still waiting for {rdcenglog44}{7XjWOFBKSrCIgyg5L2HvxQ}{Z9PTkOe4SVu8Hy496RCBlA}{10.67.7.44}{10.67.7.44:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node018}{CY6iIsLoQ5SnEm4Jr_7pFw}{KrKv0QjYS5yqhrjmYPPlBw}{10.67.7.38}{10.67.7.38:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node014}{Ty_PIQciS6C39L18lshPBQ}{-HOORkjoQx28xDxgIPWG8Q}{10.67.7.34}{10.67.7.34:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node017}{jCYIhazARPGUJgfd5LZAFA}{GgWVZ1mcSP2Pkwtwmnkxcw}{10.67.7.37}{10.67.7.37:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node015}{cxLB2bG9RG20QXac-2E3NQ}{F6Xl6YUnTu-zYoh_KsOSWQ}{10.67.7.35}{10.67.7.35:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node029}{JT2hCftTSLywoHJcDQZ8Ow}{TAJgl5wZRaqJkQs5aAddAA}{10.67.7.49}{10.67.7.49:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node016}{4sVp7JQjQkeCMpgQosk3sg}{4Pf08q4oTgOTFh4gs6S6rQ}{10.67.7.36}{10.67.7.36:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node019}{G697cU55RYGcUv396kUoGQ}{Gh_U5CLRQ-2QBa1KOKODUg}{10.67.7.39}{10.67.7.39:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT]
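
The warning above is hard to read at a glance, so here is a minimal sketch that pulls out the names of the nodes still in the SENT_APPLY_COMMIT state. The sample string is a trimmed excerpt of that warning (two of its eight entries); the regular expression assumes each node is printed as seven brace-delimited fields, as in this log, and the function name `lagging_nodes` is made up for this example.

```python
import re

# Trimmed excerpt of the CoordinatorPublication warning above (2 of 8 entries).
SAMPLE = (
    "publication of cluster state version [255842] is still waiting for "
    "{node018}{CY6iIsLoQ5SnEm4Jr_7pFw}{KrKv0QjYS5yqhrjmYPPlBw}{10.67.7.38}"
    "{10.67.7.38:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], "
    "{node014}{Ty_PIQciS6C39L18lshPBQ}{-HOORkjoQx28xDxgIPWG8Q}{10.67.7.34}"
    "{10.67.7.34:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT]"
)

# First brace group is the node name; six more brace groups follow it
# (node id, ephemeral id, host, transport address, roles, attributes).
NODE_PATTERN = re.compile(
    r"\{([^{}]+)\}(?:\{[^{}]*\}){6} \[SENT_APPLY_COMMIT\]"
)


def lagging_nodes(log_line: str) -> list:
    """Return node names that have not yet applied the published cluster state."""
    return NODE_PATTERN.findall(log_line)


print(lagging_nodes(SAMPLE))
```

Run against the full warning, this lists the eight data nodes that had not acknowledged the new cluster state after 30.2s.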

… followed by lots of the following as loads on the remaining nodes climb and they drop away

[2024-02-14T16:48:28,544][WARN ][o.o.g.G.InternalReplicaShardAllocator] [node011] [nxlog-pf-olms-2023.11.28][0]: failed to list shard for shard_store on node [CY6iIsLoQ5SnEm4Jr_7pFw]
org.opensearch.action.FailedNodeException: Failed node [CY6iIsLoQ5SnEm4Jr_7pFw]
        at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.transport.TransportService$6.handleException(TransportService.java:884) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:379) [opensearch-security-2.9.0.0.jar:2.9.0.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1504) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.transport.TransportService$9.run(TransportService.java:1356) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.NodeDisconnectedException: [node018][10.67.7.38:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected



Hi, any updates about this?

No updates. No change to cluster response to a dropped node. The cluster will eventually recover, but it takes a long time.

We currently operate on v2.9.0 and plan to upgrade the cluster to 2.15 in July or August, after a thorough evaluation.

~Jaimie