Dropping 1 node of cluster results in unstable cluster and all shards being unassigned

jaimie.livingston · February 14, 2024, 11:19pm

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.9

Describe the issue:

When one data node of my cluster drops:

- Load on the remaining data nodes jumps from ~4 to 16+.
- As the load climbs, the data nodes appear to disconnect from the cluster, then reconnect.
- This quickly results in all shards being unassigned and an unstable, nonfunctional cluster where no ingestion takes place and OpenSearch Dashboards connections fail.

Any advice on what might be causing the jump in load on the nodes, and on how to prevent the cluster from destabilizing, would be appreciated.

~Jaimie

Configuration:

opensearch.yml
---

# OpenSearch v2.4+
# /etc/opensearch/opensearch.yml
################################################################################

# OpenSearch Cluster Architecture
# ==============================================================================
# Cluster Name: ENGLOG-OPENSRCH
# OpenDashboard + Opensearch Coordinating
  # node010 has address 10.67.7.30
  # node020 has address 10.67.7.40
# OpenDashboard + Opensearch Manager + Remote Client
  # node011 has address 10.67.7.31
  # node021 has address 10.67.7.41
# OpenDashboard + Opensearch Manager + Ingestion
  # node012 has address 10.67.7.32
  # node013 has address 10.67.7.33
  # node022 has address 10.67.7.42
  # node023 has address 10.67.7.43
# Opensearch Data
  # node014 has address 10.67.7.34
  # node015 has address 10.67.7.35
  # node016 has address 10.67.7.36
  # node017 has address 10.67.7.37
  # node018 has address 10.67.7.38
  # node019 has address 10.67.7.39
  # node024 has address 10.67.7.44
  # node025 has address 10.67.7.45
  # node026 has address 10.67.7.46
  # node027 has address 10.67.7.47
  # node028 has address 10.67.7.48
  # node029 has address 10.67.7.49

# Config for node011

cluster.name: ENGLOG-OPENSRCH
node.name: node011
network.host: 10.67.7.31

http.port: 9200

transport.tcp.port: 9300

node.master: true
node.data:   false
node.ingest: false
node.remote_cluster_client: true

cluster.initial_master_nodes:
  - 10.67.7.31
  - 10.67.7.32
  - 10.67.7.33
  - 10.67.7.41
  - 10.67.7.42
  - 10.67.7.43

discovery.seed_hosts:
  - 10.67.7.31
  - 10.67.7.32
  - 10.67.7.33
  - 10.67.7.41
  - 10.67.7.42
  - 10.67.7.43


compatibility.override_main_response_version: true

action.auto_create_index: true
cluster.max_shards_per_node: "3000"

path.data: /data/opensearch
path.logs: /var/log/opensearch
path.repo: ["/mnt/logsnapshot"]


plugins.security.disabled: false
plugins.security.allow_default_init_securityindex: true


plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.http.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem

plugins.security.ssl.http.clientauth_mode: OPTIONAL

plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.resolve_hostname: false
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.transport.pemcert_filepath: /etc/opensearch/certs/node011-crt.pem
plugins.security.ssl.transport.pemkey_filepath: /etc/opensearch/certs/node011-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /etc/opensearch/certs/englog-host-subca-crt.pem

plugins.security.ssl.transport.truststore_type: pkcs12
plugins.security.ssl.transport.truststore_filepath: /etc/opensearch/certs/englog-ts.pkcs12
plugins.security.ssl.transport.truststore_password: NOTTHIS

plugins.security.authcz.admin_dn:
  - CN=admin,GIVENNAME=central-logging-admin

plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.enable_snapshot_restore_privilege: true

plugins.security.nodes_dn:
  - 'CN=rdcenglog*'

plugins.security.restapi.roles_enabled:
  - "all_access"
  - "security_rest_api_access"

plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
  - ".opendistro-alerting-config"
  - ".opendistro-alerting-alert*"
  - ".opendistro-anomaly-results*"
  - ".opendistro-anomaly-detector*"
  - ".opendistro-anomaly-checkpoints"
  - ".opendistro-anomaly-detection-state"
  - ".opendistro-reports-*"
  - ".opendistro-notifications-*"
  - ".opendistro-notebooks"
  - ".opensearch-observability"
  - ".opendistro-asynchronous-search-response*"
  - ".replication-metadata-store"

{
  "persistent": {
    "cluster.info.update.interval": "1m",
    "cluster.routing.allocation.disk.watermark.flood_stage": "300gb",
    "cluster.routing.allocation.disk.watermark.high": "400gb",
    "cluster.routing.allocation.disk.watermark.low": "500gb",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.node_concurrent_recoveries": "16",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "16",
    "cluster.routing.allocation.node_initial_replicas_recoveries": "16",
    "cluster.routing.use_adaptive_replica_selection": "true",
    "indices.recovery.max_bytes_per_sec": "200GB",
    "plugins.index_state_management.metadata_migration.status": "1",
    "plugins.index_state_management.template_migration.control": "-1"
  },
  "transient": {}
}
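
For readers who want to experiment with the recovery-related values above, here is a minimal sketch of how a persistent setting is changed through the `_cluster/settings` REST endpoint. It only builds the request and does not send it. The helper name `build_settings_request` is hypothetical, the base URL is assembled from the `network.host` and `http.port` values in the config, and the value `"2"` is purely a placeholder for experimentation, not a recommendation made anywhere in this thread. Authentication and TLS trust setup are omitted.

```python
import json
import urllib.request


def build_settings_request(base_url: str, persistent: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /_cluster/settings request.

    Hypothetical helper for illustration only.
    """
    body = json.dumps({"persistent": persistent}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder value: "2" is an assumption chosen for the example,
# not a value taken from the post.
req = build_settings_request(
    "https://10.67.7.31:9200",
    {"cluster.routing.allocation.node_concurrent_recoveries": "2"},
)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would additionally need credentials or an admin client certificate, since the security plugin is enabled in the config above.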

Relevant Logs or Screenshots:

Status just after node025 dropped from the cluster
C (coordinating) nodes have 2 CPUs and 32 GB memory
I, R (ingestion and remote-client manager) nodes have 4 CPUs and 64 GB memory
D (data) nodes have 6 CPUs and 64 GB memory

timestamp cluster         status node.total node.data shards  pri relo init unassign pending_tasks active_shards_percent
22:12:14  ENGLOG-OPENSRCH red            19        11   3383 2202    0  146    26264           404                 11.4%
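
To put the health line in context, the following sketch redoes the arithmetic on the numbers reported above. All figures are copied from the post; the function name `shard_summary` is hypothetical, and the count of twelve configured data nodes is taken from the architecture comments in `opensearch.yml`.

```python
# Figures copied from the health line at 22:12:14.
ACTIVE = 3383        # "shards" column
INITIALIZING = 146   # "init" column
UNASSIGNED = 26264   # "unassign" column
DATA_NODES_CONFIGURED = 12  # node014-019 and node024-029 in the config comments
DATA_NODES_REMAINING = 11   # "node.data" column after node025 dropped
MAX_SHARDS_PER_NODE = 3000  # cluster.max_shards_per_node in opensearch.yml


def shard_summary(active: int, initializing: int, unassigned: int) -> tuple[int, float]:
    """Return (total shards, percent active); relo is 0 in the post."""
    total = active + initializing + unassigned
    return total, 100 * active / total


total, active_pct = shard_summary(ACTIVE, INITIALIZING, UNASSIGNED)
print(f"total shards:        {total}")
print(f"active percent:      {active_pct:.1f}%")
print(f"per data node (12):  {total / DATA_NODES_CONFIGURED:.0f}")
print(f"per data node (11):  {total / DATA_NODES_REMAINING:.0f}")
print(f"configured setting:  {MAX_SHARDS_PER_NODE}")
```

This gives 29793 shards in total, or roughly 2483 per data node with all twelve present, which is in line with the 2495 delayed shards logged when node025 left.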


name        master node.role heap.percent ram.percent cpu load_1m load_15m
node010 -      -                    6          92   6    0.00     0.01
node011 *      mr                  16          95   9    0.53     0.45
node012 -      im                  48          94  12    0.68     1.88
node013 -      im                  16          92  12    0.64     1.57
node014 -      d                   37          99  17   31.01    19.88
node015 -      d                   41          99  23   12.37     8.58
node016 -      d                   23          99  24   23.36    19.14
node017 -      d                   14          99  15   42.05    24.79
node018 -      d                   22          99  28   34.61    18.00
node019 -      d                   37          99  31   12.93    14.63
node020 -      -                   25          92   6    0.00     0.00
node021 -      mr                  49          94   3    0.07     0.05
node022 -      im                  19          93  12    0.76     1.60
node023 -      im                  39          92  12    0.81     1.52
node024 -      d                   37          99  21   24.13    15.71
node026 -      d                   23          99  27    8.41     9.72
node027 -      d                   39          99  25   12.46    13.77
node028 -      d                   42          99  16    8.35     8.93
node029 -      d                   46          99  34   10.57    11.94
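
As a rough way to read the table, the sketch below divides `load_1m` by the CPU count of the data nodes (6, per the node sizes given in the post). Only three rows are copied in for brevity, and the function name `load_per_core` is hypothetical. On Linux, load average also counts tasks blocked on I/O, so a high ratio alongside a low `cpu` percentage is one possible hint of I/O wait, not a diagnosis.

```python
# Three data-node rows copied from the table above (subset for brevity).
NODES = """\
node014 -      d                   37          99  17   31.01    19.88
node017 -      d                   14          99  15   42.05    24.79
node028 -      d                   42          99  16    8.35     8.93
"""
DATA_NODE_CPUS = 6  # data nodes have 6 CPUs according to the post


def load_per_core(table: str, cpus: int) -> dict[str, float]:
    """Map each data node name to its 1-minute load divided by CPU count."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed rows
        name, _master, role, _heap, _ram, _cpu, load_1m, _load_15m = fields
        if role == "d":
            result[name] = round(float(load_1m) / cpus, 2)
    return result


for name, ratio in load_per_core(NODES, DATA_NODE_CPUS).items():
    print(f"{name}: {ratio} runnable or blocked tasks per core")
```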

Here are some logs from the point node025 dropped

[2024-02-14T16:46:21,798][INFO ][o.o.c.s.ClusterApplierService] [node011] removed {{node025}{kNqSpeeTT82xsaaHaGifyg}{G0LT9EFoR4q3cJnZWfYyWQ}{10.67.7.45}{10.67.7.45:9300}{d}{shard_indexing_pressure_enabled=true}}, term: 97, version: 255842, reason: Publication{term=97, version=255842}
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Cluster node changed, node removed: true, node added: false
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing         ] [node011] Node removed: [kNqSpeeTT82xsaaHaGifyg]
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.HashRing         ] [node011] Remove data node from AD version hash ring: kNqSpeeTT82xsaaHaGifyg
[2024-02-14T16:46:21,900][INFO ][o.o.a.c.ADClusterEventListener] [node011] Hash ring build result: true
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing         ] [node011] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 4
[2024-02-14T16:46:21,901][INFO ][o.o.a.c.HashRing         ] [node011] Build AD version hash ring successfully
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Detected cluster change event for destination migration
[2024-02-14T16:46:21,901][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [node011] Reset destination migration process.
[2024-02-14T16:46:21,950][INFO ][o.o.c.r.DelayedAllocationService] [node011] scheduling reroute for delayed shards in [28.1s] (2495 delayed shards)
[2024-02-14T16:46:21,950][WARN ][o.o.c.c.C.CoordinatorPublication] [node011] after [30.2s] publication of cluster state version [255842] is still waiting for {rdcenglog44}{7XjWOFBKSrCIgyg5L2HvxQ}{Z9PTkOe4SVu8Hy496RCBlA}{10.67.7.44}{10.67.7.44:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node018}{CY6iIsLoQ5SnEm4Jr_7pFw}{KrKv0QjYS5yqhrjmYPPlBw}{10.67.7.38}{10.67.7.38:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node014}{Ty_PIQciS6C39L18lshPBQ}{-HOORkjoQx28xDxgIPWG8Q}{10.67.7.34}{10.67.7.34:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node017}{jCYIhazARPGUJgfd5LZAFA}{GgWVZ1mcSP2Pkwtwmnkxcw}{10.67.7.37}{10.67.7.37:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node015}{cxLB2bG9RG20QXac-2E3NQ}{F6Xl6YUnTu-zYoh_KsOSWQ}{10.67.7.35}{10.67.7.35:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node029}{JT2hCftTSLywoHJcDQZ8Ow}{TAJgl5wZRaqJkQs5aAddAA}{10.67.7.49}{10.67.7.49:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node016}{4sVp7JQjQkeCMpgQosk3sg}{4Pf08q4oTgOTFh4gs6S6rQ}{10.67.7.36:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT], {node019}{G697cU55RYGcUv396kUoGQ}{Gh_U5CLRQ-2QBa1KOKODUg}{10.67.7.39}{10.67.7.39:9300}{d}{shard_indexing_pressure_enabled=true} [SENT_APPLY_COMMIT]

… followed by lots of the following as loads on the remaining nodes climb and they drop away

[2024-02-14T16:48:28,544][WARN ][o.o.g.G.InternalReplicaShardAllocator] [node011] [nxlog-pf-olms-2023.11.28][0]: failed to list shard for shard_store on node [CY6iIsLoQ5SnEm4Jr_7pFw]
org.opensearch.action.FailedNodeException: Failed node [CY6iIsLoQ5SnEm4Jr_7pFw]
        at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.transport.TransportService$6.handleException(TransportService.java:884) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:379) [opensearch-security-2.9.0.0.jar:2.9.0.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1504) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.transport.TransportService$9.run(TransportService.java:1356) [opensearch-2.9.0.jar:2.9.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.9.0.jar:2.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.NodeDisconnectedException: [node018][10.67.7.38:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected

Ivan.A · May 14, 2024, 12:22pm

Hi, any updates about this?

jaimie.livingston · May 22, 2024, 9:48pm

No updates. No change to cluster response to a dropped node. The cluster will eventually recover, but it takes a long time.

We currently operate on v2.9.0 and plan to upgrade the cluster to 2.15 in July or August, after a thorough evaluation.

~Jaimie

jaimie.livingston · December 2, 2024, 7:44pm

For those interested, the noted issue with the nodes dropping “went away” with an upgrade to v2.17.

I still do not know the underlying cause of the issue, but it’s no longer an active concern.

~Jaimie
