OpenSearch Node attribute change rejoining issue

Versions: OpenSearch 3.2
OS/Environment: AWS EC2 (Self-Managed), EBS Persistent Storage

Issue Description: After enabling Zone Awareness in the cluster configuration, existing nodes fail to rejoin the cluster upon restart.

Configuration Context: I am running a self-managed OpenSearch cluster on AWS EC2 using EBS volumes for persistent storage. Both Data nodes and Master nodes utilise persistent storage. Upon EC2 startup, a setup script mounts the existing EBS volume and dynamically configures opensearch.yml and jvm.options with updated parameters (such as the new private IP) if necessary.
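
The startup script itself is not shown; a minimal sketch of the IP-rewriting step might look like the following (the function name and file path are hypothetical, not the author's actual code):

```shell
#!/bin/sh
# Hypothetical excerpt of the EC2 boot script (assumed, not the author's actual code).
# Rewrites network.publish_host in opensearch.yml with the instance's new private IP.
update_publish_host() {
  conf="$1"
  new_ip="$2"
  sed -i "s/^network\.publish_host:.*/network.publish_host: ${new_ip}/" "$conf"
}

# On a real instance the IP would come from the instance metadata service, e.g.:
#   NEW_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
#   update_publish_host /usr/share/opensearch/config/opensearch.yml "$NEW_IP"
```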

Current opensearch.yml Configuration:

YAML

cluster.name: opensearch-cluster
network.host: 0.0.0.0
plugins.security.ssl.transport.pemcert_filepath: node.pem
plugins.security.ssl.transport.pemkey_filepath: node-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: node.pem
plugins.security.ssl.http.pemkey_filepath: node-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=<xxx>,O=<xxx> Ltd,L=<xxx>,ST=<xxx>,C=<xxx>
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled:
  - all_access
  - security_rest_api_access
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
  - .plugins-ml-agent
  - .plugins-ml-config
  - .plugins-ml-connector
  - .plugins-ml-controller
  - .plugins-ml-model-group
  - .plugins-ml-model
  - .plugins-ml-task
  - .plugins-ml-conversation-meta
  - .plugins-ml-conversation-interactions
  - .plugins-ml-memory-meta
  - .plugins-ml-memory-message
  - .plugins-ml-stop-words
  - .opendistro-alerting-config
  - .opendistro-alerting-alert*
  - .opendistro-anomaly-results*
  - .opendistro-anomaly-detector*
  - .opendistro-anomaly-checkpoints
  - .opendistro-anomaly-detection-state
  - .opendistro-reports-*
  - .opensearch-notifications-*
  - .opensearch-notebooks
  - .opensearch-observability
  - .ql-datasources
  - .opendistro-asynchronous-search-response*
  - .replication-metadata-store
  - .opensearch-knn-models
  - .geospatial-ip2geo-data*
  - .plugins-flow-framework-config
  - .plugins-flow-framework-templates
  - .plugins-flow-framework-state
  - .plugins-search-relevance-experiment
  - .plugins-search-relevance-judgment-cache
node.max_local_storage_nodes: 3
node.name: data-node-b0
network.publish_host: 10.7.138.129
http.port: 9200
discovery.seed_providers: ec2
discovery.ec2.tag.StackName: <xxx>
discovery.ec2.tag.NodeType: master,data,client
discovery.ec2.endpoint: ec2.eu-west-2.amazonaws.com
discovery.seed_hosts:
  - 10.7.137.211
  - 10.7.139.229
  - 10.7.138.180
cluster.initial_cluster_manager_nodes:
  - 10.7.137.211
  - 10.7.139.229
  - 10.7.138.180
path.data: /usr/share/opensearch/data
path.logs: /usr/share/opensearch/logs
bootstrap.memory_lock: true
plugins.security.nodes_dn:
  - CN=<xxx>,OU=<xxx>,O=<xxx>,L=<xxx>,ST=<xxx>,C=<xxx>
plugins.security.ssl_cert_reload_enabled: true
plugins.security.ssl.http.enforce_cert_reload_dn_verification: false
plugins.security.ssl.transport.enforce_cert_reload_dn_verification: false
node.roles:
  - data
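
For reference, the Zone Awareness change itself is not shown above. Based on the `zone=eu-west-2a` / `zone=eu-west-2c` attributes visible in the error log below, it presumably amounted to additions along these lines (the `eu-west-2b` value and the `force.zone` line are assumptions):

```yaml
# Per-node: tag the node with its Availability Zone (value differs per instance)
node.attr.zone: eu-west-2a

# Cluster-wide: allocate shard copies across the zone attribute
cluster.routing.allocation.awareness.attributes: zone
# Optionally force balancing across the known zones
cluster.routing.allocation.awareness.force.zone.values: eu-west-2a,eu-west-2b,eu-west-2c
```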

Scenario & Analysis: Because data is stored on EBS, the node state (including the Node ID) persists across EC2 instance refreshes, even though the specific EC2 private IP address changes. Typically, this process works without issue because the node.name and attributes remain consistent.

However, after updating the configuration to add Zone Awareness, the nodes are throwing errors upon trying to rejoin. I have attempted troubleshooting by both changing the node.name and keeping it identical to the previous state, but the error persists.

Error Log:

Plaintext

[2025-11-27T08:02:15,515][INFO ][o.o.c.c.JoinHelper       ] [<xxx>-data-node-a1-rahul] failed to join {<xxx>-prod-opensearch-master-eu-west-2c-1}{8Z3oAsH0QAyuZE3qF1E6Tw}{v9JsNesHQ5-hSBb5eCjXbQ}{10.7.139.199}{10.7.139.199:9300}{m}{zone=eu-west-2c, shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={<xxx>-data-node-a1-rahul}{Hhf3OOLiRBCpOOWoGy0P5Q}{WKYeyW5jS7Oso6LGCgvQHw}{10.7.137.93}{10.7.137.93:9300}{d}{zone=eu-west-2a, shard_indexing_pressure_enabled=true}, minimumTerm=40, optionalJoin=Optional.empty}
org.opensearch.transport.RemoteTransportException: [<xxx>-prod-opensearch-master-eu-west-2c-1][172.17.0.2:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalArgumentException: can't add node {<xxx>-data-node-a1-rahul}{Hhf3OOLiRBCpOOWoGy0P5Q}{WKYeyW5jS7Oso6LGCgvQHw}{10.7.137.93}{10.7.137.93:9300}{d}{zone=eu-west-2a, shard_indexing_pressure_enabled=true}, found existing node {<xxx>-data-node-a1}{Hhf3OOLiRBCpOOWoGy0P5Q}{5nhvRfh0SkS1jsJfaackZQ}{10.7.137.241}{10.7.137.241:9300}{d}{shard_indexing_pressure_enabled=true} with the same id but is a different node instance
at org.opensearch.cluster.node.DiscoveryNodes$Builder.add(DiscoveryNodes.java:736) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:232) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:197) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.ClusterManagerService.executeTasks(ClusterManagerService.java:890) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.ClusterManagerService.calculateTaskOutputs(ClusterManagerService.java:441) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.ClusterManagerService.runTasks(ClusterManagerService.java:301) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.ClusterManagerService$Batcher.run(ClusterManagerService.java:214) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:206) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:264) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:299) ~[opensearch-3.2.0.jar:3.2.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:262) ~[opensearch-3.2.0.jar:3.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
at java.lang.Thread.run(Thread.java:1447) [?:?]

Hypothesis & Question: It appears OpenSearch is rejecting the join request because the Node ID (which is persisted in the _state directory on EBS) remains the same, but the node attributes (specifically the introduction of the zone attribute) have changed. The Cluster Manager seems to flag this as a conflict where a new instance is trying to claim the ID of an existing node.

How can I resolve this issue so that the node can rejoin with the new Zone Awareness attributes? I need to retain the existing shards on the EBS volume, so a full wipe/reindex is not a viable option.

@Dhruv testing this locally using VMs, you are able to delete one node, copy the data directory to a new one, and start it again with additional attributes and even a new name, retaining the data on that node.

In your setup, are the new nodes being created while the old nodes are still up? Is it all nodes, or master/data only? When an old node goes down, the cluster state on the master should be updated to remove that node, allowing the new node to join, if necessary using the same data directory.

The only way I can reproduce this is starting a new node using the same data directory, while the old node is still active.

Hi @Anthony ,

Thanks for your continued help.
I think you found the issue. To answer your question: Physically, the old node is down, but the Cluster State hasn’t realized it yet.

Since we use EBS, the old EC2 instance must be fully terminated (or the volume detached) before the new instance can attach that volume. However, because we use a "Golden AMI" that launches extremely fast, the new node attempts to join before the cluster has timed out the old node.

So, while the old node isn’t actually active, the master node thinks it is. The replacement is happening faster than the cluster’s failure detection. I’ll try to resolve this by adding a wait time or explicitly forcing the old node out of the cluster state during the boot process.

Hi @Anthony,

I wanted to close the loop on this. We successfully resolved the issue and completed the Zone Awareness migration. It turned out to be a combination of TCP timeouts (the "Ghost Node" issue) and a Cluster Manager logic crash during the mixed-state rollout.

For anyone else facing this on AWS EC2 with EBS, here is the complete solution summary:

1. The ā€œGhost Nodeā€ Issue (TCP Black Hole)

Symptoms: New nodes failed to join with "same id but is a different node instance" because the Master still had an active TCP connection to the terminated instance. Linux default TCP retries kept this alive for ~15 minutes.
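
For background, the ~15-minute figure matches the Linux default `net.ipv4.tcp_retries2 = 15` for established connections. Lowering that retry count is a possible OS-level alternative (or complement) to the application-level fix below; the value here is illustrative, not what the author used:

```
# /etc/sysctl.d/99-opensearch-tcp.conf (illustrative)
# Give up on an unresponsive established connection after ~6 retransmissions
# (roughly 25-30 seconds) instead of the default 15 (~15 minutes)
net.ipv4.tcp_retries2 = 6
```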

The Fix: We forced OpenSearch to use Application-Level pings instead of relying on the OS.

  • Config Change: Added transport.ping_schedule: "5s" to opensearch.yml. This forces the Master to kill dead connections in ~15 seconds instead of 15 minutes.

  • Fault Detection: Tuned cluster.fault_detection.leader_check.interval to 2s and timeouts to 10s for faster reaction.
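
In opensearch.yml, the combined change from this section would look roughly like this (the follower_check lines are an assumption; the post only mentions leader_check, but the analogous settings exist for the Cluster Manager checking its followers):

```yaml
# Application-level keepalive pings: detect dead peers without
# waiting for OS-level TCP retransmission timeouts
transport.ping_schedule: "5s"

# Faster cluster fault detection
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.timeout: 10s
```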

2. Script Logic Update

We found that checking _cat/nodes in our startup script was insufficient because it filters out "failing" nodes.

The Fix: We updated our Python startup script to query _cluster/state/nodes instead. This reveals the "raw truth" of the Master's memory. The script now blocks container startup until the node ID is truly gone from the Cluster State.
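
A minimal sketch of that check (the endpoint is the one named above; function names, the plain-HTTP host, and the lack of authentication are placeholders, not the author's actual script):

```python
import json
import time
import urllib.request


def node_ids_in_cluster_state(state: dict) -> set:
    """Extract the node IDs the cluster manager still holds in its state.

    GET _cluster/state/nodes returns a top-level "nodes" map keyed by node ID.
    """
    return set(state.get("nodes", {}).keys())


def wait_until_node_gone(base_url: str, old_node_id: str, poll_secs: int = 5) -> None:
    """Block startup until old_node_id disappears from the cluster state.

    With the security plugin enabled, a real script would need HTTPS and auth.
    """
    while True:
        with urllib.request.urlopen(f"{base_url}/_cluster/state/nodes") as resp:
            state = json.load(resp)
        if old_node_id not in node_ids_in_cluster_state(state):
            return
        time.sleep(poll_secs)
```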

3. The ā€œNullPointerā€ Crash during Rollout

Symptoms: When we enabled cluster.routing.allocation.awareness.attributes: zone on the Masters, the Cluster Manager crashed with a NullPointerException during the rolling restart.

Root Cause: The Master tried to enforce awareness logic on the Old Data Nodes (which didn’t have the zone attribute yet), resulting in a null comparison crash.

The Fix:

  1. We temporarily forced the setting to an Empty String via API to stop the crashing while keeping the config in the file:

    PUT _cluster/settings
    { "transient": { "cluster.routing.allocation.awareness.attributes": "" } }
    
    
  2. We finished the rolling restart of ALL data nodes (so everyone picked up the node.attr.zone config).

  3. Once the fleet was updated, we re-enabled the setting:

    PUT _cluster/settings
    { "persistent": { "cluster.routing.allocation.awareness.attributes": "zone" } }
    
    

4. Data Safety

To prevent data corruption or the accidental creation of empty data/nodes/1 directories if a lock file lingered on EBS:

The Fix: We explicitly set node.max_local_storage_nodes: 1 in the config.

While Fix 1 alone is actually sufficient if you're in production, Fix 3 gives you the confidence to go ahead with the deployment, and Fixes 2 and 4 are just good to have.

Note: AI was used to format this message.
