Versions: OpenSearch 3.2
OS/Environment: AWS EC2 (self-managed), EBS persistent storage
Issue Description: After enabling zone awareness in the cluster configuration, existing nodes fail to rejoin the cluster on restart.
Configuration Context: I am running a self-managed OpenSearch cluster on AWS EC2 using EBS volumes for persistent storage. Both the data nodes and the cluster-manager (master) nodes use persistent storage. On EC2 startup, a setup script mounts the existing EBS volume and, where necessary, rewrites opensearch.yml and jvm.options with updated parameters (such as the instance's new private IP).
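To make the boot-time flow concrete, here is a minimal sketch of what such a setup script does. The paths, device name, and helper function are assumptions for illustration, not the poster's actual script:

```shell
#!/usr/bin/env bash
# Sketch of the boot-time reconfiguration described above. Paths and the
# device name are assumptions, not the poster's actual script.
set -u

# Rewrite the address-dependent setting in an opensearch.yml in place
# (hypothetical helper; the real script may template the whole file).
update_publish_host() {
  local config="$1" private_ip="$2"
  sed -i "s/^network\.publish_host:.*/network.publish_host: ${private_ip}/" "$config"
}

# On a real instance the script would first mount the persistent volume
# and query IMDSv2 for the new private IP, e.g.:
#   mountpoint -q /usr/share/opensearch/data || mount /dev/nvme1n1 /usr/share/opensearch/data
#   TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
#     -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
#   PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
#     http://169.254.169.254/latest/meta-data/local-ipv4)
#   update_publish_host /etc/opensearch/opensearch.yml "$PRIVATE_IP"
```

Because only address-dependent settings are rewritten, the persisted node state on the EBS volume (including the node ID) survives each instance refresh.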
Current opensearch.yml Configuration:
```yaml
cluster.name: opensearch-cluster
network.host: 0.0.0.0
plugins.security.ssl.transport.pemcert_filepath: node.pem
plugins.security.ssl.transport.pemkey_filepath: node-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: node.pem
plugins.security.ssl.http.pemkey_filepath: node-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=<xxx>,O=<xxx> Ltd,L=<xxx>,ST=<xxx>,C=<xxx>
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled:
  - all_access
  - security_rest_api_access
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices:
  - .plugins-ml-agent
  - .plugins-ml-config
  - .plugins-ml-connector
  - .plugins-ml-controller
  - .plugins-ml-model-group
  - .plugins-ml-model
  - .plugins-ml-task
  - .plugins-ml-conversation-meta
  - .plugins-ml-conversation-interactions
  - .plugins-ml-memory-meta
  - .plugins-ml-memory-message
  - .plugins-ml-stop-words
  - .opendistro-alerting-config
  - .opendistro-alerting-alert*
  - .opendistro-anomaly-results*
  - .opendistro-anomaly-detector*
  - .opendistro-anomaly-checkpoints
  - .opendistro-anomaly-detection-state
  - .opendistro-reports-*
  - .opensearch-notifications-*
  - .opensearch-notebooks
  - .opensearch-observability
  - .ql-datasources
  - .opendistro-asynchronous-search-response*
  - .replication-metadata-store
  - .opensearch-knn-models
  - .geospatial-ip2geo-data*
  - .plugins-flow-framework-config
  - .plugins-flow-framework-templates
  - .plugins-flow-framework-state
  - .plugins-search-relevance-experiment
  - .plugins-search-relevance-judgment-cache
node.max_local_storage_nodes: 3
node.name: data-node-b0
network.publish_host: 10.7.138.129
http.port: 9200
discovery.seed_providers: ec2
discovery.ec2.tag.StackName: <xxx>
discovery.ec2.tag.NodeType: master,data,client
discovery.ec2.endpoint: ec2.eu-west-2.amazonaws.com
discovery.seed_hosts:
  - 10.7.137.211
  - 10.7.139.229
  - 10.7.138.180
cluster.initial_cluster_manager_nodes:
  - 10.7.137.211
  - 10.7.139.229
  - 10.7.138.180
path.data: /usr/share/opensearch/data
path.logs: /usr/share/opensearch/logs
bootstrap.memory_lock: true
plugins.security.nodes_dn:
  - CN=<xxx>,OU=<xxx>,O=<xxx>,L=<xxx>,ST=<xxx>,C=<xxx>
plugins.security.ssl_cert_reload_enabled: true
plugins.security.ssl.http.enforce_cert_reload_dn_verification: false
plugins.security.ssl.transport.enforce_cert_reload_dn_verification: false
node.roles:
  - data
```
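The zone-awareness settings I added are not shown above. They follow the usual shard-allocation-awareness pattern; the attribute name `zone` and the zone values are taken from the log output below (`zone=eu-west-2a`, `zone=eu-west-2c`), while the exact per-node values here are illustrative:

```yaml
# Per-node zone attribute (differs per availability zone; value shown is
# illustrative for a eu-west-2a node).
node.attr.zone: eu-west-2a
# Tell the allocator to balance shard copies across zones.
cluster.routing.allocation.awareness.attributes: zone
```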
Scenario & Analysis: Because the data lives on EBS, the node state (including the node ID) persists across EC2 instance refreshes, even though the instance's private IP address changes. Normally this works without issue because node.name and the node attributes remain consistent.
However, after updating the configuration to add zone awareness, the nodes throw errors when trying to rejoin. I have tried both changing node.name and keeping it identical to the previous state, but the error persists either way.
Error Log:
```
[2025-11-27T08:02:15,515][INFO ][o.o.c.c.JoinHelper ] [<xxx>-data-node-a1-rahul] failed to join {<xxx>-prod-opensearch-master-eu-west-2c-1}{8Z3oAsH0QAyuZE3qF1E6Tw}{v9JsNesHQ5-hSBb5eCjXbQ}{10.7.139.199}{10.7.139.199:9300}{m}{zone=eu-west-2c, shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={<xxx>-data-node-a1-rahul}{Hhf3OOLiRBCpOOWoGy0P5Q}{WKYeyW5jS7Oso6LGCgvQHw}{10.7.137.93}{10.7.137.93:9300}{d}{zone=eu-west-2a, shard_indexing_pressure_enabled=true}, minimumTerm=40, optionalJoin=Optional.empty}
org.opensearch.transport.RemoteTransportException: [<xxx>-prod-opensearch-master-eu-west-2c-1][172.17.0.2:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalArgumentException: can't add node {<xxx>-data-node-a1-rahul}{Hhf3OOLiRBCpOOWoGy0P5Q}{WKYeyW5jS7Oso6LGCgvQHw}{10.7.137.93}{10.7.137.93:9300}{d}{zone=eu-west-2a, shard_indexing_pressure_enabled=true}, found existing node {<xxx>-data-node-a1}{Hhf3OOLiRBCpOOWoGy0P5Q}{5nhvRfh0SkS1jsJfaackZQ}{10.7.137.241}{10.7.137.241:9300}{d}{shard_indexing_pressure_enabled=true} with the same id but is a different node instance
	at org.opensearch.cluster.node.DiscoveryNodes$Builder.add(DiscoveryNodes.java:736) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:232) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:197) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.ClusterManagerService.executeTasks(ClusterManagerService.java:890) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.ClusterManagerService.calculateTaskOutputs(ClusterManagerService.java:441) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.ClusterManagerService.runTasks(ClusterManagerService.java:301) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.ClusterManagerService$Batcher.run(ClusterManagerService.java:214) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:206) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:264) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:299) ~[opensearch-3.2.0.jar:3.2.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:262) ~[opensearch-3.2.0.jar:3.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619) ~[?:?]
	at java.lang.Thread.run(Thread.java:1447) [?:?]
```
Hypothesis & Question: It appears OpenSearch is rejecting the join request because the node ID (persisted in the _state directory on the EBS volume) remains the same, while the restarted process presents as a new node instance: the log shows a different ephemeral ID and address, plus the newly introduced zone attribute. The cluster manager flags this as a conflict in which a new node instance is trying to claim the ID of an existing node that is still present in the cluster state.
How can I resolve this so the node can rejoin with the new zone-awareness attributes? I need to retain the existing shards on the EBS volume, so a full wipe and reindex is not a viable option.