Data Node Not Joining Cluster After Upgrade

jsabatel · September 11, 2024, 12:19am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch 1.2.4, upgrading to OpenSearch 2.16.0

Describe the issue:

Our cluster runs on RHEL 8 Linux nodes. We have been performing a rolling upgrade using the tarball distribution, upgrading one node at a time with the following approach.

Disable shard allocation
PUT /_cluster/settings?pretty
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
Stop OpenSearch on one data node.
Unpack tarball.

tar -xvf opensearch-2.16.0-linux-x64.tar.gz

Re-point the data directory so the new install can access the legacy data.
Re-point ExecStart to the new OpenSearch 2.16.0 directory.
Start OpenSearch on the data node.
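
As a rough illustration of the first step, the sketch below builds the same settings request in Python using only the standard library. The base URL is a placeholder, and authentication and TLS setup are omitted; both are assumptions, not details taken from this cluster. The function only constructs the request and does not send it.

```python
import json
import urllib.request


def allocation_request(base_url: str, mode: str) -> urllib.request.Request:
    """Build a PUT /_cluster/settings request that sets the shard allocation mode."""
    body = {"persistent": {"cluster.routing.allocation.enable": mode}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Before stopping a node: restrict allocation to primaries, as in the first step.
# The host and port are placeholders for illustration.
req = allocation_request("https://master1.com:9200", "primaries")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would mean passing `req` to `urllib.request.urlopen` with a suitable SSL context and credentials. The same helper with mode `"all"` could re-enable allocation once the node has rejoined; that step is not part of the list above.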

After completing these steps, we get the errors shown in the Relevant Logs section below.

The cluster is unable to find the node, and the node does not rejoin the cluster.

However, from the node's perspective, it can reach the cluster just fine:

curl https://datanode.com:9200/_cat/indices returns all the indices just fine
curl https://datanode.com:9200/_cat/health returns cluster health
Running the securityadmin.sh tool, we are able to pull from the cluster and can even make changes to the cluster.
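
These checks all query the upgraded node itself. A complementary check is whether the rest of the cluster lists the node, for example by requesting `_cat/nodes?h=name,version` from a master node. The sketch below parses that kind of two-column text output; the sample text is invented for illustration and is not output from this cluster.

```python
def parse_cat_nodes(text: str) -> dict[str, str]:
    """Map node name -> version from `_cat/nodes?h=name,version` text output."""
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            name, version = parts
            nodes[name] = version
    return nodes


# Hypothetical sample output, not taken from the real cluster.
sample = """\
master1.com 1.2.4
master2.com 1.2.4
data2.com   1.2.4
"""

nodes = parse_cat_nodes(sample)
# Is the upgraded node listed by the cluster?
print("data1.com" in nodes)
```

If the upgraded node is missing from the list returned by a master node, that would match the symptom described here even though the node answers HTTP requests on its own port.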

I'm scratching my head as to what the issue is, since from the data node's perspective everything seems to be working just fine. Please let me know your thoughts!

Configuration:

15-node cluster (3 master nodes, 12 data nodes)

---
bootstrap.memory_lock: "true"
cluster.initial_master_nodes:
- "master1.com"
- "master2.com"
- "master3.com"
cluster.name: "mycluster"
cluster.routing.allocation.awareness.attributes: "allocationzone"
discovery.seed_hosts:
- "master1.com"
- "master2.com"
- "master3.com"
network.host: "0.0.0.0"
node.attr.allocationzone: "zone04"
node.ingest: "false"
node.master: "false"
node.name: "data1.com"
plugins.security.allow_default_init_securityindex: "true"
#plugins.security.audit.enable_rest: "false"
#plugins.security.audit.type: "internal_elasticsearch"
plugins.security.authcz.admin_dn:
- "xxxx"
plugins.security.nodes_dn:
- "CN=master1.com"
- "CN=master2.com"
- "CN=master3.com"
- "CN=data1.com"
- "CN=data1.com"
- "CN=data2.com"
- "CN=data3.com"
- "CN=data4.com"
- "CN=data5.com"
- "CN=data6.com"
- "CN=data7.com"
- "CN=data8.com"
- "CN=data9.com"
- "CN=data10.com"
- "CN=data11.com"
plugins.security.restapi.roles_enabled:
- "all_access"
- "security_rest_api_access"
plugins.security.ssl.http.enabled: "true"
plugins.security.ssl.http.enabled_protocols:
- "TLSv1.2"
plugins.security.ssl.http.pemcert_filepath: "data1.pem"
plugins.security.ssl.http.pemkey_filepath: "data1.key"
plugins.security.ssl.http.pemtrustedcas_filepath: "ca.pem"
plugins.security.ssl.transport.enabled_protocols:
- "TLSv1.2"
plugins.security.ssl.transport.enforce_hostname_verification: "false"
plugins.security.ssl.transport.pemcert_filepath: "data1.pem"
plugins.security.ssl.transport.pemkey_filepath: "data1.key"
plugins.security.ssl.transport.pemtrustedcas_filepath: "ca.pem"
plugins.security.ssl.transport.truststore_filepath: "cacerts"
path.data: "/apps/opensearch/data"
path.logs: "/apps/opensearch/logs"
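
One thing that can be checked mechanically in a config like this is the `plugins.security.nodes_dn` list. The sketch below counts the entries exactly as posted above and reports any that repeat. Since the hostnames here are anonymized, a repeat may simply be an artifact of redaction and not a real misconfiguration.

```python
from collections import Counter

# nodes_dn entries copied from the configuration above (anonymized names).
nodes_dn = [
    "CN=master1.com", "CN=master2.com", "CN=master3.com",
    "CN=data1.com", "CN=data1.com", "CN=data2.com", "CN=data3.com",
    "CN=data4.com", "CN=data5.com", "CN=data6.com", "CN=data7.com",
    "CN=data8.com", "CN=data9.com", "CN=data10.com", "CN=data11.com",
]

counts = Counter(nodes_dn)
duplicates = [dn for dn, n in counts.items() if n > 1]

print("entries:", len(nodes_dn), "unique:", len(counts))
print("duplicates:", duplicates)
```

With 15 nodes in the cluster, 15 unique entries would be expected if every node is listed individually; fewer unique entries than nodes suggests that one node's DN may be missing from the list.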

Relevant Logs or Screenshots:

[2024-09-09T23:15:09,596][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [xxxx.com] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]


        at java.lang.Thread.run(Thread.java:829) ~[?:?]
[2024-09-09T22:49:30,949][INFO ][o.o.c.s.ClusterApplierService] [xxxx.com] cluster-manager node changed {previous [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}], current []}, term: 297, version: 2878506, reason: becoming candidate: onLeaderFailure
[2024-09-09T22:49:30,964][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [xxxxx.com] Cancelling the migration process.
[2024-09-09T22:49:31,438][INFO ][o.o.c.s.ClusterApplierService] [xxxxx.com] cluster-manager node changed {previous [], current [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}]}, term: 297, version: 2878508, reason: ApplyCommitRequest{term=297, version=2878508, sourceNode={xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}}
[2024-09-09T22:49:31,464][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [xxxxx.com] Cancelling the migration process.
[2024-09-09T22:49:31,523][INFO ][o.o.d.PeerFinder         ] [xxxxx.com] setting findPeersInterval to [1s] as node commission status = [true] for local node [{xxxxx.com}{cSN5Fz4rQVujQnKOtwj79w}{Rtarzwn9Rz6T9UNdutt99g}{x.x.x.x}{x.x.x.x:9300}{dr}{allocationzone=zone04, shard_indexing_pressure_enabled=true}]
[2024-09-09T22:49:34,420][INFO ][o.o.c.c.Coordinator      ] [xxxxx.com] cluster-manager node [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
        at org.opensearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:329) ~[opensearch-2.16.0.jar:2.16.0]
