Data Node Not Joining Cluster After Upgrade

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch 1.2.4 upgrading to OpenSearch 2.16.0

Describe the issue:

Our cluster runs on RHEL 8 Linux nodes. We are performing a rolling upgrade using the tarball distribution, upgrading one node at a time with the following approach:

  1. Disable shard allocation:
    PUT /_cluster/settings?pretty
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

  2. Stop OpenSearch on one data node.

  3. Unpack the tarball:
    tar -xvf opensearch-2.16.0-linux-x64.tar.gz

  4. Re-point the directory to access the legacy data.

  5. Re-point ExecStart to the new OpenSearch 2.16.0 directory.

  6. Start OpenSearch on the data node.
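The per-node procedure above can be sketched as a script. Hostnames, paths, and the service name are placeholders based on this post (the `datanode.com` hostname and `/apps/opensearch` path are assumptions), so adjust them to your environment:

```shell
#!/bin/sh
# Rolling-upgrade sketch for one data node (hypothetical hostnames/paths).
NODE="datanode.com"
TARBALL="opensearch-2.16.0-linux-x64.tar.gz"
INSTALL_DIR="/apps/opensearch"

# Step 1: disable shard allocation (primaries only) before stopping the node.
disable_allocation() {
  curl -s -X PUT "https://${NODE}:9200/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}'
}

# Steps 2-3: stop the service and unpack 2.16.0 alongside the old install.
stop_and_unpack() {
  sudo systemctl stop opensearch
  tar -xvf "${TARBALL}" -C "${INSTALL_DIR}"
}

# Step 6: after re-pointing the data directory and ExecStart, start the node.
start_node() {
  sudo systemctl start opensearch
}
```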

After completing these steps, we get the errors shown in the Relevant Logs section below: the cluster is unable to discover the node and join it back into the cluster.

However, from the node's perspective, it can reach the cluster just fine:

curl https://datanode.com:9200/_cat/indices returns all the indices just fine.
curl https://datanode.com:9200/_cat/health returns cluster health.
Running the securityadmin.sh tool is able to pull from the cluster, and we can even make changes to the cluster.
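To make the comparison explicit, here is a diagnostic sketch (hostnames are hypothetical placeholders for this cluster). A data node answering REST calls on 9200 only proves the HTTP layer works; cluster membership happens over the transport port (9300), so it is worth comparing the node list each side reports and checking that 9300 is reachable:

```shell
#!/bin/sh
# Compare cluster membership as seen from the upgraded data node vs. a
# cluster-manager node (hypothetical hostnames -- substitute your own).
DATA_NODE="datanode.com"
MASTER="master1.com"

compare_node_lists() {
  echo "--- view from the upgraded data node ---"
  curl -s "https://${DATA_NODE}:9200/_cat/nodes?v"
  echo "--- view from a cluster-manager node ---"
  curl -s "https://${MASTER}:9200/_cat/nodes?v"
}

# Membership uses the transport port, not the HTTPS REST port.
check_transport_port() {
  nc -vz "${DATA_NODE}" 9300
}
```

If the upgraded node appears in its own `_cat/nodes` output but not in the cluster-manager's, that points at a transport-layer problem rather than an HTTP/security one.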

I'm scratching my head over this, since from the data node's perspective everything seems to be working just fine. Please let me know your thoughts!

Configuration:

15 node cluster (3 master nodes, 12 data nodes)

---
bootstrap.memory_lock: "true"
cluster.initial_master_nodes:
- "master1.com"
- "master2.com"
- "master3.com"
cluster.name: "mycluster"
cluster.routing.allocation.awareness.attributes: "allocationzone"
discovery.seed_hosts:
- "master1.com"
- "master2.com"
- "master3.com"
network.host: "0.0.0.0"
node.attr.allocationzone: "zone04"
node.ingest: "false"
node.master: "false"
node.name: "data1.com"
plugins.security.allow_default_init_securityindex: "true"
#plugins.security.audit.enable_rest: "false"
#plugins.security.audit.type: "internal_elasticsearch"
plugins.security.authcz.admin_dn:
- "xxxx"
plugins.security.nodes_dn:
- "CN=master1.com"
- "CN=master2.com"
- "CN=master3.com"
- "CN=data1.com"
- "CN=data2.com"
- "CN=data3.com"
- "CN=data4.com"
- "CN=data5.com"
- "CN=data6.com"
- "CN=data7.com"
- "CN=data8.com"
- "CN=data9.com"
- "CN=data10.com"
- "CN=data11.com"
plugins.security.restapi.roles_enabled:
- "all_access"
- "security_rest_api_access"
plugins.security.ssl.http.enabled: "true"
plugins.security.ssl.http.enabled_protocols:
- "TLSv1.2"
plugins.security.ssl.http.pemcert_filepath: "data1.pem"
plugins.security.ssl.http.pemkey_filepath: "data1.key"
plugins.security.ssl.http.pemtrustedcas_filepath: "ca.pem"
plugins.security.ssl.transport.enabled_protocols:
- "TLSv1.2"
plugins.security.ssl.transport.enforce_hostname_verification: "false"
plugins.security.ssl.transport.pemcert_filepath: "data1.pem"
plugins.security.ssl.transport.pemkey_filepath: "data1.key"
plugins.security.ssl.transport.pemtrustedcas_filepath: "ca.pem"
plugins.security.ssl.transport.truststore_filepath: "cacerts"
path.data: "/apps/opensearch/data"
path.logs: "/apps/opensearch/logs"
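One thing worth noting while reviewing the config: if I recall correctly, OpenSearch 2.x deprecates the legacy `node.master` / `node.ingest` booleans in favor of a single `node.roles` list, so the 2.16.0-style equivalent of the role settings above would look roughly like the fragment below (a sketch, not a drop-in replacement):

```
# OpenSearch 2.x style: a data-only node declares its roles explicitly
# instead of negating node.master / node.ingest.
node.roles:
  - "data"
```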

Relevant Logs or Screenshots:

[2024-09-09T23:15:09,596][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [xxxx.com] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: ClusterManagerNotDiscoveredException[null]


        at java.lang.Thread.run(Thread.java:829) ~[?:?]
[2024-09-09T22:49:30,949][INFO ][o.o.c.s.ClusterApplierService] [xxxx.com] cluster-manager node changed {previous [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}], current []}, term: 297, version: 2878506, reason: becoming candidate: onLeaderFailure
[2024-09-09T22:49:30,964][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [xxxxx.com] Cancelling the migration process.
[2024-09-09T22:49:31,438][INFO ][o.o.c.s.ClusterApplierService] [xxxxx.com] cluster-manager node changed {previous [], current [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}]}, term: 297, version: 2878508, reason: ApplyCommitRequest{term=297, version=2878508, sourceNode={xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}}
[2024-09-09T22:49:31,464][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [xxxxx.com] Cancelling the migration process.
[2024-09-09T22:49:31,523][INFO ][o.o.d.PeerFinder         ] [xxxxx.com] setting findPeersInterval to [1s] as node commission status = [true] for local node [{xxxxx.com}{cSN5Fz4rQVujQnKOtwj79w}{Rtarzwn9Rz6T9UNdutt99g}{x.x.x.x}{x.x.x.x:9300}{dr}{allocationzone=zone04, shard_indexing_pressure_enabled=true}]
[2024-09-09T22:49:34,420][INFO ][o.o.c.c.Coordinator      ] [xxxxx.com] cluster-manager node [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{xxxxx.com}{7BBs5k34QOGEfIEaOKXr7A}{zhKdwtAlQaOXOAm9mDrf0w}{x.x.x.x}{x.x.x.x:9300}{imr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
        at org.opensearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:329) ~[opensearch-2.16.0.jar:2.16.0]