Node comes back up with stale metadata

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.14.0
Server OS: ubuntu 22.04

The deployment is composed of 3 nodes that run both the "cluster_manager" and "data" roles.

Describe the issue:
I have a CI run that scales the cluster down to 0 units, then reuses one of the storage devices to recreate a node. The goal is to be able to test recovery of the cluster, node by node, using the existing storage devices.

As this is a CI run, we are not concerned about backing up / restoring the cluster per se, but about understanding why this scenario works in some runs and not in others, and how to make it work consistently.

The storage devices only contain the actual OpenSearch data, not the configuration files of the original cluster.

Now, on some of our runs, we fail to bring this new, 4th node back up. Whenever that happens, the node keeps trying to discover a non-existent peer. It seems it recovered stale metadata and cannot decide to take over as cluster manager:

Jul 09 18:18:14 juju-5134d3-7 opensearch.daemon[17913]: [2024-07-09T18:18:14,670][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-6.4cd] cluster-manager not discovered or elected yet, an election requires a node with id [um1xNA9gTAuGdbum9UOXuA], have discovered [{opensearch-6.4cd}{_peiFFBbQcSXOyH53Pk4zg}{r_17iBF6QJe8yytAWbnFrg}{}{}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch}] which is not a quorum; discovery will continue using [,,,,,, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305,] from hosts providers and [{opensearch-6.4cd}{_peiFFBbQcSXOyH53Pk4zg}{r_17iBF6QJe8yytAWbnFrg}{}{}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch}] from last-known cluster state; node term 8, last-accepted version 180 in term 8
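The log above suggests the node restored an on-disk cluster state whose voting configuration still requires the old node id `um1xNA9gTAuGdbum9UOXuA`, which no longer exists. If such a node has to be recovered in place, the `opensearch-node` CLI can discard that persisted state. A sketch only, with the binary path assumed from the snap layout shown below; both subcommands are destructive, last-resort tools:

```shell
# Stop the service before touching on-disk cluster state
sudo snap stop opensearch

# Option A: detach the node from its previous cluster,
# so it can join (or bootstrap) a fresh one
sudo /var/snap/opensearch/current/usr/share/opensearch/bin/opensearch-node detach-cluster

# Option B: force this node to bootstrap a new one-node voting
# configuration, electing itself cluster manager despite the
# stale quorum requirement
sudo /var/snap/opensearch/current/usr/share/opensearch/bin/opensearch-node unsafe-bootstrap

sudo snap start opensearch
```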

We also tested setting:


and removing the nodes/0/_state/*.st files.
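For reference, the state files mentioned above live under the data path; roughly (paths per the snap layout of this deployment):

```shell
# Persisted cluster/node metadata, i.e. what gets recovered
# as "stale" when the node starts on a reused storage device
ls /var/snap/opensearch/common/var/lib/opensearch/nodes/0/_state/
```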

The newly deployed 4th node's configs are:

```yaml
path.data: /var/snap/opensearch/common/var/lib/opensearch
path.logs: /var/snap/opensearch/common/var/log/opensearch
path.home: /var/snap/opensearch/current/usr/share/opensearch
prometheus.metric_name.prefix: opensearch_
prometheus.indices: 'false'
prometheus.cluster.settings: 'false'
prometheus.nodes.filter: _local
- O=opensearch-xnxz,CN=admin PKCS12 certificates/unit-transport.p12 PKCS12 certificates/ca.p12 unit-transport ca ... ... PKCS12 certificates/unit-http.p12 PKCS12 certificates/ca.p12 unit-http ca ... ... OPTIONAL opensearch-xnxz opensearch-6.4cd
- _site_
- juju-5134d3-7
node.roles:
  - data
  - ingest
  - ml
  - coordinating_only
  - cluster_manager
node.attr.app_id: 698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch
discovery.seed_providers: file
- opensearch-6.4cd false true true
- all_access
- security_rest_api_access true
```

The seed_providers is set to `file`, which uses unicast_hosts.txt:

That is the internal IP of the 4th node, as expected.
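With `discovery.seed_providers: file`, the file-based hosts provider reads unicast_hosts.txt from the config directory and re-reads it on change. A minimal example of the file's shape, with a hypothetical address standing in for the real one:

```
# $OPENSEARCH_PATH_CONF/unicast_hosts.txt, one transport address per line
10.0.0.7:9300
```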

Relevant Logs or Screenshots: