Node comes back up with stale metadata

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.14.0
Server OS: Ubuntu 22.04

The deployment is composed of 3 nodes that each run both the “cluster_manager” and “data” roles.

Describe the issue:
I have a CI run that scales the cluster down to 0 units and then reuses one of the storage devices to try to recreate one node. The goal is to test recovering the cluster, node-by-node, using the existing storage devices.

As this is a CI run, we are not concerned with backing up / restoring the cluster per se; we want to understand why this scenario works in some runs and not in others, and how to make it work consistently.

The storage devices only contain the actual OpenSearch data, not the configuration files of the original cluster.
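
To make the flow concrete, it is roughly the following (a minimal sketch only; the application name, unit numbers and storage id are illustrative, not the exact commands from our pipeline):

# scale down to 0 units; the storage is detached, not destroyed
juju remove-unit opensearch/0 opensearch/1 opensearch/2
# list the detached storage and pick one device to reuse
juju storage
# recreate a single node on top of the old data device
juju add-unit opensearch --attach-storage opensearch-data/0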

Now, on some of our runs, we fail to bring this new, 4th node back up. Whenever that happens, we can see the node keeps searching for a peer that no longer exists. It seems it recovered stale cluster metadata and cannot decide to take over as the cluster manager:

Jul 09 18:18:14 juju-5134d3-7 opensearch.daemon[17913]: [2024-07-09T18:18:14,670][WARN ][o.o.c.c.ClusterFormationFailureHelper] [opensearch-6.4cd] cluster-manager not discovered or elected yet, an election requires a node with id [um1xNA9gTAuGdbum9UOXuA], have discovered [{opensearch-6.4cd}{_peiFFBbQcSXOyH53Pk4zg}{r_17iBF6QJe8yytAWbnFrg}{10.115.236.28}{10.115.236.28:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch}] which is not a quorum; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 127.0.0.1:9300] from hosts providers and [{opensearch-6.4cd}{_peiFFBbQcSXOyH53Pk4zg}{r_17iBF6QJe8yytAWbnFrg}{10.115.236.28}{10.115.236.28:9300}{coordinating_onlydimml}{shard_indexing_pressure_enabled=true, app_id=698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch}] from last-known cluster state; node term 8, last-accepted version 180 in term 8

We also tested setting:

discovery.type=single-node

and removing the nodes/0/_state/*.st files.
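
We are also aware of the bundled opensearch-node tool, which, as we understand it, rewrites the on-disk cluster metadata in a more controlled way than deleting the *.st files by hand. A sketch of what we have in mind (run with the node stopped; the path assumes the snap layout from path.home in the config below, and we have not validated this in the CI):

# drop the stored cluster UUID / voting configuration so the node can form or join a fresh cluster
/var/snap/opensearch/current/usr/share/opensearch/bin/opensearch-node detach-cluster
# or: force this node to win an election from its retained metadata
/var/snap/opensearch/current/usr/share/opensearch/bin/opensearch-node unsafe-bootstrap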

Configuration:
The newly deployed 4th node's configuration is:

path.data: /var/snap/opensearch/common/var/lib/opensearch
path.logs: /var/snap/opensearch/common/var/log/opensearch
path.home: /var/snap/opensearch/current/usr/share/opensearch
prometheus.metric_name.prefix: opensearch_
prometheus.indices: 'false'
prometheus.cluster.settings: 'false'
prometheus.nodes.filter: _local
plugins.security.authcz.admin_dn:
- O=opensearch-xnxz,CN=admin
plugins.security.ssl.transport.keystore_type: PKCS12
plugins.security.ssl.transport.keystore_filepath: certificates/unit-transport.p12
plugins.security.ssl.transport.truststore_type: PKCS12
plugins.security.ssl.transport.truststore_filepath: certificates/ca.p12
plugins.security.ssl.transport.keystore_alias: unit-transport
plugins.security.ssl.transport.truststore_alias: ca
plugins.security.ssl.transport.keystore_password: ...
plugins.security.ssl.transport.truststore_password: ...
plugins.security.ssl.http.keystore_type: PKCS12
plugins.security.ssl.http.keystore_filepath: certificates/unit-http.p12
plugins.security.ssl.http.truststore_type: PKCS12
plugins.security.ssl.http.truststore_filepath: certificates/ca.p12
plugins.security.ssl.http.keystore_alias: unit-http
plugins.security.ssl.http.truststore_alias: ca
plugins.security.ssl.http.keystore_password: ...
plugins.security.ssl.http.truststore_password: ...
plugins.security.ssl.http.clientauth_mode: OPTIONAL
cluster.name: opensearch-xnxz
node.name: opensearch-6.4cd
network.host:
- _site_
- juju-5134d3-7
- 10.115.236.28
network.publish_host: 10.115.236.28
node.roles:
- data
- ingest
- ml
- coordinating_only
- cluster_manager
node.attr.app_id: 698cda97-fed7-4bc6-81d7-8471e15134d3/opensearch
discovery.seed_providers: file
cluster.initial_cluster_manager_nodes:
- opensearch-6.4cd
plugins.security.disabled: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.transport.enforce_hostname_verification: true
plugins.security.restapi.roles_enabled:
- all_access
- security_rest_api_access
plugins.security.unsupported.restapi.allow_securityconfig_modification: true

discovery.seed_providers is set to file, which uses a unicast_hosts.txt containing:

10.115.236.28

That is the internal IP of the 4th node, as expected.
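
For reference, that is the standard file-based seed hosts provider: a unicast_hosts.txt in the OpenSearch config directory ($OPENSEARCH_PATH_CONF), one transport address per line, with the port defaulting to 9300 when omitted, i.e. something like:

$OPENSEARCH_PATH_CONF/unicast_hosts.txt:
10.115.236.28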

Relevant Logs or Screenshots:
https://pastebin.ubuntu.com/p/j9Y6H7JTH3/