Recovering cluster from no-primary state

patelsmit32123 · March 25, 2025, 3:55am

Hey Team,

We have a cluster with 3 master nodes [A (primary), B, C]. If A goes down, B becomes primary. If B then also goes down, only C remains, and C cannot elect itself as primary because there is no quorum. Adding a new node D does not help, since C is still looking only for A and B, probably because D cannot be added to the cluster while there is no primary. We also tried re-bootstrapping the cluster by adding the necessary seed host and initial_master_nodes configs, but C is still looking for A and B, as the log below shows (although the log also shows that it is able to discover D).

I think an already-formed cluster cannot be re-bootstrapped, because the bootstrap config is simply ignored once lastAcceptedConfiguration is persisted. We also cannot change the voting configuration, since no primary master is available. How do we recover the cluster and its data from such a situation?

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/or[…]nsearch/cluster/coordination/ClusterFormationFailureHelper.java

```
{"level":"WARN","timestamp":"2025-03-24T10:06:13,699","thread":"opensearch[esdock-opensearch-resiliency-drill-staging-cluster-phx-tolov-funar][generic][T#2]","file":"ClusterFormationFailureHelper.java", "line":"132","message":"[esdock-opensearch-resiliency-drill-staging-cluster-phx-tolov-funar] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [2Le6_capSumt4ZPNq-sN0w, -cYw6Gr1TzuUaHdYxRxbFA, ZOLTmENMQkmd3c2b0ReufA], have discovered [{esdock-opensearch-resiliency-drill-staging-cluster-phx-tolov-funar}{ZOLTmENMQkmd3c2b0ReufA}{UCqAUJ1cREGw2A0dcJpHXA}{10.157.69.105}{10.157.69.105:25835}{m}{rack=9dfae5c253a82a43d71ed4f16ac58245013d3652733635e1812c585d1ccfc708, zone=phx51, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}, {esdock-opensearch-resiliency-drill-staging-cluster-phx-pugog-taros}{EJA6w0yFQ0a_cRP8xgyt-Q}{gtcalnj-QsaxScS9TpqP7g}{10.156.57.39}{10.156.57.39:27357}{m}{rack=f044798c2200130dca303c6fdd62fff1fc6c3aabc1a02923b8d75baf7972a7fd, zone=phx50, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-0}] which is not a quorum; discovery will continue using [10.76.227.18:29299, 10.156.57.39:27357] from hosts providers and [{esdock-opensearch-resiliency-drill-staging-cluster-phx-tolov-funar}{ZOLTmENMQkmd3c2b0ReufA}{UCqAUJ1cREGw2A0dcJpHXA}{10.157.69.105}{10.157.69.105:25835}{m}{rack=9dfae5c253a82a43d71ed4f16ac58245013d3652733635e1812c585d1ccfc708, zone=phx51, shard_indexing_pressure_enabled=true, isolation-group=isolation-group-2}] from last-known cluster state; node term 8, last-accepted version 1881 in term 8"}
```
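
To make the "not a quorum" part of that message concrete, here is a rough sketch of the majority check it describes. This is a simplified model, not the actual OpenSearch coordination code: the helper names `required_votes` and `has_quorum` are made up for illustration, and the real logic involves more state than a single set of IDs. The node IDs are copied from the log.

```python
def required_votes(voting_ids):
    """Smallest number of voting nodes that forms a strict majority."""
    return len(set(voting_ids)) // 2 + 1


def has_quorum(voting_ids, discovered_ids):
    """True if a strict majority of the voting configuration has been discovered.

    Nodes outside the voting configuration (such as a newly added node)
    do not count towards the quorum, however many of them are discovered.
    """
    votes = len(set(voting_ids) & set(discovered_ids))
    return votes >= required_votes(voting_ids)


# Voting configuration persisted on C: the IDs listed after "with ids from".
voting_config = {
    "2Le6_capSumt4ZPNq-sN0w",
    "-cYw6Gr1TzuUaHdYxRxbFA",
    "ZOLTmENMQkmd3c2b0ReufA",  # C, the only surviving original node
}

# Nodes listed after "have discovered": C itself and the new node D.
discovered = {
    "ZOLTmENMQkmd3c2b0ReufA",  # C
    "EJA6w0yFQ0a_cRP8xgyt-Q",  # D, not part of the voting configuration
}

print(required_votes(voting_config))          # 2
print(has_quorum(voting_config, discovered))  # False
```

Under this model only C counts as a vote, so discovering D changes nothing: the election still needs 2 of the 3 original IDs.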

pablo · April 24, 2025, 6:23pm

@patelsmit32123 Did you resolve your issue?
