We are facing frequent node disconnects with the error message "master not discovered yet: have discovered "

We are setting up OpenSearch on bare-metal hosts.

We are facing frequent node disconnects with the error message "master not discovered yet: have discovered "

Below are the exceptions we are seeing in the logs.

#1
[2022-09-22T14:15:26,336][WARN ][o.o.c.c.ClusterFormationFailureHelper] [data1] master not discovered yet: have discovered []; discovery will continue using [] from hosts providers and [**] from last-known cluster state; node term 6, last-accepted version 51 in term 6

#2
[2022-09-19T14:09:45,820][DEBUG][o.o.c.c.LeaderChecker ] [] 1 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader []
org.opensearch.transport.RemoteTransportException: [][internal:coordination/fault_detection/leader_check]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [] has been removed from the cluster

#3
[2022-09-19T14:09:46,825][DEBUG][o.o.c.c.LeaderChecker ] [] 2 consecutive failures (limit [cluster.fault_detection.leader_check.retry_count] is 3) with leader []
org.opensearch.transport.RemoteTransportException: [][internal:coordination/fault_detection/leader_check]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [] has been removed from the cluster

#4
[2022-09-19T13:29:00,783][WARN ][o.o.c.c.JoinHelper ] [] last failed join attempt was 5.8s ago, failed to join {} with JoinRequest{sourceNode=, minimumTerm=24, optionalJoin=Optional[Join{term=24, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={shard_indexing_pressure_enabled=true}, targetNode={}
org.opensearch.transport.RemoteTransportException: [][internal:cluster/coordination/join]
Caused by: org.opensearch.transport.ConnectTransportException: [**] general node connection failure

Thanks,
Mukul

Oh boy, there is a lot to unpack here. Can you share your opensearch.yml configuration?
Also note that when OpenSearch runs as a service in this type of cluster, any configuration change (whatever it is) has to be applied to the configuration file on every master node.

cluster.name: OpensearchCluster
network.host: <hostname-1-dns>

http.port: 9200
transport.port: 9300
node.name: master1
discovery.seed_hosts:
   - <hostname-1-dns>
   - <hostname-2-dns>
   - <hostname-3-dns>

cluster.initial_master_nodes:
   - master1
   - master2
   - master3

#plugins.security.disabled: true
plugins.security.ssl.http.enabled: true
bootstrap.memory_lock: true

plugins.security.ssl.transport.keystore_filepath: keystore.jks
plugins.security.ssl.transport.keystore_password: ***
plugins.security.ssl.transport.truststore_filepath: truststore.jks
plugins.security.ssl.transport.truststore_password: ***
plugins.security.ssl.transport.truststore_type: jks
plugins.security.ssl.transport.keystore_type: jks
plugins.security.ssl.http.keystore_filepath: keystore.jks
plugins.security.ssl.http.keystore_password: ***
plugins.security.ssl.http.truststore_filepath: truststore.jks
plugins.security.ssl.http.truststore_password: ***
plugins.security.ssl.http.keystore_type: jks
plugins.security.ssl.http.truststore_type: jks

plugins.security.ssl.transport.keystore_alias: #CN of keystore
plugins.security.ssl.transport.truststore_alias: #CN of truststore

plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]

plugins.security.authcz.admin_dn: #admin details
   - ''

plugins.security.nodes_dn: #List all nodes detail. 
   - ''

This is the opensearch.yml of one of the master nodes in the cluster. The problem arises when we try to add a data node to the cluster.

Hi @Mukul, We are encountering a similar issue. Have you been able to find a solution? If so, could you please share what worked for you?

Hey @datapal, I was off the grid in the OpenSearch community for quite some time.
By any chance, do you have every node’s IP or DNS name in the discovery.seed_hosts section? (It has to be copied into the opensearch.yml of every host; see "Creating a cluster" in the OpenSearch documentation.)
The problem with running as a service is that every master node (in my case I did it for every node) has to have the IPs or domain names of the nodes it needs to form a cluster with.
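
For illustration, here is a rough sketch of what the discovery part of a data node's opensearch.yml could look like. It reuses the placeholder hostnames and cluster name from the configuration posted above. The node name data1 is taken from the log output, while <data-node-dns> and the node.roles value are assumptions for the example, not the poster's actual settings.

```yaml
# opensearch.yml on the data node (sketch only, adjust to your environment)
cluster.name: OpensearchCluster      # must match the cluster name used on the master nodes
node.name: data1                     # name seen in the log output above
node.roles: [ data ]                 # assumption: this node should act as a data node only
network.host: <data-node-dns>        # hypothetical placeholder for this node's own hostname
transport.port: 9300

# Same list of master nodes as in the master node configuration,
# so this node knows where to look for the cluster.
discovery.seed_hosts:
   - <hostname-1-dns>
   - <hostname-2-dns>
   - <hostname-3-dns>

# cluster.initial_master_nodes is only needed when bootstrapping a brand-new cluster,
# so it is left out on a node that joins an existing one.
```

The security (TLS) settings are omitted here and would still be needed on the data node, as on the master nodes.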

I have provided the service names in the discovery.seed_hosts section. In my case, the issue seems to be with the Docker network, as I am setting up the cluster on Swarm.

Did you add all the data nodes to nodes_dn?
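
As a sketch of what that means: plugins.security.nodes_dn lists the certificate DNs of the nodes that are allowed to join over the transport layer, so the data node certificates have to be covered there on every node. The DN values below are made-up examples and have to be replaced with the subject DNs of your own certificates.

```yaml
# opensearch.yml (sketch only; all DNs below are hypothetical examples)
plugins.security.nodes_dn:
   - 'CN=master1.example.com,OU=Ops,O=Example,C=US'
   - 'CN=master2.example.com,OU=Ops,O=Example,C=US'
   - 'CN=master3.example.com,OU=Ops,O=Example,C=US'
   - 'CN=data1.example.com,OU=Ops,O=Example,C=US'   # the data node must be listed as well
```

If the DN of the data node's certificate is missing from this list, that could explain why the join attempt is rejected, so it is worth comparing the list against the actual certificate subject.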