Nodes fall out of the cluster (ES 7.9.1)

Hello, we have the following problem: nodes periodically drop out of the cluster.

[2020-11-19T09:00:37,415][INFO ][o.e.c.s.MasterService    ] [h1-es03] node-left[{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr} reason: followers check retry count exceeded], term: 87, version: 236727, delta: removed {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}
[2020-11-19T09:00:39,763][INFO ][o.e.c.s.ClusterApplierService] [h1-es03] removed {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}, term: 87, version: 236727, reason: Publication{term=87, version=236727}
[2020-11-19T09:00:47,890][INFO ][o.e.c.s.MasterService    ] [h1-es03] node-join[{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr} join existing leader], term: 87, version: 236730, delta: added {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}
[2020-11-19T09:00:52,713][INFO ][o.e.c.s.ClusterApplierService] [h1-es03] added {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}, term: 87, version: 236730, reason: Publication{term=87, version=236730}

There are 3 nodes in the cluster; each is master-eligible and a data node. Elasticsearch 7.9.1, the Open Distro build from Amazon.
In addition, we write to Elasticsearch using Apache Metron, and writes sometimes fail with errors:

java.io.IOException: listener timeout after waiting for [30000] ms
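
For context on the write errors: the "listener timeout after waiting for [30000] ms" message comes from the Elasticsearch low-level REST client, not from the cluster itself. Below is a minimal sketch of how those client-side timeouts can be raised, assuming the writer builds its own RestClient; Metron may only expose equivalent settings through its own configuration, so treat the host, port, and timeout values as illustrative assumptions rather than a fix taken from this thread.

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class EsClientFactory {

    // Sketch: raise the low-level REST client's timeouts so slow bulk requests
    // are not abandoned after the default 30 seconds. The host list and the
    // timeout values here are assumptions for illustration only.
    public static RestClient build() {
        RestClientBuilder builder = RestClient.builder(
                new HttpHost("192.168.57.102", 9200, "http")); // add the other nodes as needed

        builder.setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)      // time allowed to establish a connection
                .setSocketTimeout(120_000));   // time allowed to wait for a response

        // Pre-7.0 clients (which Metron may still bundle) also have a separate
        // retry timeout that produces the "listener timeout" message:
        // builder.setMaxRetryTimeoutMillis(120_000);

        return builder.build();
    }
}

Raising client timeouts only hides the symptom if the nodes are genuinely unresponsive while they drop out of the cluster, so it is also worth checking GC pauses and load on the affected node.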

Do you have the logs from h1-es02 at that time, so we can see what happened to the node?

We faced this problem again. Now I have the log, but I can't figure out how to upload it to this forum. It seems I can't edit my initial message, and I can't attach any file to a new message unless it is a picture.

We had the same situation. The ES version is 7.10.2.
The master node logs show the data node leaving (node-left), but the data node logs show “master not discovered yet”.

Server layout
172.16.22.153 esnode1
172.16.22.154 esnode2
172.16.22.155 esnode3
172.16.22.190 esnode4
172.16.22.191 esnode5
172.16.22.192 esnode6
172.16.22.193 esnode7
172.16.22.194 esnode8
172.16.22.195 esnode9

Related logs:
Master node (esnode1):
[2021-03-25T13:58:31,547][INFO ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [10s] publication of cluster state version [4502] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:58:51,549][WARN ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [30s] publication of cluster state version [4502] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:58:51,552][INFO ][o.e.c.r.a.AllocationService] [esnode1] updating number_of_replicas to [4] for indices [.opendistro_security]
[2021-03-25T13:58:51,556][INFO ][o.e.c.s.MasterService ] [esnode1] node-left[{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir} reason: followers check retry count exceeded], term: 316, version: 4503, delta: removed {{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir}}
[2021-03-25T13:59:01,558][INFO ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [10s] publication of cluster state version [4503] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:59:21,558][INFO ][o.e.c.s.ClusterApplierService] [esnode1] removed {{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir}}, term: 316, version: 4503, reason: Publication{term=316, version=4503}

Data node (esnode5):
[2021-03-25T14:00:24,436][INFO ][o.e.c.c.Coordinator ] [esnode5] master node [{esnode1}{raDLHjOiTYaY_5ckIjnLVA}{VlAg-gG5Q72y0KTORWm-uQ}{172.16.22.153}{172.16.22.153:9300}{imr}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{esnode1}{raDLHjOiTYaY_5ckIjnLVA}{VlAg-gG5Q72y0KTORWm-uQ}{172.16.22.153}{172.16.22.153:9300}{imr}] failed [3] consecutive checks
at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.2.jar:7.10.2]
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode1][172.16.22.153:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir}] has been removed from the cluster

Please help identify the root cause. Thanks,
TM
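
For anyone mapping these messages to settings: "followers check retry count exceeded" and "failed [3] consecutive checks" come from the fault-detection checks between the elected master and the other nodes, and the "after [30s] publication ... is still waiting" warnings are tied to the cluster state publication timeout. The sketch below lists the elasticsearch.yml settings involved, with what I believe are the 7.x defaults; the usual root causes are long GC pauses, overloaded nodes, or network problems, and addressing those is generally preferable to loosening these timeouts.

# Static settings, shown with their default values for illustration only.
# The master removes a follower after this many consecutive failed checks:
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3

# A node restarts discovery after this many failed checks of the elected master:
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3

# A cluster state publication that is still incomplete after this long fails:
cluster.publish.timeout: 30s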

An update on this issue: we found another report in the Open Distro Security GitHub repository, issue #378.
It describes the same situation: a user logs in using AD, and then the node crashes. See the link below.

FYR.
TM