Nodes fall out of the cluster (ES 7.9.1)

Hello, we have the following problem: nodes periodically drop out of the cluster.

[2020-11-19T09:00:37,415][INFO ][o.e.c.s.MasterService    ] [h1-es03] node-left[{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr} reason: followers check retry count exceeded], term: 87, version: 236727, delta: removed {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}
[2020-11-19T09:00:39,763][INFO ][o.e.c.s.ClusterApplierService] [h1-es03] removed {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}, term: 87, version: 236727, reason: Publication{term=87, version=236727}
[2020-11-19T09:00:47,890][INFO ][o.e.c.s.MasterService    ] [h1-es03] node-join[{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr} join existing leader], term: 87, version: 236730, delta: added {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}
[2020-11-19T09:00:52,713][INFO ][o.e.c.s.ClusterApplierService] [h1-es03] added {{h1-es02}{qgmMV2UbT-ScN9uRr6YM8g}{klj1K8UMRHGtYpYozIuqsA}{192.168.57.102}{192.168.57.102:9300}{dimr}}, term: 87, version: 236730, reason: Publication{term=87, version=236730}

There are 3 nodes in the cluster; each is master-eligible and a data node. Elasticsearch 7.9.1, the Open Distro build from Amazon.
In addition, we write to Elasticsearch using Apache Metron, and writes sometimes fail with errors:

java.io.IOException: listener timeout after waiting for [30000] ms
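
For context on the write errors: the "listener timeout after waiting for [30000] ms" message comes from the Elasticsearch low-level REST client, not from the cluster itself. Below is a minimal sketch of how those client-side timeouts can be raised, assuming the writer builds its own RestClient; Metron may only expose equivalent settings through its own configuration, so treat the host, port, and timeout values as illustrative assumptions rather than a fix taken from this thread.

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class EsClientFactory {

    // Sketch: raise the low-level REST client's timeouts so slow bulk requests
    // are not abandoned after the default 30 seconds. The host list and the
    // timeout values here are assumptions for illustration only.
    public static RestClient build() {
        RestClientBuilder builder = RestClient.builder(
                new HttpHost("192.168.57.102", 9200, "http")); // add the other nodes as needed

        builder.setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)      // time allowed to establish a connection
                .setSocketTimeout(120_000));   // time allowed to wait for a response

        // Pre-7.0 clients (which Metron may still bundle) also have a separate
        // retry timeout that produces the "listener timeout" message:
        // builder.setMaxRetryTimeoutMillis(120_000);

        return builder.build();
    }
}

Raising client timeouts only hides the symptom if the nodes are genuinely unresponsive while they drop out of the cluster, so it is also worth checking GC pauses and load on the affected node.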

Do you have the logs from h1-es02 at that time, so we can see what happened to the node?

We faced this problem again. Now I have the log, but I can't figure out how to upload it to this forum. It seems I can't edit my initial message, and I can't attach any file to a new message unless it is a picture.

We had the same situation. The ES version is 7.10.2.
The master node logs show the data node leaving (node-left), but the data node logs show “master not discovered yet”.

Server layout
172.16.22.153 esnode1
172.16.22.154 esnode2
172.16.22.155 esnode3
172.16.22.190 esnode4
172.16.22.191 esnode5
172.16.22.192 esnode6
172.16.22.193 esnode7
172.16.22.194 esnode8
172.16.22.195 esnode9

Related logs:
Master node (esnode1):
[2021-03-25T13:58:31,547][INFO ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [10s] publication of cluster state version [4502] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:58:51,549][WARN ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [30s] publication of cluster state version [4502] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:58:51,552][INFO ][o.e.c.r.a.AllocationService] [esnode1] updating number_of_replicas to [4] for indices [.opendistro_security]
[2021-03-25T13:58:51,556][INFO ][o.e.c.s.MasterService ] [esnode1] node-left[{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir} reason: followers check retry count exceeded], term: 316, version: 4503, delta: removed {{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir}}
[2021-03-25T13:59:01,558][INFO ][o.e.c.c.C.CoordinatorPublication] [esnode1] after [10s] publication of cluster state version [4503] is still waiting for {esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode6}{MsHrFuhtR2yp0JGSRsqS5w}{lA8_OIWuQm2gZxdXsEDEdA}{172.16.22.192}{172.16.22.192:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode7}{Yj-61cgOQ--50f8cila67Q}{BSFsIQiSSKSPFUCQ5g65dQ}{172.16.22.193}{172.16.22.193:9300}{dir} [SENT_PUBLISH_REQUEST], {esnode4}{1jH5VI7PQbuLNVWtxhvw8Q}{RpBC6FawS82nBTCjvkh9LQ}{172.16.22.190}{172.16.22.190:9300}{dir} [SENT_PUBLISH_REQUEST]
[2021-03-25T13:59:21,558][INFO ][o.e.c.s.ClusterApplierService] [esnode1] removed {{esnode8}{HeEjBS5JSCSYeP2zr2MPWA}{nHAb6LFYT6-YjtXO72CO2g}{172.16.22.194}{172.16.22.194:9300}{dir}}, term: 316, version: 4503, reason: Publication{term=316, version=4503}

Data node (esnode5):
[2021-03-25T14:00:24,436][INFO ][o.e.c.c.Coordinator ] [esnode5] master node [{esnode1}{raDLHjOiTYaY_5ckIjnLVA}{VlAg-gG5Q72y0KTORWm-uQ}{172.16.22.153}{172.16.22.153:9300}{imr}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{esnode1}{raDLHjOiTYaY_5ckIjnLVA}{VlAg-gG5Q72y0KTORWm-uQ}{172.16.22.153}{172.16.22.153:9300}{imr}] failed [3] consecutive checks
at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.2.jar:7.10.2]
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode1][172.16.22.153:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{esnode5}{RPzC_iENSOiSynpEvT0zag}{T4D5QV6jRvu43I_puZ2iXA}{172.16.22.191}{172.16.22.191:9300}{dir}] has been removed from the cluster

Please help identify the root cause. Thanks,
TM
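
For anyone mapping these messages to settings: "followers check retry count exceeded" and "failed [3] consecutive checks" come from the fault-detection checks between the elected master and the other nodes, and the "after [30s] publication ... is still waiting" warnings are tied to the cluster state publication timeout. The sketch below lists the elasticsearch.yml settings involved, with what I believe are the 7.x defaults; the usual root causes are long GC pauses, overloaded nodes, or network problems, and addressing those is generally preferable to loosening these timeouts.

# Static settings, shown with their default values for illustration only.
# The master removes a follower after this many consecutive failed checks:
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3

# A node restarts discovery after this many failed checks of the elected master:
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3

# A cluster state publication that is still incomplete after this long fails:
cluster.publish.timeout: 30s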

An update on this issue: we found another report in the Open Distro Security GitHub repository, issue #378.
It describes the same situation: a user logs in using AD, and then the node crashes. See the link below.

FYR.
TM