We have developed a custom Rollup plugin that performs rollups (aggregations) on indices. As part of this, we use the AdminClient and ClusterAdminClient APIs to create aliases and run composite aggregation queries on Elasticsearch. The plugin creates jobs that are scheduled via the Open Distro Job Scheduler.
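For context, here is a simplified sketch of roughly how the plugin calls these APIs from the job runner. It is not the exact plugin code; the index, alias, and field names are placeholders, and the real job paginates the composite aggregation and bulk-indexes the resulting buckets.

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;

public class RollupJobSketch {

    private final Client client;

    public RollupJobSketch(Client client) {
        this.client = client;
    }

    void runOnce() {
        // 1. Create an alias for the rollup target index via the indices admin client.
        //    ("rollup-target-000001" / "rollup-target" are placeholder names.)
        client.admin().indices()
                .prepareAliases()
                .addAlias("rollup-target-000001", "rollup-target")
                .get();

        // 2. Run a composite aggregation over the source index to build rollup buckets.
        List<CompositeValuesSourceBuilder<?>> sources = new ArrayList<>();
        sources.add(new TermsValuesSourceBuilder("group_field").field("group_field"));
        CompositeAggregationBuilder composite =
                AggregationBuilders.composite("rollup_buckets", sources).size(1000);

        SearchResponse response = client.prepareSearch("source-index")
                .setSize(0)
                .addAggregation(composite)
                .get();

        // 3. The buckets from `response` are then written to the rollup target index
        //    via bulk requests (omitted here). The real jobs loop over the composite
        //    `after_key` pages and are triggered by the Open Distro Job Scheduler.
    }
}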
Until now we were running Elasticsearch 7.2 and had no issues with the plugin. We are now testing an upgrade to Open Distro 1.13.2, and whenever a Rollup job is executed, after some queries the cluster is lost and master discovery fails as well.
Cluster state publication also starts timing out.
I have a 3-node cluster in which all 3 nodes are master-eligible, data, and ingest nodes. Can you please suggest why the cluster becomes unhealthy and queries start timing out while the plugin is executing queries?
The following errors are seen:
[2021-04-27T11:14:36,397][DEBUG][o.e.d.PeerFinder ] [elasticsearch-workernode-2] Peer{transportAddress=192.168.244.73:9300, discoveryNode={elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, peersRequestInFlight=false} peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-workernode-0][192.168.244.73:9300][internal:discovery/request_peers] request_id [14999] timed out after [3002ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1083) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
[2021-04-27T11:14:37,251][WARN ][r.suppressed ] [elasticsearch-workernode-2] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:190) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:590) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:452) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:624) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
[2021-04-27T11:14:39,263][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elasticsearch-workernode-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [TKv7M-4mTTS-kk_zJS4bKA, tRlF2GySSz2yfwsFqFHJdQ, lH8lhxLUTdWbOsPf_9Wj1g], have discovered [{elasticsearch-workernode-2}{lH8lhxLUTdWbOsPf_9Wj1g}{37Z2CPO7Sv-dpWU6jeGcag}{192.168.244.201}{192.168.244.201:9300}{dimr}, {elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, {elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}] which is a quorum; discovery will continue using [192.168.244.9:9300, 192.168.244.73:9300] from hosts providers and [{elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, {elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}, {elasticsearch-workernode-2}{lH8lhxLUTdWbOsPf_9Wj1g}{37Z2CPO7Sv-dpWU6jeGcag}{192.168.244.201}{192.168.244.201:9300}{dimr}] from last-known cluster state; node term 1, last-accepted version 169 in term 1
[2021-04-27T11:14:44,399][DEBUG][o.e.d.PeerFinder ] [elasticsearch-workernode-2] Peer{transportAddress=192.168.244.9:9300, discoveryNode={elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}, peersRequestInFlight=false} peers request failed