We have developed a custom Rollup plugin that performs rollups (aggregations) on indices. As part of this, we use the AdminClient and ClusterAdminClient APIs to create aliases and run composite aggregation queries on Elasticsearch. The plugin creates jobs that are scheduled via the Open Distro Job Scheduler.
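For context, here is a simplified sketch of roughly how the plugin calls these APIs from the job runner. It is not the exact plugin code; the index, alias, and field names are placeholders, and the real job paginates the composite aggregation and bulk-indexes the resulting buckets.

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSourceBuilder;
import org.elasticsearch.search.aggregations.bucket.composite.TermsValuesSourceBuilder;

public class RollupJobSketch {

    private final Client client;

    public RollupJobSketch(Client client) {
        this.client = client;
    }

    void runOnce() {
        // 1. Create an alias for the rollup target index via the indices admin client.
        //    ("rollup-target-000001" / "rollup-target" are placeholder names.)
        client.admin().indices()
                .prepareAliases()
                .addAlias("rollup-target-000001", "rollup-target")
                .get();

        // 2. Run a composite aggregation over the source index to build rollup buckets.
        List<CompositeValuesSourceBuilder<?>> sources = new ArrayList<>();
        sources.add(new TermsValuesSourceBuilder("group_field").field("group_field"));
        CompositeAggregationBuilder composite =
                AggregationBuilders.composite("rollup_buckets", sources).size(1000);

        SearchResponse response = client.prepareSearch("source-index")
                .setSize(0)
                .addAggregation(composite)
                .get();

        // 3. The buckets from `response` are then written to the rollup target index
        //    via bulk requests (omitted here). The real jobs loop over the composite
        //    `after_key` pages and are triggered by the Open Distro Job Scheduler.
    }
}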
Until now we were running Elasticsearch 7.2 and had no issues with the plugin. We are now testing an upgrade to Open Distro 1.13.2, and whenever a Rollup job is executed, after some queries the cluster is lost and master discovery fails as well.
Cluster state publication also starts timing out.
I have a 3-node cluster in which all 3 nodes are master-eligible, data, and ingest nodes. Can you please suggest why the cluster becomes unhealthy and queries start timing out while the plugin is executing queries?
The following errors are seen:
[2021-04-27T11:14:36,397][DEBUG][o.e.d.PeerFinder ] [elasticsearch-workernode-2] Peer{transportAddress=192.168.244.73:9300, discoveryNode={elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, peersRequestInFlight=false} peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-workernode-0][192.168.244.73:9300][internal:discovery/request_peers] request_id [14999] timed out after [3002ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1083) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
[2021-04-27T11:14:37,251][WARN ][r.suppressed ] [elasticsearch-workernode-2] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:190) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:590) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:452) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:624) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
[2021-04-27T11:14:39,263][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elasticsearch-workernode-2] master not discovered or elected yet, an election requires at least 2 nodes with ids from [TKv7M-4mTTS-kk_zJS4bKA, tRlF2GySSz2yfwsFqFHJdQ, lH8lhxLUTdWbOsPf_9Wj1g], have discovered [{elasticsearch-workernode-2}{lH8lhxLUTdWbOsPf_9Wj1g}{37Z2CPO7Sv-dpWU6jeGcag}{192.168.244.201}{192.168.244.201:9300}{dimr}, {elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, {elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}] which is a quorum; discovery will continue using [192.168.244.9:9300, 192.168.244.73:9300] from hosts providers and [{elasticsearch-workernode-0}{TKv7M-4mTTS-kk_zJS4bKA}{lYRMBiPxTveaA0srz0mV7w}{192.168.244.73}{192.168.244.73:9300}{dimr}, {elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}, {elasticsearch-workernode-2}{lH8lhxLUTdWbOsPf_9Wj1g}{37Z2CPO7Sv-dpWU6jeGcag}{192.168.244.201}{192.168.244.201:9300}{dimr}] from last-known cluster state; node term 1, last-accepted version 169 in term 1
[2021-04-27T11:14:44,399][DEBUG][o.e.d.PeerFinder ] [elasticsearch-workernode-2] Peer{transportAddress=192.168.244.9:9300, discoveryNode={elasticsearch-workernode-1}{tRlF2GySSz2yfwsFqFHJdQ}{DnW3CjoqSQa3Jz7ifugwkg}{192.168.244.9}{192.168.244.9:9300}{dimr}, peersRequestInFlight=false} peers request failed