Versions:
OpenSearch 1.2.4 on OEL7
Describe the issue:
- Master nodes (masternode01/02/03) start showing timeout errors:
2023-02-10T08:51:12,229 worker][T#4] [W] org.ope.tra.InboundHandler - [UID=] - handling inbound transport message [InboundMessage{Header{90}{1.2.4}{5251905774}{true}{false}{false}{false}{cluster:monitor/state}}] took [7267ms] which is above the warn threshold of [5000ms]
2023-02-10T08:51:16,003 worker][T#2] [W] org.ope.tra.TransportService - [UID=] - Received response for a request that has timed out, sent [15305ms] ago, timed out [5186ms] ago, action [internal:coordination/fault_detection/follower_check], node [{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true}], id [101082954]
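For context, `internal:coordination/fault_detection/follower_check` is the elected master's periodic health probe of every other node; in this version the defaults are a 1s check interval, a 10s timeout, and 3 retries, so a node that stays unresponsive for roughly 30s gets removed from the cluster. The "sent [15305ms] ago, timed out [5186ms] ago" figures above are consistent with that ~10s timeout. Below is a minimal sketch in Python (with the third-party `requests` package) to confirm the effective fault-detection settings; the hostname, port 9200, TLS with self-signed certs, and lack of auth are assumptions, so adjust for your setup:

```python
import requests  # third-party package: pip install requests

# Assumptions: REST API on port 9200, self-signed TLS, no auth configured.
BASE = "https://masternode01.example.com:9200"

# include_defaults=true also returns settings that were never set explicitly;
# the fault-detection settings are static, so unless they were overridden in
# opensearch.yml they will show up under "defaults".
resp = requests.get(
    f"{BASE}/_cluster/settings",
    params={
        "include_defaults": "true",
        "filter_path": "*.cluster.fault_detection.*",
    },
    verify=False,  # assumption: self-signed internal certs
)
resp.raise_for_status()
print(resp.json())  # follower_check/leader_check interval, timeout, retry_count
```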
- Shortly afterwards, the primary master throws the following exception:
2023-02-10T08:52:54,486 teTask][T#1] [W] org.ope.clu.ser.MasterService - [UID=] - failing [node-left[{qosdatapm03.example.com}{AfsN7i6SQK-12qQgX9j1ig}{FD6tmDOkT7ec_uFJhQgqbQ}{qosdatapm03.example.com}{172.20.18.37:9041}{d}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1681) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:118) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:80) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1592) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:138) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:189) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.access$500(Publication.java:55) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:324) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at org.opensearch.cluster.coordination.Publication.onFaultyNode(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.start(Publication.java:83) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1303) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:887) [?:?]
Caused by: org.opensearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:187) ~[opensearch-1.2.4.jar:1.2.4]
... 19 more
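My reading of the "non-failed nodes do not form a quorum" cause: committing a cluster state requires acks from a majority of the voting configuration (here, presumably the three master-eligible nodes). The publication had already marked masternode01 faulty, and masternode03's follower checks were timing out, which suggests that at publication time fewer than two of the three voting nodes were reachable, so version [19174] could never be committed. A sketch, same assumptions as above, to see which node IDs make up the committed voting configuration:

```python
import requests

BASE = "https://masternode01.example.com:9200"  # assumptions as above

# The coordination metadata lists the node IDs whose majority must ack a
# cluster state publication before it can be committed.
state = requests.get(
    f"{BASE}/_cluster/state/metadata",
    params={"filter_path": "metadata.cluster_coordination"},
    verify=False,  # assumption: self-signed internal certs
).json()
coord = state["metadata"]["cluster_coordination"]

# full_id=true returns untruncated node IDs so they can be matched up
# against the voting configuration.
nodes = requests.get(
    f"{BASE}/_cat/nodes",
    params={"format": "json", "full_id": "true", "h": "id,name,node.role"},
    verify=False,
).json()
names = {n["id"]: n["name"] for n in nodes}

print("term:", coord.get("term"))
print("voting config:",
      [names.get(i, i) for i in coord.get("last_committed_config", [])])
```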
- All master nodes then throw errors like the one below for about 4 hours, until all the master nodes were restarted:
2023-02-10T08:52:55,009 teTask][T#1] [W] org.ope.clu.ser.MasterService - [UID=] - failing [elected-as-master ([2] nodes joined)[{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, {masternode02.example.com}{VeWJEcurTOOQ0qAOoHfppQ}{BcrkMr6aT9i9NWhZOu7dng}{masternode02.example.com}{xxx.xxx.xxx.251:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], node-join[{masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 8 while handling publication
at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1257) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:887) [?:?]
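"node is no longer master for term 8" means the node won an election (the elected-as-master task above) but was deposed before it could publish; repeating for hours, this looks like an election storm: each candidate wins a term, fails its first publication for lack of a quorum of healthy followers, and a new term begins. One way to observe this from outside, same assumptions as the earlier sketches, is to watch each node's locally-accepted coordination term; a term that keeps climbing with no stable master confirms the churn:

```python
import requests
import time

# Hypothetical hostnames matching the redacted ones in the logs above.
MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]

# local=true reads each node's own copy of the cluster state, so this
# still answers while no master can be elected.
for _ in range(5):
    for host in MASTERS:
        try:
            r = requests.get(
                f"https://{host}:9200/_cluster/state/metadata",
                params={
                    "local": "true",
                    "filter_path": "metadata.cluster_coordination.term",
                },
                timeout=5,
                verify=False,  # assumption: self-signed internal certs
            )
            term = (
                r.json()
                .get("metadata", {})
                .get("cluster_coordination", {})
                .get("term")
            )
            print(f"{host}: term={term}")
        except requests.RequestException as exc:
            print(f"{host}: unreachable ({exc})")
    time.sleep(10)
```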
Observations:
- The issue seems to have started right after the primary master was vMotioned to a different VM host. (A vMotion can briefly stun or pause the guest, which would explain the transport warnings and failed follower checks above.)
Questions:
- Why were the masters unable to commit the cluster state or elect a new master? With three master-eligible nodes, shouldn't a new master have been elected even if the primary master went down or became unreachable?
- The cluster reported green status the whole time the masters were disconnected, which delayed alerting on the issue. Why did the cluster not move to red status?
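My understanding is that health status (green/yellow/red) is derived purely from shard allocation in the last committed cluster state; since no new state could be committed, every node kept serving the last good state, in which all shards were assigned, so the status never left green. For alerting, a check like the following would have caught this even while health stayed green; a minimal sketch under the same assumptions (hypothetical hostnames, port 9200, self-signed TLS, no auth):

```python
import requests

# Hypothetical hostnames; point this at all master-eligible nodes.
MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]

views = {}
for host in MASTERS:
    try:
        # local=true returns the node's own view of the elected master and
        # still answers while elections are failing.
        r = requests.get(
            f"https://{host}:9200/_cluster/state/master_node",
            params={"local": "true"},
            timeout=5,
            verify=False,  # assumption: self-signed internal certs
        )
        views[host] = r.json().get("master_node")
    except requests.RequestException as exc:
        views[host] = f"unreachable ({exc})"

distinct = set(views.values())
if len(distinct) != 1 or None in distinct:
    # Fire this even though _cluster/health may still report green.
    print("ALERT: master-eligible nodes do not agree on a master:", views)
else:
    print("OK: all nodes agree on master:", views)
```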
Configuration:
NA
Relevant Logs or Screenshots:
Complete logs from all 3 master nodes, from the point where all was well to the point where they stopped talking to each other (the third bullet above).
masternode01.log - Pastebin.com (master: masternode01)
masternode02.log - Pastebin.com (Primary master: masternode02)
masternode03.log - Pastebin.com (master: masternode03)