Master nodes unable to commit cluster state; cluster stuck until master nodes were restarted

Versions:
OpenSearch 1.2.4 on OEL7

Describe the issue:

  1. Master nodes (masternode01/02/03) start showing timeout errors:
2023-02-10T08:51:12,229 worker][T#4] [W] org.ope.tra.InboundHandler          - [UID=] - handling inbound transport message [InboundMessage{Header{90}{1.2.4}{5251905774}{true}{false}{false}{false}{cluster:monitor/state}}] took [7267ms] which is above the warn threshold of [5000ms]
2023-02-10T08:51:16,003 worker][T#2] [W] org.ope.tra.TransportService        - [UID=] - Received response for a request that has timed out, sent [15305ms] ago, timed out [5186ms] ago, action [internal:coordination/fault_detection/follower_check], node [{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true}], id [101082954]
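
The first warning above shows a transport worker taking ~7s just to handle a cluster:monitor/state message, which to me points at a stalled or resource-starved node rather than the network alone. To see whether each master's HTTP layer is similarly slow, I can run a quick probe like this sketch (assuming plain HTTP on port 9200 with no security plugin in the path; the hostnames are the same placeholders as in the logs):

import time
import urllib.request

# Placeholder master hostnames and HTTP port; substitute your own.
MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]
HTTP_PORT = 9200

for host in MASTERS:
    url = f"http://{host}:{HTTP_PORT}/"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        print(f"{host}: responded in {time.monotonic() - start:.2f}s")
    except Exception as exc:
        print(f"{host}: failed after {time.monotonic() - start:.2f}s ({exc})")
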
  2. Shortly afterwards, the primary master throws the following exception:
2023-02-10T08:52:54,486 teTask][T#1] [W] org.ope.clu.ser.MasterService       - [UID=] - failing [node-left[{qosdatapm03.example.com}{AfsN7i6SQK-12qQgX9j1ig}{FD6tmDOkT7ec_uFJhQgqbQ}{qosdatapm03.example.com}{172.20.18.37:9041}{d}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1681) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:118) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:80) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1592) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:138) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:189) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication.access$500(Publication.java:55) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:324) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
        at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
        at org.opensearch.cluster.coordination.Publication.onFaultyNode(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Publication.start(Publication.java:83) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1303) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:887) [?:?]
Caused by: org.opensearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
        at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:187) ~[opensearch-1.2.4.jar:1.2.4]
        ... 19 more
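
My reading of the "non-failed nodes do not form a quorum" cause: the elected master could not collect publish acknowledgements from a majority of the voting configuration before the publish timeout (cluster.publish.timeout, 30s by default, if I read the docs right). With three master-eligible nodes the quorum is two, so the master plus at least one other voter must ack every state version. A sketch to dump the current voting configuration, under the same plain-HTTP/port-9200 assumptions:

import json
import urllib.request

# Assumption: plain HTTP on 9200; adjust host/scheme/auth as needed.
URL = ("http://masternode01.example.com:9200/_cluster/state"
       "?filter_path=metadata.cluster_coordination")

with urllib.request.urlopen(URL, timeout=10) as resp:
    body = json.load(resp)

# last_committed_config holds the node IDs whose majority must
# acknowledge every cluster state publication.
print(json.dumps(body["metadata"]["cluster_coordination"], indent=2))
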
  3. All master nodes keep throwing errors like the one below for about 4 hours, until all the master nodes were restarted:
2023-02-10T08:52:55,009 teTask][T#1] [W] org.ope.clu.ser.MasterService       - [UID=] - failing [elected-as-master ([2] nodes joined)[{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, {masternode02.example.com}{VeWJEcurTOOQ0qAOoHfppQ}{BcrkMr6aT9i9NWhZOu7dng}{masternode02.example.com}{xxx.xxx.xxx.251:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], node-join[{masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 8 while handling publication
        at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1257) ~[opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:887) [?:?]
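
The "node is no longer master for term 8" errors repeating for hours suggest the masters kept winning elections and then immediately losing them again, so no term lived long enough to commit version 19174. If this recurs, I plan to compare each master's local view of the leader with something like the sketch below (local=true should avoid waiting on an elected master; same placeholder hostnames):

import json
import urllib.request

MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]

for host in MASTERS:
    # local=true returns this node's own view of the cluster state
    # without requiring an elected master to answer.
    url = f"http://{host}:9200/_cluster/state/master_node?local=true"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            state = json.load(resp)
        print(f"{host} believes master is: {state.get('master_node')}")
    except Exception as exc:
        print(f"{host}: query failed ({exc})")
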

Observations:

  1. The issue appears to have started right after the primary master was vMotioned to a different VM host; the rough timing math below suggests why a vMotion stun could trigger this.
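
If the documented fault-detection defaults apply here (cluster.fault_detection.follower_check.interval=1s, timeout=10s, retry_count=3; I have not overridden them, but please correct me if the defaults differ in 1.2.4), a vMotion stun only needs to freeze the VM for roughly half a minute before "follower check retry count exceeded" fires. Back-of-envelope:

# Documented defaults; verify against opensearch.yml before trusting this.
interval_s = 1       # cluster.fault_detection.follower_check.interval
timeout_s = 10       # cluster.fault_detection.follower_check.timeout
retry_count = 3      # cluster.fault_detection.follower_check.retry_count

# Each failed attempt burns one interval plus one timeout before the
# next retry; after retry_count consecutive failures the node is removed.
detection_window_s = retry_count * (interval_s + timeout_s)
print(f"node-left after roughly {detection_window_s}s of unresponsiveness")
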

Questions:

  1. Why are the masters unable to commit cluster state or elect a new master? Shouldn't a new master be elected even if the primary master goes down or becomes unreachable?
  2. The cluster stayed green the whole time the masters were disconnected, which delayed alerting on this issue. Why did the cluster not move to red?
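My own guess on question 2, for what it's worth: health is computed from the shard routing table in the last committed cluster state, and since the node-left task never committed, the routing table still showed every shard assigned, so health stayed green even though no new state could be published. If that is right, alerting should also watch for the absence of an elected master, e.g. by wrapping the local master_node probe from above (same placeholder host and plain-HTTP assumption):

import json
import urllib.request

# Point at any node; local=true answers even without an elected master.
URL = "http://masternode01.example.com:9200/_cluster/state/master_node?local=true"

def master_present() -> bool:
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return json.load(resp).get("master_node") is not None
    except Exception:
        return False

if not master_present():
    print("ALERT: no elected master; cluster state updates are frozen")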

Configuration:
NA

Relevant Logs or Screenshots:

Complete logs from all three master nodes, from the point where everything was healthy to the point where they stopped talking to each other (step 3):

masternode01.log - Pastebin.com (master: masternode01)
masternode02.log - Pastebin.com (Primary master: masternode02)
masternode03.log - Pastebin.com (master: masternode03)

Any pointers, anyone?