Versions:
OpenSearch 1.2.4 on OEL7
Describe the issue:
- Master nodes (masternode01/02/03) start showing timeout errors:
2023-02-10T08:51:12,229 worker][T#4] [W] org.ope.tra.InboundHandler - [UID=] - handling inbound transport message [InboundMessage{Header{90}{1.2.4}{5251905774}{true}{false}{false}{false}{cluster:monitor/state}}] took [7267ms] which is above the warn threshold of [5000ms]
2023-02-10T08:51:16,003 worker][T#2] [W] org.ope.tra.TransportService - [UID=] - Received response for a request that has timed out, sent [15305ms] ago, timed out [5186ms] ago, action [internal:coordination/fault_detection/follower_check], node [{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true}], id [101082954]
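For context, `internal:coordination/fault_detection/follower_check` is the elected master's periodic health probe of every other node; in this version the defaults are a 1s check interval, a 10s timeout, and 3 retries, so a node that stays unresponsive for roughly 30s gets removed from the cluster. The "sent [15305ms] ago, timed out [5186ms] ago" figures above are consistent with that ~10s timeout. Below is a minimal sketch in Python (with the third-party `requests` package) to confirm the effective fault-detection settings; the hostname, port 9200, TLS with self-signed certs, and lack of auth are assumptions, so adjust for your setup:

```python
import requests  # third-party package: pip install requests

# Assumptions: REST API on port 9200, self-signed TLS, no auth configured.
BASE = "https://masternode01.example.com:9200"

# include_defaults=true also returns settings that were never set explicitly;
# the fault-detection settings are static, so unless they were overridden in
# opensearch.yml they will show up under "defaults".
resp = requests.get(
    f"{BASE}/_cluster/settings",
    params={
        "include_defaults": "true",
        "filter_path": "*.cluster.fault_detection.*",
    },
    verify=False,  # assumption: self-signed internal certs
)
resp.raise_for_status()
print(resp.json())  # follower_check/leader_check interval, timeout, retry_count
```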
- Shortly afterwards, the primary master throws the following exception:
2023-02-10T08:52:54,486 teTask][T#1] [W] org.ope.clu.ser.MasterService - [UID=] - failing [node-left[{qosdatapm03.example.com}{AfsN7i6SQK-12qQgX9j1ig}{FD6tmDOkT7ec_uFJhQgqbQ}{qosdatapm03.example.com}{172.20.18.37:9041}{d}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} reason: followers check retry count exceeded]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1681) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:296) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:118) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:80) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1592) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:138) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:189) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.access$500(Publication.java:55) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication$PublicationTarget.onFaultyNode(Publication.java:324) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.lambda$onFaultyNode$2(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
at java.util.ArrayList.forEach(ArrayList.java:1541) ~[?:?]
at org.opensearch.cluster.coordination.Publication.onFaultyNode(Publication.java:106) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Publication.start(Publication.java:83) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1303) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:887) [?:?]
Caused by: org.opensearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum
at org.opensearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:187) ~[opensearch-1.2.4.jar:1.2.4]
... 19 more
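My reading of the "non-failed nodes do not form a quorum" cause: committing a cluster state requires acks from a majority of the voting configuration (here, presumably the three master-eligible nodes). The publication had already marked masternode01 faulty, and masternode03's follower checks were timing out, which suggests that at publication time fewer than two of the three voting nodes were reachable, so version [19174] could never be committed. A sketch, same assumptions as above, to see which node IDs make up the committed voting configuration:

```python
import requests

BASE = "https://masternode01.example.com:9200"  # assumptions as above

# The coordination metadata lists the node IDs whose majority must ack a
# cluster state publication before it can be committed.
state = requests.get(
    f"{BASE}/_cluster/state/metadata",
    params={"filter_path": "metadata.cluster_coordination"},
    verify=False,  # assumption: self-signed internal certs
).json()
coord = state["metadata"]["cluster_coordination"]

# full_id=true returns untruncated node IDs so they can be matched up
# against the voting configuration.
nodes = requests.get(
    f"{BASE}/_cat/nodes",
    params={"format": "json", "full_id": "true", "h": "id,name,node.role"},
    verify=False,
).json()
names = {n["id"]: n["name"] for n in nodes}

print("term:", coord.get("term"))
print("voting config:",
      [names.get(i, i) for i in coord.get("last_committed_config", [])])
```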
- All master nodes then throw errors like the one below for about 4 hours, until all the master nodes were restarted:
2023-02-10T08:52:55,009 teTask][T#1] [W] org.ope.clu.ser.MasterService - [UID=] - failing [elected-as-master ([2] nodes joined)[{masternode03.example.com}{aLlNr3otQIqjrGfcJgr1PA}{Vt7e2b1XQ_afLPLGeq7OPg}{masternode03.example.com}{xxx.xxx.xxx.252:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, {masternode02.example.com}{VeWJEcurTOOQ0qAOoHfppQ}{BcrkMr6aT9i9NWhZOu7dng}{masternode02.example.com}{xxx.xxx.xxx.251:9041}{m}{shard_indexing_pressure_enabled=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], node-join[{masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader, {masternode01.example.com}{TyestOnKSi22HxoORZfhLw}{cWB0_1xCTry1OIeSvtbLMQ}{masternode01.example.com}{xxx.xxx.xxx.250:9041}{m}{shard_indexing_pressure_enabled=true} join existing leader]]: failed to commit cluster state version [19174]
org.opensearch.cluster.coordination.FailedToCommitClusterStateException: node is no longer master for term 8 while handling publication
at org.opensearch.cluster.coordination.Coordinator.publish(Coordinator.java:1257) ~[opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.publish(MasterService.java:303) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:175) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:213) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275) [opensearch-1.2.4.jar:1.2.4]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238) [opensearch-1.2.4.jar:1.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:887) [?:?]
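"node is no longer master for term 8" means the node won an election (the elected-as-master task above) but was deposed before it could publish; repeating for hours, this looks like an election storm: each candidate wins a term, fails its first publication for lack of a quorum of healthy followers, and a new term begins. One way to observe this from outside, same assumptions as the earlier sketches, is to watch each node's locally-accepted coordination term; a term that keeps climbing with no stable master confirms the churn:

```python
import requests
import time

# Hypothetical hostnames matching the redacted ones in the logs above.
MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]

# local=true reads each node's own copy of the cluster state, so this
# still answers while no master can be elected.
for _ in range(5):
    for host in MASTERS:
        try:
            r = requests.get(
                f"https://{host}:9200/_cluster/state/metadata",
                params={
                    "local": "true",
                    "filter_path": "metadata.cluster_coordination.term",
                },
                timeout=5,
                verify=False,  # assumption: self-signed internal certs
            )
            term = (
                r.json()
                .get("metadata", {})
                .get("cluster_coordination", {})
                .get("term")
            )
            print(f"{host}: term={term}")
        except requests.RequestException as exc:
            print(f"{host}: unreachable ({exc})")
    time.sleep(10)
```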
Observations:
- The issue seems to have started right after the primary master was vMotioned to a different VM host. (A vMotion can briefly stun or pause the guest, which would explain the transport warnings and failed follower checks above.)
Questions:
- Why were the masters unable to commit the cluster state or elect a new master? With three master-eligible nodes, shouldn't a new master have been elected even if the primary master went down or became unreachable?
- The cluster reported green status the whole time the masters were disconnected, which delayed alerting on the issue. Why did the cluster not move to red status?
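My understanding is that health status (green/yellow/red) is derived purely from shard allocation in the last committed cluster state; since no new state could be committed, every node kept serving the last good state, in which all shards were assigned, so the status never left green. For alerting, a check like the following would have caught this even while health stayed green; a minimal sketch under the same assumptions (hypothetical hostnames, port 9200, self-signed TLS, no auth):

```python
import requests

# Hypothetical hostnames; point this at all master-eligible nodes.
MASTERS = [
    "masternode01.example.com",
    "masternode02.example.com",
    "masternode03.example.com",
]

views = {}
for host in MASTERS:
    try:
        # local=true returns the node's own view of the elected master and
        # still answers while elections are failing.
        r = requests.get(
            f"https://{host}:9200/_cluster/state/master_node",
            params={"local": "true"},
            timeout=5,
            verify=False,  # assumption: self-signed internal certs
        )
        views[host] = r.json().get("master_node")
    except requests.RequestException as exc:
        views[host] = f"unreachable ({exc})"

distinct = set(views.values())
if len(distinct) != 1 or None in distinct:
    # Fire this even though _cluster/health may still report green.
    print("ALERT: master-eligible nodes do not agree on a master:", views)
else:
    print("OK: all nodes agree on master:", views)
```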
Configuration:
NA
Relevant Logs or Screenshots:
Complete logs from all 3 master nodes, from the point where all was well to the point where they stopped talking to each other (the third bullet above).
masternode01.log - Pastebin.com (master: masternode01)
masternode02.log - Pastebin.com (Primary master: masternode02)
masternode03.log - Pastebin.com (master: masternode03)