We have upgraded a few of our clusters to OD 1.12. On two occasions (on different clusters) a master node has terminated and then started back up again (not an issue in itself), but the cluster has been unable to remove it from the cluster state. The removal task sits at the head of the queue, blocking all other tasks. Both times this happened we ended up restarting the whole cluster. We are now going to try to reproduce this in our dev environment, but has anybody else seen this issue?
Just to clarify, here is the sequence of events:
- the active master gets killed (`kill -9`)
- the next master takes over and everything seems fine
- the pending tasks list shows node-left and election-to-master as insert orders 1 and 2
- these never clear, and all other tasks pile up behind them
- the terminated previous master never gets removed from `_cat/nodes` (it still shows an IP and node name, but with no metrics)
- when the terminated node restarts, it fails to join, apparently because its ephemeral ID has changed while the cluster state still holds the old one; its node-join task appears further down the ever-growing task list each time it attempts to rejoin
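For anyone hitting the same thing, this is roughly how we observed the stuck state (a sketch assuming the cluster answers on `localhost:9200`; adjust host, port, and any auth to your setup):

```shell
# Pending cluster-state tasks: the node-left task sits at insert order 1
# and everything else queues up behind it.
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'

# The terminated master still appears here, with an IP and node name
# but no heap/CPU/load metrics.
curl -s 'http://localhost:9200/_cat/nodes?v'
```

Both endpoints are read-only, so they respond even while the master task queue is blocked.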
It seems that because task processing has stalled, the old node is never removed from the cluster state, so the new one can't be let in. Luckily the pending tasks appear to be memory-resident only, so a full cluster shutdown and startup clears them and the cluster comes back up fine without issue.
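A rough sketch of the full-restart workaround, assuming systemd-managed nodes and a cluster on `localhost:9200` (service name and host are placeholders for whatever your deployment uses):

```shell
# Optionally disable shard allocation first to limit shard shuffling on
# restart. Note this settings update is itself a cluster-state task, so it
# may queue behind the stuck node-left task and time out.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.enable":"none"}}'

# Stop every node in the cluster, then start them all again.
sudo systemctl stop elasticsearch     # run on every node
sudo systemctl start elasticsearch    # run on every node

# Once the cluster has re-formed, restore the default allocation setting.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.enable":null}}'
```

Because the pending tasks only live in memory, they are gone once every node has gone down together.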