@everbeck32 Managing indices and shards consumes master node resources, and adding or removing data nodes involves shard movement and recovery, which also consumes master node resources. Something may have happened mid-Sunday, e.g. an expensive query or a resource-intensive job, that caused the problem.
The first thing we need to do is stabilize the cluster. Reducing or stopping traffic to your cluster may help.
In addition, if you still have access to your cluster state and stats APIs, you may want to check the pending tasks, along with each node's CPU and JVM metrics, to make sure no node is overloaded.
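For reference, here is a rough Python sketch of that check, assuming the APIs are reachable. It reads `/_cluster/pending_tasks` and `/_nodes/stats/os,jvm` and flags nodes whose CPU or heap usage looks high. The helper names, the endpoint you pass in, and the 90% / 85% thresholds are my own assumptions, and authentication and TLS setup for the security plugin are left out.

```python
import json
import urllib.request


def fetch_json(base_url, path):
    """GET base_url + path and decode the JSON body (no auth/TLS handling shown)."""
    with urllib.request.urlopen(base_url + path, timeout=10) as resp:
        return json.load(resp)


def overloaded_nodes(node_stats, cpu_limit=90, heap_limit=85):
    """Return sorted names of nodes at or above the (assumed) CPU or heap limits."""
    hot = []
    for node in node_stats.get("nodes", {}).values():
        cpu = node.get("os", {}).get("cpu", {}).get("percent", 0)
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent", 0)
        if cpu >= cpu_limit or heap >= heap_limit:
            hot.append(node.get("name", "unknown"))
    return sorted(hot)


def report(base_url):
    """Print the pending task count and any overloaded nodes for one cluster."""
    pending = fetch_json(base_url, "/_cluster/pending_tasks")
    print("pending tasks:", len(pending.get("tasks", [])))
    stats = fetch_json(base_url, "/_nodes/stats/os,jvm")
    print("overloaded nodes:", overloaded_nodes(stats))
```

You would call `report(...)` with your own cluster URL. A pending task list that keeps growing usually points at a master node that cannot keep up.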
We turned off Logstash ~9 hours ago, so I don’t think there’s much else adding to the traffic.
I can check, but I believe our Cluster State and Stats APIs will be blocked by the OpenDistro Security Plugin issue (that happened with a basic GET earlier today).
@zengyan-amazon I gave my max replies in a day so I’m editing this comment.
We ended up restoring our indexes from snapshot and lost our tenants. Is it possible to restore our .kibana tenant indexes from snapshot? There is very little documentation on this, so we’ll take any recommendation you’ve got.
@everbeck32 If there is no traffic to the cluster and it is still unstable, consider restarting all nodes in the cluster. You may want to wait for a while so that a majority of the nodes can join the cluster; the .opendistro_security index can then be recovered, and the security plugin on each node can be initialized.
If that doesn’t solve the issue, I would suggest rebuilding the cluster and restoring from snapshot.
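On the "wait for the nodes to join" part of a full restart, a small sketch of how that wait could be expressed with the cluster health API, which accepts `wait_for_nodes` and `timeout` parameters. The function names and the node count are placeholders, and the call may still be rejected until the security plugin has initialized.

```python
from urllib.parse import urlencode


def health_wait_path(min_nodes, timeout="60s"):
    """Build a cluster health path that blocks until at least min_nodes have joined."""
    query = urlencode({"wait_for_nodes": f">={min_nodes}", "timeout": timeout})
    return "/_cluster/health?" + query


def enough_nodes_joined(health, min_nodes):
    """Interpret a cluster health response: did the wait finish with enough nodes?"""
    return (not health.get("timed_out", True)
            and health.get("number_of_nodes", 0) >= min_nodes)
```

The path can be requested with any HTTP client; if the response has `timed_out` set to true, the expected number of nodes did not join within the timeout.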
We tried this just before you suggested it, but we have a process that turns them back on, so we did a full system restart. We have to get another team to turn that restart feature off… lol
If I understand correctly, your cluster was running with dedicated master nodes, and the master nodes suddenly started going down. Now you have lost all three master nodes.
Once the master nodes are down, the cluster becomes completely inaccessible, as there is no master node to perform management tasks.
Once you get your master up and running, it will not complain about the missing .opendistro_security index (the same is true for any other index).
Tenant indices are just ordinary ES indices; you can restore them from snapshot just like other indices. Please note that your tenant entities are stored in the .opendistro_security index, so you will need to restore that index to see your tenants.
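As a rough illustration, this is what a restore request for those indices might look like with the standard snapshot restore API (`POST /_snapshot/<repository>/<snapshot>/_restore`). The repository name, snapshot name, and the `.kibana*` pattern are placeholders I am assuming, so check them against what is actually in your snapshot. Indices with the same name that are already open in the cluster have to be closed or deleted before they can be restored, and the security plugin may limit who is allowed to restore `.opendistro_security`.

```python
import json


def restore_request(repository, snapshot, indices):
    """Build the path and JSON body for a snapshot restore of the given indices."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Comma-separated index names or patterns to restore from the snapshot.
        "indices": ",".join(indices),
        # Restore only the listed indices, not cluster-wide settings and templates.
        "include_global_state": False,
    }
    return path, json.dumps(body)


# Placeholder names: substitute your own repository and snapshot.
path, body = restore_request("my_repo", "snapshot_1",
                             [".kibana*", ".opendistro_security"])
print("POST", path)
print(body)
```

The printed path and body can be sent with whatever HTTP client and credentials you normally use against the cluster.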