Failed Transition


Thanks for responding. I saw this same behavior in two different clusters over the weekend. Both have the same structure: 3 master nodes, 2 data nodes, and 2 client nodes. Both were deployed via the Helm chart and both run on Kubernetes. One Kubernetes cluster (Cluster A) is composed of 7 nodes and the other (Cluster B) is composed of 5 nodes. The Kubernetes nodes themselves are VMs running in OpenStack, each configured with 8 vCPUs and 32GB RAM.

The indices are configured with 1 shard and 1 replica; all are small. I suspect that the current configuration is “sub-optimal” but I’m still in prototyping mode and waiting for our larger application efforts to settle down to where I can get a more realistic sense of data volumes and user activities before tackling optimization.

I manually hit the RETRY POLICY button in Kibana for the indices that failed on Cluster B; so there are currently no indices in a failed state there. However, on Cluster A, I only did that for one of the failed indices. I’ve included the _cat/shards output for this cluster below.
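Incidentally, from my reading of the docs, the same retry can be done through the ISM API instead of the Kibana button (treat this as my understanding of the API, not something I've exhaustively tested):

```
POST _opendistro/_ism/retry/<index>
```

where `<index>` is the failed index name (index patterns appear to be accepted as well).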

The indices currently showing with a “failed” status in the IM plugin’s “Managed Indices” page are:


The plug-in’s screens truncate the index names before the date portion, so I may be mistaken about which ones are affected. But based on the order they appear when I sort by name, I believe that’s a correct list. I’m not sure why the “moss” index was affected, since it had not passed the 3-day mark.

By the way, I would suggest the UI be tweaked to either not truncate the index name or, at least, show the full index name in a pop-up on mouse-over. Being able to filter the managed indices by status would also be helpful.

As I mentioned in my original post, the policy is fairly simple: after 3 days, move the index from “hot” to “doomed” stage and the only action w/in the “doomed” stage is to delete the index. All of the indices that start with kubernetes_cluster-* have this same ISM policy assigned to them.
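For reference, the policy is along these lines (reconstructed from memory, so field values are approximate):

```json
{
  "policy": {
    "description": "Delete kubernetes_cluster-* indices after 3 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          {
            "state_name": "doomed",
            "conditions": { "min_index_age": "3d" }
          }
        ]
      },
      {
        "name": "doomed",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ]
  }
}
```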

I was just re-reading the doc and see that the description of the ACTIONS mentions that both retries and timeout are user-configurable. Your note suggests that these are hard-coded. Is user-configurability a newly introduced feature, or is the documentation mistaken (or prematurely updated)?

Thanks again for your assistance.

.kibana_1                                               0 p STARTED      21  65.1kb odfe-opendistro-es-data-1
.kibana_1                                               0 r STARTED      21  65.1kb odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-23             0 r STARTED  884627   487mb odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-23             0 p STARTED  864190 610.7mb odfe-opendistro-es-data-0
.opendistro-ism-config                                  0 r STARTED      31  77.3kb odfe-opendistro-es-data-2
.opendistro-ism-config                                  0 p STARTED      31  69.5kb odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.22-000004 0 r STARTED      14  31.7kb odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.22-000004 0 p STARTED      14  31.7kb odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-21                0 r STARTED   89356  19.1mb odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-21                0 p STARTED   89356    19mb odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.18-1      0 p STARTED      26  29.5kb odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.18-1      0 r STARTED      26  29.5kb odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-21                   0 p STARTED  746472 397.4mb odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-21                   0 r STARTED  746472 402.1mb odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 r STARTED     629 621.2kb odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 p STARTED     629 604.6kb odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-22              0 p STARTED     628 580.1kb odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-22              0 r STARTED     628 713.4kb odfe-opendistro-es-data-1
security-auditlog-2020.03.21                            0 p STARTED      30  56.5kb odfe-opendistro-es-data-2
security-auditlog-2020.03.21                            0 r STARTED      30  56.5kb odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-22              0 p STARTED     394 294.9kb odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-22              0 r STARTED     394   295kb odfe-opendistro-es-data-1
.opendistro_security                                    0 p STARTED       6  32.8kb odfe-opendistro-es-data-2
.opendistro_security                                    0 r STARTED       6  32.8kb odfe-opendistro-es-data-1
.opendistro_security                                    0 r STARTED       6    19kb odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-22             0 p STARTED 1008703 550.1mb odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-22             0 r STARTED 1008703 550.2mb odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-23                0 p STARTED   62702    15mb odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-23                0 r STARTED   69373  15.9mb odfe-opendistro-es-data-1
kubernetes_cluster-jackson-2020-03-22                   0 p STARTED  746460 401.9mb odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-22                   0 r STARTED  746460 397.9mb odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-20                      0 p STARTED  749560 396.3mb odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-20                      0 r STARTED  749560 396.7mb odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 p STARTED 1008778 551.3mb odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 r STARTED 1008778 551.7mb odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-23              0 r STARTED     564 665.8kb odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-23              0 p STARTED     564 604.7kb odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-23                   0 r STARTED  636140 404.8mb odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-23                   0 p STARTED  657803 355.3mb odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-21              0 r STARTED     394 294.3kb odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-21              0 p STARTED     394 294.3kb odfe-opendistro-es-data-0
security-auditlog-2020.03.18                            0 r STARTED       6  78.3kb odfe-opendistro-es-data-2
security-auditlog-2020.03.18                            0 p STARTED       6  78.3kb odfe-opendistro-es-data-1
security-auditlog-2020.03.20                            0 p STARTED       7  81.7kb odfe-opendistro-es-data-2
security-auditlog-2020.03.20                            0 r STARTED       7  81.7kb odfe-opendistro-es-data-0
.opendistro-job-scheduler-lock                          0 p STARTED      21   2.6mb odfe-opendistro-es-data-1
.opendistro-job-scheduler-lock                          0 r STARTED      21   3.1mb odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-22                      0 r STARTED  749847   398mb odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-22                      0 p STARTED  749847 395.6mb odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.21-000003 0 r STARTED      20  33.6kb odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.21-000003 0 p STARTED      20  33.6kb odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-19                0 p STARTED   89206  18.9mb odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-19                0 r STARTED   89206  18.6mb odfe-opendistro-es-data-1
security-auditlog-2020.03.23                            0 p STARTED      15  83.9kb odfe-opendistro-es-data-2
security-auditlog-2020.03.23                            0 r STARTED      15    84kb odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 p STARTED   89324  18.7mb odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 r STARTED   89324    19mb odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-23                      0 p STARTED  649696 418.5mb odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-23                      0 r STARTED  648895 498.8mb odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 p STARTED       7   9.8kb odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 r STARTED       7   9.8kb odfe-opendistro-es-data-0
viya_ops-2020.03.23                                     0 p STARTED   37815  14.3mb odfe-opendistro-es-data-1
viya_ops-2020.03.23                                     0 r STARTED   37919  14.4mb odfe-opendistro-es-data-0
security-auditlog-2020.03.19                            0 r STARTED      34 117.5kb odfe-opendistro-es-data-2
security-auditlog-2020.03.19                            0 p STARTED      34 117.4kb odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-21                      0 p STARTED  749696 398.2mb odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-21                      0 r STARTED  749696 398.4mb odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-23              0 p STARTED     588 438.6kb odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-23              0 r STARTED     588 438.6kb odfe-opendistro-es-data-1
security-auditlog-2020.03.22                            0 p STARTED      20  27.3kb odfe-opendistro-es-data-2
security-auditlog-2020.03.22                            0 r STARTED      20  27.4kb odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.23-000005 0 p STARTED       3    18kb odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.23-000005 0 r STARTED       3    18kb odfe-opendistro-es-data-0

Hi @GSmith,

Thanks for the information. We will fix the issue with the Kibana UI plugin truncating index names.

Regarding your “moss” index being affected even though it had not passed the 3-day mark: I’m assuming it does not say that index is in the doomed stage. Most likely it just had a timeout while attempting to check if it had any work to do and entered that failed state too.

To give a brief background on what’s happening internally:

The ISM plugin runs “jobs” on top of another of our plugins called Job Scheduler. We refer to the jobs in ISM as Managed Indices, and right now they run on a pretty simple/dumb interval, i.e. run this job every 5 minutes (the default). Every index you apply a policy to has an associated Managed Index job running every 5 minutes, basically to check whether it has any “work” to do.

For the vast majority of “runs” it will have no work to do, such as when it’s waiting out the 3 days before moving into the “doomed” state. This is one of the improvements we are working on. The issue currently is that every 5 minutes it checks whether it should move into this “doomed” state, and each of these runs has a chance to “fail”. When we say “fail”, it could mean a number of different things, but a few things are common across all runs.

Because we currently do not have any distinction between which actions are idempotent/safe and which are dangerous, we treat them all as requiring strict safety.

What that means is every time one of these Managed Index jobs runs, we first take a lock (that we store in a document) for that index. This is to ensure no other plugin running on another node attempts to run the same job for the same index (such as when a new node joins the cluster or one temporarily drops).

Once we have that lock and all our other checks pass (such as confirming the cluster is healthy before doing any work), we also change the state of the Managed Index to “STARTING”. We store this in the master cluster state. Once we have acknowledgement from the master node, we attempt to do the work. After that we set the state of the Managed Index back to something other than “STARTING” (such as COMPLETED, NOOP, FAILED, etc.). We have a hardcoded 3-retry exponential backoff policy for these updates. Once we get acknowledgement of that update, we are finished.

Now if any future execution runs into a current status of “STARTING” before it has updated it itself, then we know we attempted to do some work and failed to update the final status. Because we don’t know what that final update was (COMPLETED vs NOOP vs FAILED), we can’t assume one or the other, and we enter this “failed” state to let the user make the choice of retrying or not. From a user’s point of view, you generally shouldn’t have to care about this and everything should just work. We are working on making this more resilient to random failures such as network issues, timeouts, etc. and already have a lot of ideas in place to improve this.
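The lifecycle above can be sketched roughly like this (illustrative pseudocode only, not the actual plugin internals; all names here are made up for the example):

```python
class StatusStore:
    """Stand-in for the Managed Index status kept in cluster state."""
    def __init__(self):
        self.status = {}

    def get(self, index):
        return self.status.get(index)

    def set(self, index, value):
        self.status[index] = value


def run_job(store, index, do_work, crash_before_final_update=False):
    # If a previous run left the status at STARTING, its outcome is
    # unknown (COMPLETED vs NOOP vs FAILED), so fail safe and let the
    # user decide whether to retry.
    if store.get(index) == "STARTING":
        store.set(index, "FAILED")
        return "FAILED"
    store.set(index, "STARTING")   # acknowledged by master before any work
    result = do_work(index)        # e.g. check transition conditions
    if crash_before_final_update:
        return None                # simulate a timeout/network failure here
    store.set(index, result)       # COMPLETED / NOOP / FAILED
    return result
```

The key point is the window between "STARTING" and the final status update: if the final update is lost, the next run can only observe "STARTING" and must assume the worst.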

This should hopefully give some background as to why your indices are sometimes failing (and failing even if they aren’t actively trying to do something).

One thing you can do, if you don’t require an interval as low as 5 minutes, is change the job_interval cluster setting. This can reduce the chances of these random timeout failures by simply reducing the number of times they can even happen. This does reduce the granularity of some actions; e.g. if you changed it to every 1 hour, an index might not transition to that “doomed” state until 3 days and 59 minutes at worst, if it previously executed at 2 days, 23 hours, and 59 minutes.

As for the action retries/timeouts, those are a bit different from the internal retry timeouts we just talked about. As an example, if for whatever reason your Managed Index was executing the “delete” action to delete your index and failed, the retries/timeouts specified in the action configuration allow this to be retried instead of entering a “failed” state. So if it failed the first time because of some networking issue and you had configured the “delete” action with 5 retries, it would continue to retry up to 5 times. The action timeout keeps track of when the first attempt of an action started; if the timeout is ever breached, the action enters the failed state. These unfortunately don’t help with the STARTING issue, because we do not know if it’s even OK to retry from that (something like delete is definitely OK to retry, whereas something like shrink would not be).

Hopefully that clarified the issue you’re facing.

@dbbaughe Thank you very much for the detailed explanation. I look forward to the coming enhancements. I’ll hold off on making any changes unless (until?) I see the problem re-appear. If it does, I’ll try increasing the job_interval setting to once an hour (or maybe once every few hours).

I found the following syntax works (I’m assuming the unit is minutes):

PUT _cluster/settings
{
  "persistent": {
    "opendistro.index_state_management.job_interval": 60
  }
}
Can you clarify the difference between persistent and transient cluster settings?

Other than the Index Management jobs, are there other jobs (or back-end processes) that are affected by this setting?

Thanks again for your assistance.

Hi @GSmith,

Regarding persistent vs transient:

Updates to settings can be persistent, meaning they apply across restarts, or transient, meaning they don’t survive a full cluster restart.
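For example, the transient form of the same update you made above would look like this, and would be reset by a full cluster restart:

```
PUT _cluster/settings
{
  "transient": {
    "opendistro.index_state_management.job_interval": 60
  }
}
```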

And that specific setting will only apply to the jobs running in Index State Management which is only the Managed Indices. Any other jobs that have configurable job_interval settings are configurable through their own namespaced settings.

@dbbaughe Thanks for clarification!

I’ve run into a problem when retrying the policy…but will start a new thread for that.


The above comments are very interesting as I am facing a similar issue. Can I please first summarise what I’ve understood to ensure I am thinking about things correctly and then I will go onto my problem.

My understanding is that the message “Previous action was not able to update IndexMetaData” is indicating that the outcome of the last operation that was performed on the index is unknown. The operation may have succeeded or failed but the only thing that’s known is that something (e.g. a timeout) prevented the metadata being written. When an index policy is in this state then it will no longer transition automatically.

My situation is, I have just moved from a single massive vanilla Elasticsearch cluster to a few smaller (OpenDistro) clusters and I’m seeing issues where transitions aren’t occurring and it’s serious enough that if left unchecked it could cause real problems.

Each cluster has ~1.5k indices and ingests a few TB of data a day. Some indices are very quiet while others are very busy. Indices have between 1 and 20 primary shards; however, I’ve got ISM configured so that when an index rolls over the shards should all be around 30GB, i.e. a 1-shard index rolls at 30GB and a 10-shard one at 300GB.
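The rollover action in each policy looks roughly like this (a sketch from memory; as I understand it, min_size is the index’s total primary store size, which is why I scale it by shard count):

```json
"actions": [
  {
    "rollover": {
      "min_size": "300gb"
    }
  }
]
```

So a 10-shard index gets "300gb" and a 1-shard index gets "30gb".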

What I’ve been finding is that some indices are not rolling over, and the most common message I’m seeing is “Previous action was not able to update IndexMetaData”. I have been manually rolling these over using the API described here. I’ve left some for a while to see if they would eventually roll over, but they haven’t, even when they were 3x the rollover size.

Manually retrying isn’t going to work for us due to the number of indices, and we can’t just leave things because it wouldn’t take long for some of the shards to fill a disk. At the moment I’m looking for advice on how we resolve the issue.

Do you think the clusters are too large and this (for some reason) is causing timeouts when writing the metadata? I’ve run into another issue, posted here, so I’m not sure whether clusters of this size have been tested against much. If you think this is likely to be the case, what is the maximum size you would recommend for a cluster?

Sorry for the long post :slight_smile:

Hi @govule,

First, thanks for using ISM and providing feedback.

Your understanding is correct regarding the message.

We have multiple improvements coming along to help with this.

  1. Before, every execution would always do this transaction when executing the action/step (by transaction I mean updating Step.Status to STARTING and then recording the result after executing). A lot of the steps are idempotent, meaning it’s completely fine to just do the step again without worrying about what happened before. This PR adds that change and will be available in future releases.
  2. The above fix helps a lot with things like delete, transition, etc. where the metadata update times out and the job can’t proceed. But we’re still doing a lot of updates to the cluster state when they shouldn’t be needed in the first place. So we’re currently adding improvements to only use the Step.Status transactions when the step actually needs that guarantee (is not idempotent) and only if there is actionable work to be done (i.e. rollover conditions evaluating to false doesn’t need this transaction during that execution, because it’s not going to call rollover). This should drastically reduce the number of updates to the cluster state, which lives on the master node and applies updates sequentially in a single thread.
  3. Making the timeout on the cluster state updates from ISM configurable. Currently they use the default of 30 seconds.
  4. Looking into potentially moving the metadata out of the cluster state which is bottlenecked by the sequential single threaded nature and into the ISM index which should scale a lot better for large clusters.
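Improvements #1 and #2 above amount to roughly the following (an illustrative sketch, not our actual implementation; the step representation here is invented for the example):

```python
def run_step(store, index, step):
    """Only pay for the STARTING bookkeeping when the step is
    non-idempotent AND actually has work to do."""
    if step["idempotent"] or not step["has_work"](index):
        # Safe to just re-run (or nothing to do): skip the
        # cluster-state status transaction entirely.
        return step["execute"](index)
    store[index] = "STARTING"       # transaction only when it's needed
    result = step["execute"](index)
    store[index] = result
    return result
```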

While these are all being done, what you can do to help with these spurious timeouts is increase the job_interval setting as mentioned in the replies above. This should reduce the number of cluster state updates going out and also reduce the number of times a timeout could even occur. Obviously that is a short-term fix, though, and hopefully we can get these other fixes/improvements out fast enough for you.

Thanks very much for the feedback @dbbaughe.

Point #4 sounds particularly interesting. Do you currently have a public ticket that I can follow for this?

I will be sure to update the job_interval setting, thanks for the tip :+1:

Hey @govule,

#4 has mainly been hallway discussions as we discuss scaling to tens of thousands of indices.

#1 is finished and #2 and #3 are being worked on right now. #2 should definitely alleviate a lot of this pain (except for the situation where you apply a policy to thousands of indices at once, that initial bootstrap of each one individually will initialize metadata in the cluster state per job).

We will look into prioritizing #4 though, just requires a lot more careful thought and testing.

I created an issue here for you: Performance: Look into moving ManagedIndexMetaData from cluster state into index · Issue #207 · opendistro-for-elasticsearch/index-management · GitHub

Great. Thanks very much for this, I’m looking forward to trying out the updates as they come out

@dbbaughe Do the index management plugin improvements released with ODFE 1.7 eliminate the need to increase the job_interval setting? We need to add some “smarts” to our indexing policies to have them roll over based on size as well as time. I’m worried that running the job every 2 hours will increase the chances that our size-based rollover happens ~2 hours too late.

Hi @GSmith,

The improvements released in ODFE 1.7 should help a lot with the type of issues faced before.

You should definitely be able to lower the job_interval. It will depend on what type of issues you faced, though; e.g. if it was issues with the transition failing, I would expect that to be 100% eliminated, as we have changed the logic to always retry transitions.

The other big issue was having too many jobs running at once, competing with each other in the master node’s pending tasks queue. All the ISM-type tasks in the queue are now batched together whenever the master node applies one of them. I did some light load performance tests on an empty cluster (10 data nodes, 3 dedicated master nodes): with a 30-minute job interval we could only run ~1k jobs concurrently before anything over that timed out, vs a 1-minute job interval running 5k jobs concurrently with no issue at all besides the increase in CPU.

With all that said, I believe you should be able to reduce the job interval to whatever you want at this stage, and if any issues pop up just let us know and we can see what the issue is and how to make it more reliable. The goal is that you should be able to just set this and forget it exists :wink:

Thanks @dbbaughe. Sounds like some significant improvements that should allow me to increase execution frequency as-needed, and/or maybe even just leave it at the default, out-of-the-box setting. Thanks again.

Hello @dbbaughe

We run 1660 indices on 6 data nodes with 3 dedicated master nodes and 3 ingest nodes.
When I apply a new policy to all the indices, it runs on about 150 of them and then all the remaining tasks fail. I retry the failed ones and get another ~150 applied, and so on… The resources on the servers aren’t used that much…

This would be OK, but I have a second issue. The last time I retried all the failed ones, it launched on the 600 remaining but has now been stuck for more than 24h with the retry message… The only way I’ve found to cancel the action is to remove the policy and reapply it.

Is this a known issue, and is there another way to cancel the retry action than removing the policy from all the indices?

Hey @Franckiboy,

Which version of ES and the plugin are you using?

Hello @dbbaughe
Thanks for the reply!
I currently use the 1.10.1 Docker version of the Open Distro Index Management plugin.

Thanks @Franckiboy,

When you say all the remaining tasks fail, do you mean you apply the policy to all 1660 indices and 150 of them run and the remaining 1510 fail?

Could you post a copy of your policy you are applying and also what the failure message is?


Hello @dbbaughe
Here is my current policy:

"policy": {
    "policy_id": "Data cycle 7d SSD 180d HDD Delete",
    "description": "A Policy to move indices to slower storage after 7 days and then delete after 180days",
    "last_updated_time": 1603719735639,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [
                {
                    "index_priority": {
                        "priority": 100
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "warm",
                    "conditions": {
                        "min_index_age": "7d"
                    }
                }
            ]
        },
        {
            "name": "warm",
            "actions": [
                {
                    "allocation": {
                        "require": {
                            "temp": "warm"
                        },
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "delete",
                    "conditions": {
                        "min_index_age": "180d"
                    }
                }
            ]
        },
        {
            "name": "delete",
            "actions": [
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}
The failure is one of 3 things:
1 - Stuck at “Initializing” and never resolves; can’t retry, need to remove the policy from the index and add it again

2 - Fails in the transition step; retry will work (again, around 150 succeed on each retry until all are done)

3 - Fails in the allocation step; retry will work (this was the rarest form, only a few)

The error was very generic, stating it was retry 0.
In the logs I saw a lot of timeouts when the tasks failed.

Hey @Franckiboy,

Thanks for the info. This isn’t something we have seen before so any extra information would be helpful.
If it says “Initializing” it just means it hasn’t really run yet… so we don’t really know if it’s stuck or just hasn’t been scheduled by the job scheduler to run.

When you say it fails in the transition step… could you post the actual error message? I don’t think we actually fail in this step, so this is a bit confusing. Even if the update to metadata fails, it would just ignore it and try again.

Any other useful logs you might see in the elasticsearch.log file?

Could you also do a search on the .opendistro-ism-config index and verify how many managed index config documents you have? Just to confirm how many were created.
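For example, something along these lines should give a count (my assumption here is that the managed index configs are the documents with a managed_index field):

```
GET .opendistro-ism-config/_search
{
  "size": 0,
  "query": {
    "exists": { "field": "managed_index" }
  }
}
```

The hits.total in the response would tell us how many managed index jobs were created.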

Hello @dbbaughe
Here is the log; I’m sending you a sample of the moment it transitions to an error state.
It ran for the first 157 and then failed the other 31 with the same error, as follows:

[2020-11-03T13:27:44,264][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.07.29
[2020-11-03T13:27:44,675][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.09.25
[2020-11-03T13:27:44,868][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.09.14
[2020-11-03T13:27:45,435][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.10.01
[2020-11-03T13:27:45,608][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.05.14
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.07.07]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.07.07/WLzGitPFSAKFMS8UhdJFCA]]) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0( ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.ArrayList.forEach( ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1( ~[elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$ ~[?:?]
at ~[?:?]
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.05.17]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.05.17/kEncU1lDRCOPAnEnWxNhJQ]]) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0( ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.ArrayList.forEach( ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1( ~[elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$ ~[?:?]
at ~[?:?]
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.10.07]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.10.07/Ru6iFDF5TYeR1fJT7NsKDw]]) within 30s

The error in the GUI is:
"cause": "failed to process cluster event (update-settings [[winlogbeat-test1-2020.10.07/Ru6iFDF5TYeR1fJT7NsKDw]]) within 30s",
"message": "Failed to set index priority to 100 [index=winlogbeat-test1-2020.10.07]"

For the ones stuck in initializing, I waited for more than 2 days to see if they would resolve in some way, but they never did.