Failed Transition

Hey @Franckiboy,

Thanks for the info. This isn’t something we have seen before so any extra information would be helpful.
If it says “Initializing” it just means it hasn’t really run yet… so we don’t really know if it’s stuck or just hasn’t been scheduled by the job scheduler to run.

When you say Fail in the transition step… could you post the actual error message? I don’t think we actually fail in this step so this is a bit confusing. Even if the update to metadata fails it would just ignore it and try again.

Any other useful logs you might see in the elasticearch.log file?

Could you also do a search on the .opendistro-ism-config index and verify how many managed index config documents you have? Just to confirm how many were created.

Hello @dbbaughe
Here is the log, I send you a sample of the moment it transition to an error state.
It run for the the first 157 and then fail the other 31 with the same error as follow :

[2020-11-03T13:27:44,264][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.07.29
[2020-11-03T13:27:44,675][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.09.25
[2020-11-03T13:27:44,868][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.09.14
[2020-11-03T13:27:45,435][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.10.01
[2020-11-03T13:27:45,608][INFO ][c.a.o.i.i.ManagedIndexRunner] [ODE-DATA-005] Finished executing attempt_set_index_priority for winlogbeat-test1-2020.05.14
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.07.07]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.07.07/WLzGitPFSAKFMS8UhdJFCA]]) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.ArrayList.forEach(ArrayList.java:1510) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) ~[elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) ~[?:?]
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.05.17]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.05.17/kEncU1lDRCOPAnEnWxNhJQ]]) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.ArrayList.forEach(ArrayList.java:1510) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) ~[elasticsearch-7.9.1.jar:7.9.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) ~[elasticsearch-7.9.1.jar:7.9.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:832) ~[?:?]
[2020-11-03T13:27:45,752][ERROR][c.a.o.i.i.s.i.AttemptSetIndexPriorityStep] [ODE-DATA-005] Failed to set index priority to 100 [index=winlogbeat-test1-2020.10.07]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[winlogbeat-test1-2020.10.07/Ru6iFDF5TYeR1fJT7NsKDw]]) within 30s

The error in the gui is
{
“cause”: “failed to process cluster event (update-settings [[winlogbeat-test1-2020.10.07/Ru6iFDF5TYeR1fJT7NsKDw]]) within 30s”,
“message”: “Failed to set index priority to 100 [index=winlogbeat-test1-2020.10.07]”
}

For the one stuck in initializing, I tried to wait for more than 2 days to see if they will resolve in some way but it never resolved.

Hello @dbbaughe, do you need more/other logs?

Hey @Franckiboy,

For the failure posted in the log above, it seems like your pending queue is potentially getting a bit full/backed up if it’s failing ti process it within 30s. One thing I’d check is what are the tasks that are holding up the queue/taking a long time which you should be able to check with the pending tasks API.

One thing you can definitely do though is configure a retry on these type of failures so you don’t have to do a manual retry. You can see how to configure it here:

As for other logs – yes, do you have any logs of the transition failure which you mentioned in #2?

As for the job not starting at all… wondering if the initial metadata creation is stuck in the pending task queue or just not getting through since there are some timeouts in there. It’s hard to guess, but we plan to add some APIs to help with this. I think first thing is to check that pending task queue.

Hello @dbbaughe
Sorry for the delay!

I tried to see the pending task and it’s always empty when I check with
GET /_cluster/pending_tasks
and
GET /_cat/pending_tasks

I see some task with the GET _tasks
But nothing that seem to be explaining the issue.

Is it possible the transition step just take too much time to resolve since I don’t seem to have any issue with the policy once the transition phase is done?

For the log for the transition fail, all I got was what I’ve post on the previous post.
The error in the gui is

{
“cause”: “failed to process cluster event (update-settings [[winlogbeat-test1-2020.10.07/Ru6iFDF5TYeR1fJT7NsKDw]]) within 30s”,
“message”: “Failed to set index priority to 100 [index=winlogbeat-test1-2020.10.07]”
}

If you mean the log for the #3, allocation step, I wasn’t able to reproduce the issue but all the data is now on the right server and I dont have the disk space to move them as I did at first.

Thanks for your time it’s appreciated!

Francis