Failed Transition

I have a simple index management policy defined that deletes an index after 3 days. Over the weekend, I’ve noticed that this policy failed for a small number of the indexes to which it applied but worked for several others. In the Kibana Index Management panel, on the Managed Indices page, the status for the indices for which the policy failed is shown as “failed”. When I click on the link in the INFO column, the dialog displayed shows “message”: “Previous action was not able to update IndexMetaData.”

Looking through the Elasticsearch log messages, I see clusters of messages like the following, which I think correspond to the failed transition attempts:

[2020-03-23T01:16:13,519][WARN ][c.a.o.i.ManagedIndexRunner] [odfe-opendistro-es-data-2] Operation failed. Retrying in 250ms.
org.elasticsearch.transport.RemoteTransportException: [odfe-opendistro-es-master-1][10.254.2.161:9300][cluster:admin/ism/update/managedindexmetadata]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (opendistro-ism) within 30s
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:143) ~[elasticsearch-7.4.2.jar:7.4.2]
    at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:142) ~[elasticsearch-7.4.2.jar:7.4.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.4.2.jar:7.4.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:835) ~[?:?]

Is it possible to increase the time allotted to process the cluster event, since the current 30-second limit doesn’t appear to be sufficient?

Will the Index Management plugin continue trying? The log message indicates that it will, and I see multiple clusters of these messages, but some of the indexes remain. So, does it “give up” at some point, or should I assume it will eventually “catch up”?

The list of indices shown in the plugin truncates the Index name field, so it is hard for me to be sure which index is which, but the affected indexes appear to be “small”, i.e. a few hundred MB, so it’s unclear why the request would be timing out. Is there additional logging that can be enabled that might be helpful?

I have been successful with manually retrying the action by clicking the RETRY POLICY button on the Managed Indices page of the IM plugin. But, obviously, I defined a policy so this would all be automated.
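(In case it’s useful: I believe the same retry can also be triggered without the Kibana button via the ISM retry API, something like the following, with the index name as a placeholder.)

POST _opendistro/_ism/retry/my-index-2020-03-19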

Can anyone offer some helpful advice/guidance?

Thanks!

Hi @GSmith,

Just to get some more information, could you let us know the setup of your cluster (i.e. number of nodes, indices, and shards)? And is the failure only during the transition part, or also while executing any of the actions?

That error appears to be from the runner attempting to do some work and not being able to successfully update the IndexMetaData after said work is done, because of that 30-second timeout. On the next execution the runner doesn’t know whether it was able to complete the previous work, so it fails itself in case the current work to be executed is not idempotent. We currently do not make that timeout for the update call configurable, but it could be something we add. And no, the plugin won’t keep trying indefinitely, which is why you eventually see it in the failed state. It retries up to 3 times, which is also not currently configurable, but something we could open up as a setting.

We do have some thoughts to improve this significantly, but will need some time to implement them. It would be helpful if you could let us know at which part your indices are failing though.

Thanks

Drew:

Thanks for responding. I saw this same behavior in two different clusters over the weekend. Both have the same structure: 3 master nodes, 2 data nodes and 2 client nodes. Both clusters were deployed via Helm chart and both are running on Kubernetes clusters. One Kubernetes cluster (Cluster A) is composed of 7 nodes and the other (Cluster B) is composed of 5 nodes. The Kubernetes nodes themselves are VMs running in OpenStack, each configured with 8 vCPUs and 32 GB of RAM.

The indices are configured with 1 shard and 1 replica; all are small. I suspect that the current configuration is “sub-optimal” but I’m still in prototyping mode and waiting for our larger application efforts to settle down to where I can get a more realistic sense of data volumes and user activities before tackling optimization.

I manually hit the RETRY POLICY button in Kibana for the indices that failed on Cluster B; so there are currently no indices in a failed state there. However, on Cluster A, I only did that for one of the failed indices. I’ve included the _cat/shards output for this cluster below.

The indices currently showing with a “failed” status in the IM plugin’s “Managed Indices” page are:

kubernetes_cluster-monitoring-2020-03-19
kubernetes_cluster-moss-2020-03-20
kubernetes_cluster-moss-2020-03-22

The plug-in’s screens truncate the index names before the date portion, so I may be mistaken about which ones are affected. But, based on the order they appear when I sort by name, I believe that’s a correct list. I’m not sure why the “moss” index was affected since it had not passed the 3 day mark.

By the way, I would suggest the UI be tweaked to either not truncate the index name or, at least, show the full index name in a pop-up as you mouse over the name. Being able to filter the managed indexes by status would also be helpful.

As I mentioned in my original post, the policy is fairly simple: after 3 days, move the index from the “hot” stage to the “doomed” stage, and the only action within the “doomed” stage is to delete the index. All of the indices that start with kubernetes_cluster-* have this same ISM policy assigned to them.
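For reference, the policy is roughly the following (a sketch from memory, with the metadata fields trimmed; the important parts are the 3-day transition and the delete action):

"policy": {
    "description": "Delete indices after 3 days",
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [],
            "transitions": [
                {
                    "state_name": "doomed",
                    "conditions": {
                        "min_index_age": "3d"
                    }
                }
            ]
        },
        {
            "name": "doomed",
            "actions": [
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}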

I was just re-reading the doc and see that, in the description of the ACTIONS, it does mention that both retries and timeout are user-configurable. Your note suggests that these are hard-coded. Is user-configurability a newly introduced feature, or is the documentation mistaken (or prematurely updated)?

Thanks again for your assistance.
Greg

.kibana_1                                               0 p STARTED      21  65.1kb 10.254.1.151 odfe-opendistro-es-data-1
.kibana_1                                               0 r STARTED      21  65.1kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-23             0 r STARTED  884627   487mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-23             0 p STARTED  864190 610.7mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-config                                  0 r STARTED      31  77.3kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-config                                  0 p STARTED      31  69.5kb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.22-000004 0 r STARTED      14  31.7kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.22-000004 0 p STARTED      14  31.7kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-21                0 r STARTED   89356  19.1mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-21                0 p STARTED   89356    19mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.18-1      0 p STARTED      26  29.5kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.18-1      0 r STARTED      26  29.5kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-21                   0 p STARTED  746472 397.4mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-21                   0 r STARTED  746472 402.1mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 r STARTED     629 621.2kb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-istio-system-2020-03-21              0 p STARTED     629 604.6kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-22              0 p STARTED     628 580.1kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-22              0 r STARTED     628 713.4kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.21                            0 p STARTED      30  56.5kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.21                            0 r STARTED      30  56.5kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-22              0 p STARTED     394 294.9kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-22              0 r STARTED     394   295kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro_security                                    0 p STARTED       6  32.8kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro_security                                    0 r STARTED       6  32.8kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro_security                                    0 r STARTED       6    19kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-ingress-nginx-2020-03-22             0 p STARTED 1008703 550.1mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-ingress-nginx-2020-03-22             0 r STARTED 1008703 550.2mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-23                0 p STARTED   62702    15mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-23                0 r STARTED   69373  15.9mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-jackson-2020-03-22                   0 p STARTED  746460 401.9mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-22                   0 r STARTED  746460 397.9mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-20                      0 p STARTED  749560 396.3mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-20                      0 r STARTED  749560 396.7mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 p STARTED 1008778 551.3mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-ingress-nginx-2020-03-21             0 r STARTED 1008778 551.7mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-istio-system-2020-03-23              0 r STARTED     564 665.8kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-istio-system-2020-03-23              0 p STARTED     564 604.7kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-jackson-2020-03-23                   0 r STARTED  636140 404.8mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-jackson-2020-03-23                   0 p STARTED  657803 355.3mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-21              0 r STARTED     394 294.3kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-21              0 p STARTED     394 294.3kb 10.254.2.140 odfe-opendistro-es-data-0
security-auditlog-2020.03.18                            0 r STARTED       6  78.3kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.18                            0 p STARTED       6  78.3kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.20                            0 p STARTED       7  81.7kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.20                            0 r STARTED       7  81.7kb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-job-scheduler-lock                          0 p STARTED      21   2.6mb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-job-scheduler-lock                          0 r STARTED      21   3.1mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-22                      0 r STARTED  749847   398mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-22                      0 p STARTED  749847 395.6mb 10.254.2.140 odfe-opendistro-es-data-0
.opendistro-ism-managed-index-history-2020.03.21-000003 0 r STARTED      20  33.6kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.21-000003 0 p STARTED      20  33.6kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-monitoring-2020-03-19                0 p STARTED   89206  18.9mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-monitoring-2020-03-19                0 r STARTED   89206  18.6mb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.23                            0 p STARTED      15  83.9kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.23                            0 r STARTED      15    84kb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 p STARTED   89324  18.7mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-monitoring-2020-03-22                0 r STARTED   89324    19mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-23                      0 p STARTED  649696 418.5mb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-moss-2020-03-23                      0 r STARTED  648895 498.8mb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 p STARTED       7   9.8kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.20-000002 0 r STARTED       7   9.8kb 10.254.2.140 odfe-opendistro-es-data-0
viya_ops-2020.03.23                                     0 p STARTED   37815  14.3mb 10.254.1.151 odfe-opendistro-es-data-1
viya_ops-2020.03.23                                     0 r STARTED   37919  14.4mb 10.254.2.140 odfe-opendistro-es-data-0
security-auditlog-2020.03.19                            0 r STARTED      34 117.5kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.19                            0 p STARTED      34 117.4kb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-moss-2020-03-21                      0 p STARTED  749696 398.2mb 10.254.1.151 odfe-opendistro-es-data-1
kubernetes_cluster-moss-2020-03-21                      0 r STARTED  749696 398.4mb 10.254.2.140 odfe-opendistro-es-data-0
kubernetes_cluster-cert-manager-2020-03-23              0 p STARTED     588 438.6kb 10.254.3.167 odfe-opendistro-es-data-2
kubernetes_cluster-cert-manager-2020-03-23              0 r STARTED     588 438.6kb 10.254.1.151 odfe-opendistro-es-data-1
security-auditlog-2020.03.22                            0 p STARTED      20  27.3kb 10.254.3.167 odfe-opendistro-es-data-2
security-auditlog-2020.03.22                            0 r STARTED      20  27.4kb 10.254.1.151 odfe-opendistro-es-data-1
.opendistro-ism-managed-index-history-2020.03.23-000005 0 p STARTED       3    18kb 10.254.3.167 odfe-opendistro-es-data-2
.opendistro-ism-managed-index-history-2020.03.23-000005 0 r STARTED       3    18kb 10.254.2.140 odfe-opendistro-es-data-0

Hi @GSmith,

Thanks for the information. We will fix the issue with the Kibana UI plugin truncating index names.

Regarding your “moss” index being affected even though it had not passed the 3-day mark: I’m assuming it does not say that index is in the doomed stage. Most likely it just had a timeout while attempting to check if it had any work to do and entered that failed state too.

To give a brief background on what’s happening internally:

The ISM plugin runs “jobs” on top of one of our other plugins called Job Scheduler. We refer to the jobs in ISM as Managed Indices, and right now they run on a pretty simple/dumb interval, i.e. run this job every 5 minutes (the default). Every index you apply a policy to has an associated Managed Index job that runs on that interval, basically to check whether it has any “work” to do.

For the vast majority of “runs” it will have no work to do, such as when it’s waiting for 3 days to move into a “doomed” state; cutting down these no-op runs is one of the improvements we are working on. The issue currently is that every 5 minutes it’s checking whether it should move into this “doomed” state, and each one of these runs has a chance to “fail”. When we say “fail”, it could mean a number of different things, but a few things are common across all runs.

Because we currently do not have any distinction between which actions are idempotent/safe and which are dangerous, we treat them all as requiring strict safety.

What that means is every time one of these Managed Index jobs runs, we first take a lock (that we store in a document) for that index. This is to ensure no other plugin running on another node attempts to run the same job for the same index (such as when a new node joins the cluster or one temporarily drops).

Once we have that lock and all our other checks pass (such as confirming the cluster is healthy before doing any work), we also change the state of the Managed Index to “STARTING”. We store this in the master cluster state. Once we have acknowledgement from the master node, we attempt to do work. After that we set the state of the Managed Index back to something other than “STARTING” (such as COMPLETED, NOOP, FAILED, etc.). We have a hardcoded 3-retry exponential backoff policy for these updates. And once we get acknowledgement of that update, we are finished.

Now if any future execution ever runs into a current status of “STARTING” before it has updated it itself, then we know we attempted to do some work and failed to update the final status. Because we don’t know what that final update was (COMPLETED vs NOOP vs FAILED), we can’t assume one or the other, so we enter this “failed” state to let the user make the choice of retrying or not. From a user’s point of view, generally you don’t care about this and everything should just work. We are working on making this more resilient to random failures such as network issues, timeouts, etc. and have a lot of ideas already in place to improve this.

This should hopefully give some background as to why your indices are sometimes failing (and failing even if they aren’t actively trying to do something).

One thing you can do, if you don’t require an interval as low as 5 minutes, is to change the job_interval cluster setting. This can reduce the chances of these random timeout failures simply by reducing the number of times they can even happen. It does reduce the granularity of some actions, e.g. if you changed it to every 1 hour, an index might not transition to that “doomed” state until 3 days and 59 minutes at worst, if the previous execution ran at 2 days, 23 hours, and 59 minutes.

As for the action retries/timeouts, those are a bit different from the internal retry timeouts we just talked about. As an example, if for whatever reason your Managed Index was executing the “delete” action to delete your index and failed, the retries/timeouts specified in the action configuration allow this to be retried instead of entering a “failed” state. So perhaps it failed the first time because of some networking issue and you had configured the “delete” action with 5 retries; then it would continue to retry up to 5 times. The action timeout keeps track of when the first attempt of an action started; if the timeout is ever breached, it enters the failed state. These unfortunately don’t help with the STARTING issue, because we do not know if it’s even OK to retry from that (something like delete is definitely OK to retry, whereas something like shrink would not be).
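As a rough sketch (the values are only illustrative), an action-level retry/timeout configuration in a policy looks something like this:

"actions": [
    {
        "timeout": "1h",
        "retry": {
            "count": 5,
            "backoff": "exponential",
            "delay": "10m"
        },
        "delete": {}
    }
]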

Hopefully that clarified the issue you’re facing.

@dbbaughe Thank you very much for the detailed explanation. I look forward to the coming enhancements. I’ll hold off on making any changes unless (until?) I see the problem re-appear. If it does, I’ll try increasing the job_interval setting to once an hour (or maybe once every few hours).

I found the following syntax works (I’m assuming the unit is minutes):

PUT _cluster/settings
{
    "persistent": {
        "opendistro.index_state_management.job_interval": 60
    }
}

Can you clarify the difference between persistent and transient cluster settings?

Other than the Index Management jobs, are there other jobs (or back-end processes) that are affected by this setting?

Thanks again for your assistance.

Hi @GSmith,

Regarding persistent vs transient:

Updates to settings can be persistent, meaning they apply across restarts, or transient, where they don’t survive a full cluster restart.
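For example, the same update applied as a transient setting (which would then be lost on a full cluster restart) would look roughly like this:

PUT _cluster/settings
{
    "transient": {
        "opendistro.index_state_management.job_interval": 60
    }
}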

And that specific setting will only apply to the jobs run by Index State Management, which are the Managed Indices. Any other jobs with a configurable interval expose it through their own namespaced settings.

@dbbaughe Thanks for the clarification!

I’ve run into a problem when retrying the policy…but will start a new thread for that.

@dbbaughe

The above comments are very interesting, as I am facing a similar issue. Can I please first summarise what I’ve understood, to ensure I am thinking about things correctly, and then I will go on to my problem.

My understanding is that the message “Previous action was not able to update IndexMetaData” is indicating that the outcome of the last operation that was performed on the index is unknown. The operation may have succeeded or failed but the only thing that’s known is that something (e.g. a timeout) prevented the metadata being written. When an index policy is in this state then it will no longer transition automatically.

My situation is that I have just moved from a single massive vanilla Elasticsearch cluster to a few smaller (Open Distro) clusters, and I’m seeing issues where transitions aren’t occurring. It’s serious enough that, if left unchecked, it could cause real problems.

Each cluster has ~1.5k indices and ingests a few TB of data a day. Some indices are very quiet while others are very busy. Indices have between 1 and 20 primary shards; however, I’ve got ISM configured so that when an index rolls over, the shards should all be around 30 GB, i.e. a 1-shard index rolls at 30 GB and a 10-shard one at 300 GB.
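For context, the rollover condition in those policies is along these lines (a rough sketch, shown here for a 10-shard index; the real policies have more in them):

"actions": [
    {
        "rollover": {
            "min_size": "300gb"
        }
    }
]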

What I’ve been finding is that some indices are not rolling over, and the most common message I’m seeing is “Previous action was not able to update IndexMetaData”. I have been manually rolling these over using the API described here. I’ve left some for a while to see if they would eventually roll over, but they haven’t, even when they were 3x the rollover size.
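(Concretely, the manual workaround in my case has just been the standard rollover API against the write alias; the alias name below is a placeholder.)

POST my-logs-write/_rollover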

Manually retrying isn’t going to work for us due to the number of indices, and we can’t just leave things because it wouldn’t take long for some of the shards to fill a disk. At the moment I’m looking for advice on how we resolve the issue.

Do you think the clusters are too large and this (for some reason) is causing timeouts when writing the metadata? I’ve run into another issue, posted here, so I’m not sure whether clusters of this size have been tested much. If you think this is likely to be the case, what is the maximum size you would recommend for a cluster?

Sorry for the long post :slight_smile:

Hi @govule,

First, thanks for using ISM and providing feedback.

Your understanding is correct regarding the message.

We have multiple improvements coming along to help with this.

  1. Previously, every execution would always do this transaction when executing the action/step; by transaction I mean updating the step status to STARTING and then writing the result after executing. A lot of the steps are idempotent, meaning it’s completely fine to just do the step again without worrying about what happened before. This PR changes that so idempotent steps skip the transaction, and it will be available in future releases.
  2. The above fix helps a lot with things like delete, transition, etc. timing out on the metadata update and not being able to proceed. But we’re still doing a lot of updates to the cluster state when they shouldn’t be needed in the first place. So we’re currently adding some improvements to only use the Step.Status transactions when the step actually needs that guarantee (is not idempotent) and only if there is actionable work to be done (i.e. rollover conditions → false doesn’t need this transaction during that execution because it’s not going to call rollover). This should drastically reduce the number of updates to the cluster state, which lives on the master node and applies updates sequentially in a single thread.
  3. Making the timeout on the cluster state updates from ISM configurable. Currently they use the default of 30 seconds.
  4. Looking into potentially moving the metadata out of the cluster state (which is bottlenecked by its sequential, single-threaded update model) and into the ISM index, which should scale a lot better for large clusters.

While these are all being done, what you can do to help with these spurious timeouts is to increase the job_interval setting as mentioned in the replies above. This should reduce the number of cluster state updates going out and also reduce the number of times a timeout could occur. Obviously that is a short-term fix though, and hopefully we can get these other fixes/improvements out fast enough for you.

Thanks very much for the feedback @dbbaughe.

Point #4 sounds particularly interesting; do you currently have a public ticket that I can follow for this?

I will be sure to update the job_interval setting, thanks for the tip :+1:

Hey @govule,

#4 has mainly been hallway discussions as we discuss scaling to tens of thousands of indices.

#1 is finished and #2 and #3 are being worked on right now. #2 should definitely alleviate a lot of this pain (except for the situation where you apply a policy to thousands of indices at once, since that initial bootstrap of each one individually will still initialize metadata in the cluster state per job).

We will look into prioritizing #4 though, just requires a lot more careful thought and testing.

I created an issue here for you: Performance: Look into moving ManagedIndexMetaData from cluster state into index · Issue #207 · opendistro-for-elasticsearch/index-management · GitHub

Great. Thanks very much for this; I’m looking forward to trying out the updates as they come out.

@dbbaughe Do the index management plugin improvements released with ODFE 1.7 eliminate the need to increase the job_interval setting? We need to add some “smarts” to our indexing policies to have them roll-over based on size as well as time. I’m worried that running the job every 2 hours will increase the chances that our size roll-over happens ~2 hours too late.

Hi @GSmith,

The improvements released in ODFE 1.7 should help a lot with the type of issues faced before.

You should definitely be able to lower the job_interval. It will depend on which issues you were facing, though; e.g. if the problem was transitions failing, I would expect that to be 100% eliminated, as we have changed the logic to always retry transitions.

The other big issue was having too many jobs running at once, competing with each other in the master node’s pending tasks queue. All the ISM-type tasks in the queue are now batched together whenever the master node is applying one of them. In some light load tests on an empty cluster (10 data nodes, 3 dedicated master nodes), the old behavior could only run ~1k jobs concurrently on a 30-minute job interval, with anything over that timing out, whereas the new behavior ran 5k jobs concurrently on a 1-minute job interval with no issues at all besides the increase in CPU.

With all that being said, I believe you should be able to reduce the job interval to whatever you want at this stage, and if any issues pop up, just let us know and we can see what the issue is and how to make it more reliable. The goal is that you should be able to just set this and forget it exists :wink:

Thanks @dbbaughe. Sounds like some significant improvements that should allow me to increase execution frequency as-needed, and/or maybe even just leave it at the default, out-of-the-box setting. Thanks again.

Hello @dbbaughe

We run 1,660 indices on 6 data nodes with 3 dedicated master nodes and 3 ingest nodes.
When I apply a new policy to all the indices, it runs on about 150 of them and then all the remaining tasks fail. I retry the failed ones, another ~150 apply, and so on… The resources on the servers aren’t heavily used…

This would be OK, but I have a second issue. The last time I retried all the failed indices, it launched on the 600 remaining ones but has now been stuck for more than 24 hours with the retry message… I don’t seem to be able to cancel the action any way other than removing the policy from the indices and reapplying it.

Is this a known issue, and is there another way to cancel the retry action other than removing the policy from all the indices?
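(For reference, the remove-and-reapply workaround I mentioned is done roughly like this via the ISM API, as I read the docs; the index name is a placeholder and the policy_id is the id of my policy.)

POST _opendistro/_ism/remove/my-index-2020-10-26

POST _opendistro/_ism/add/my-index-2020-10-26
{
    "policy_id": "Data cycle 7d SSD 180d HDD Delete"
}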

Hey @Franckiboy,

Which version of ES and the plugin are you using?

Hello @dbbaughe
Thanks for the reply!
I currently use the 1.10.1 Docker version, with the 1.10.1.1 Open Distro Index Management plugin.

Thanks @Franckiboy,

When you say all the remaining tasks fail, do you mean you apply the policy to all 1660 indices and 150 of them run and the remaining 1510 fail?

Could you post a copy of your policy you are applying and also what the failure message is?

Thanks

Hello @dbbaughe
Here is my current policy:

"policy": {
    "policy_id": "Data cycle 7d SSD 180d HDD Delete",
    "description": "A Policy to move indices to slower storage after 7 days and then delete after 180days",
    "last_updated_time": 1603719735639,
    "schema_version": 1,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [
                {
                    "index_priority": {
                        "priority": 100
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "warm",
                    "conditions": {
                        "min_index_age": "7d"
                    }
                }
            ]
        },
        {
            "name": "warm",
            "actions": [
                {
                    "allocation": {
                        "require": {
                            "temp": "warm"
                        },
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "delete",
                    "conditions": {
                        "min_index_age": "180d"
                    }
                }
            ]
        },
        {
            "name": "delete",
            "actions": [
                {
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ]
}

The failure is one of 3 things:
1 - Stuck at initializing and never resolves; can’t retry; I need to remove the policy from the index and add it again

2 - Fails in the transition step; retry will work (again, around 150 succeed on each retry until all are done)

3 - Fails in the allocation step; retry will work (this was the rarest form, only a few)

The error was very generic, stating it was retry 0.
In the logs I saw a lot of timeouts when the tasks failed.
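(In case it helps with debugging, I believe the per-index step status and retry count can also be pulled via the ISM explain API; the index name below is a placeholder.)

GET _opendistro/_ism/explain/my-index-2020-10-26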