I’ve been finding that many of my indices get stuck during the force_merge action and never transition to the next state, which deletes the index. This is making the cluster very hard to keep running — I’m continually juggling things manually to avoid hitting node shard limits.
I have updated my policies to remove the force_merge action however this isn’t helping the indices which are currently stuck. Is there a way to unstick them?
I’m currently running v1.6 of the plugin.
Could you clarify a bit more on “getting stuck”? Which part of the action do they get stuck at?
If they are stuck in the middle of the force_merge action then the quickest way would be to remove the policy from the indices that are “stuck” and then either a) manually do what you intended to do or b) add a temporary policy to finish the rest of their lifecycle.
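For reference, and assuming the standard ISM API paths in this plugin version, removing the policy and attaching a temporary one looks roughly like this (the index name and policy ID are placeholders):

```
POST _opendistro/_ism/remove/my-stuck-index-000001

POST _opendistro/_ism/add/my-stuck-index-000001
{
  "policy_id": "temporary-cleanup-policy"
}
```

The `add` call attaches the named policy to the index so it can work through the remaining lifecycle states.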
Perhaps we could consider an administrative API that allows someone to “override” a managed index to skip the current action it’s on when it gets stuck like this.
But either way, if possible please let us know how and where these got stuck so we can look into it. The best outcome is for us to make sure it can’t happen, so you don’t have to do anything manually.
Thanks for your response. Basically, in many cases after an index rolled over it entered the force_merge action and never left it, so it never transitioned to the delete phase — which should typically happen 10 days after the force_merge.
I had looked at the tasks using the ES Tasks API, and it wasn’t uncommon for these merges to take up to 5 hours to complete. It’s possible some merges were taking even longer and I just didn’t catch them, but none were taking days, so it seemed pretty clear that the ISM state machines were getting stuck.
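For anyone else wanting to watch the merges, I was checking with something like the following (the `actions` wildcard is my guess at the task name pattern — the underlying task action is `indices:admin/forcemerge`):

```
GET _tasks?actions=*forcemerge*&detailed=true
```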
I have since removed the force_merge action from the warm phase of all my policies, and this has helped somewhat. In addition, I’ve created a little job that fires every hour, checks for indices in a dodgy state, and helps them along — I’m still seeing many of them seizing up due to timeouts writing their metadata, even after increasing the transition attempt frequency to 1h.
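The job is essentially this pair of calls per index — inspect the managed index state, and if it looks stuck, retry it (index name is a placeholder; note the retry API only applies when ISM considers the managed index to be in a failed state, so it may not help if ISM still thinks the action is running):

```
GET _opendistro/_ism/explain/my-index-000001

POST _opendistro/_ism/retry/my-index-000001
```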
BTW, I’m still on OpenDistro v1.6. I understand there have been significant improvements to ISM in v1.7 and would like to try and upgrade soon.
Is there a particular node type that ISM executes on? Each of our clusters has dedicated master, data, and ingest/client-coordination nodes. If the ISM plugin doesn’t execute on data nodes, that would at least allow us to test out the new plugin very quickly on larger clusters.
We are experiencing the same issues. We have a policy set up that transitions to a warm state after 1 day (which is when Logstash rolls over to writing to a new index). The warm state marks the index as read-only and then does a force merge down to 1 segment. Most of the indices get stuck on the “Force Merge” action with the Info field saying “Started Force Merge”.
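For context, our warm state is essentially the following (trimmed for brevity; state and condition names are placeholders for our actual policy):

```json
{
  "name": "warm",
  "actions": [
    { "read_only": {} },
    { "force_merge": { "max_num_segments": 1 } }
  ],
  "transitions": [
    { "state_name": "delete", "conditions": { "min_index_age": "15d" } }
  ]
}
```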
I have checked every index in this state, and the force merges were successful. All of the indices are at 1 segment per shard. It’s only the ISM state of the index that is stuck.
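I verified the segment counts with the cat segments API, e.g. (index name is a placeholder):

```
GET _cat/segments/my-index-000001?v
```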
In Kibana, on the Managed Indices page, if I select the index the “Retry Policy” button remains disabled as well.
I could try re-applying the policy and forcing the state to warm. However, that has the negative consequence of starting the index’s timer back at zero. Our policy also dictates to delete the index after 15 days. If I re-applied the policy then that 15 day counter would start over and the index would live for longer than 15 days.
I should note that the merge process for our indices can take upwards of 2-5 hours. We are running on AWS Elasticsearch and thus we have no access to the server or any Elasticsearch, Kibana or Opendistro configuration files.