ManagedIndexRunner stops rollover and transition attempts

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 1.3.6
RHEL 8.10

Describe the issue:
I am experiencing an issue where some or all indices fail to roll over despite matching the criteria defined in the ISM policy. I am troubleshooting this across multiple environments that include an OpenSearch component, and the pattern varies slightly with each environment.
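
For context, the per-index ISM status can be checked with the Explain API; the index name below is one of the sanitized placeholders used in the logs:

GET _plugins/_ism/explain/dataindex_primarymetric-000047

The response should include the attached policy ID along with the current state, action, and step status for the index.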

In one environment (a 5-node OpenSearch cluster), I can see a change in the OpenSearch log behavior coinciding with an OS patching event during which the servers were rebooted and the OpenSearch services restarted. The entries prior to patching looked similar to the following:

[2024-10-15T00:00:02,454][INFO ][o.o.i.i.ManagedIndexRunner] [192.0.2.71] Executing attempt_transition_step for dataindex_primarymetric-000044
[2024-10-15T00:00:02,455][INFO ][o.o.i.i.ManagedIndexRunner] [192.0.2.71] Finished executing attempt_transition_step for dataindex_primarymetric-000044
[2024-10-15T00:00:02,489][INFO ][o.o.j.s.JobScheduler     ] [192.0.2.71] Will delay 96923 miliseconds for next execution of job dataindex_secondarymetric-000165
[2024-10-15T00:00:02,616][INFO ][o.o.j.s.JobScheduler     ] [192.0.2.71] Will delay 132407 miliseconds for next execution of job dataindex_tertiarymetric-000161

After patching, the logs only showed the JobScheduler staggering events, with no entries related to ManagedIndexRunner:

[2024-10-16T00:00:05,740][INFO ][o.o.j.s.JobScheduler     ] [192.0.2.71] Will delay 100262 miliseconds for next execution of job dataindex_primarymetric-000047
[2024-10-16T00:00:05,998][INFO ][o.o.j.s.JobScheduler     ] [192.0.2.71] Will delay 121054 miliseconds for next execution of job dataindex_secondarymetric-000169
[2024-10-16T00:00:06,037][INFO ][o.o.j.s.JobScheduler     ] [192.0.2.71] Will delay 31765 miliseconds for next execution of job dataindex_tertiarymetric-000163

For this environment, even after a successful manual rollover of one of the larger indices that had failed to roll over automatically, no new ManagedIndexRunner entries appeared in the logs. My team is looking to do a clean stop and rolling start of services at the earliest opportunity to see whether a fresh start “resolves” the issue.
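
For reference, the manual rollover was done with something along the lines of the following, using the standard rollover API (the write alias name here is a placeholder following the same sanitization as the index names):

POST dataindex_primarymetric-write/_rollover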

In a separate environment (a single-node OpenSearch instance), only one particularly large index had no ManagedIndexRunner rollover/transition evaluation checks in the logs, while all other indices had check entries and were rolling over just fine. In this environment, after a manual rollover of the problem index was performed, the ManagedIndexRunner entries started showing up for that index again.
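
One sanity check worth noting for a problem index is whether the rollover alias setting is still in place (index name is again a placeholder):

GET dataindex_primarymetric-000047/_settings/index.plugins.index_state_management.rollover_alias

As I understand it, a missing rollover_alias would normally surface as an explicit rollover failure in the explain output rather than the silence described above.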

I have not found any bug reports against this version of OpenSearch describing this type of issue, and I am having difficulty finding a common thread as to why sometimes just one index, and sometimes most indices in an environment, stop having rollover/transition evaluations performed by the running OpenSearch instance(s). If anyone has seen this behavior or can provide guidance or insight, it would be greatly appreciated. Happy to provide any further information I can to aid in troubleshooting.
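
To rule out a settings problem, I also plan to verify the ISM job settings in each environment; if I am reading the documentation for this version correctly, the relevant cluster settings can be dumped with:

GET _cluster/settings?include_defaults=true&flat_settings=true

and then checked in the output:

plugins.index_state_management.enabled        (should be true)
plugins.index_state_management.job_interval   (default 5, in minutes)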

Configuration:

{
    "id": "policy_standard_rollover",
    "seqNo": 0,
    "primaryTerm": 1,
    "policy": {
        "policy_id": "policy_standard_rollover",
        "description": "Standard Rollover Policy",
        "last_updated_time": 1713902386945,
        "schema_version": 18,
        "error_notification": null,
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "rollover": {
                            "min_size": "50gb"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm"
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "replica_count": {
                            "number_of_replicas": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "30d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "dataindex_primarymetric-*",
                    "dataindex_secondarymetric-*",
                    "dataindex_tertiarymetric-*"
                ],
                "priority": 1,
                "last_updated_time": 1713902386945
            }
        ]
    }
}
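
For completeness, the same policy can also be fetched directly through the ISM API; the JSON above was copied from the Dashboards ISM view, which I believe accounts for the camelCase wrapper fields (id, seqNo, primaryTerm):

GET _plugins/_ism/policies/policy_standard_rollover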

Relevant Logs or Screenshots:
See the log excerpts included in the issue description above.