Snapshot management policy snapshotting fails without latest_execution.info field

Snapshots started failing 10 days ago, without any logs to help. Nothing changed in the SM policy configuration. The explain output for the policy shows FAILED executions but no latest_execution.info field:

{
  "policies": [
    {
      "name": "snapshot-abc",
      "creation": {
        "current_state": "CREATION_START",
        "trigger": {
          "time": 1712253600000
        },
        "latest_execution": {
          "status": "FAILED",
          "start_time": 1712239399209,
          "end_time": null
        }
      },
      "deletion": {
        "current_state": "DELETION_START",
        "trigger": {
          "time": 1712286000000
        },
        "latest_execution": {
          "status": "FAILED",
          "start_time": 1712199799200,
          "end_time": null
        }
      },
      "policy_seq_no": 63243851,
      "policy_primary_term": 243,
      "enabled": true
    }
  ]
}
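For anyone reading a response like the one above, here is a minimal Python sketch that turns the epoch-millisecond timestamps into readable UTC times and flags executions that are FAILED but carry no info field. It only assumes the JSON shape shown in this post; the names ms_to_utc and summarize_explain are made up for illustration and are not part of any OpenSearch client.

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert epoch milliseconds to an ISO-8601 UTC string (None stays None)."""
    if ms is None:
        return None
    # Integer division drops the millisecond part; second precision is enough here.
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc).isoformat()


def summarize_explain(response):
    """Return one summary dict per workflow (creation/deletion) of each policy.

    Assumes the response shape shown in the post above:
    {"policies": [{"name": ..., "creation": {...}, "deletion": {...}}]}
    """
    rows = []
    for policy in response.get("policies", []):
        for workflow in ("creation", "deletion"):
            section = policy.get(workflow) or {}
            execution = section.get("latest_execution") or {}
            trigger = section.get("trigger") or {}
            rows.append({
                "policy": policy.get("name"),
                "workflow": workflow,
                "state": section.get("current_state"),
                "next_trigger": ms_to_utc(trigger.get("time")),
                "status": execution.get("status"),
                "started": ms_to_utc(execution.get("start_time")),
                "ended": ms_to_utc(execution.get("end_time")),
                # True when the execution failed but reported no details at all.
                "failed_without_info": (
                    execution.get("status") == "FAILED"
                    and not execution.get("info")
                ),
            })
    return rows
```

With the values from the response above, the creation start_time 1712239399209 decodes to 2024-04-04T14:03:19 UTC and the next creation trigger 1712253600000 to 2024-04-04T18:00:00 UTC, and both workflows are flagged as failed without info.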

Any ideas? My only guess is that the ISM config is not being triggered, based on this log line:

[2024-04-04T09:51:10,198][INFO ][o.o.j.s.JobSweeper       ] [es-main-master-1] Error while sweeping shard [.opendistro-ism-config][0], error message: all shards failed

But the shard status is green…

Any help would be appreciated!

I solved the error the hard way by first deleting the index .opendistro-ism-config and then reconfiguring the SM policy.

/!\ This action removes every ISM configuration you have set up so far on your OpenSearch cluster (every action triggered by the JobScheduler).

My intuition is that the JobScheduler couldn’t validate the configurations stored in the index .opendistro-ism-config. I guess I could have cleaned it up manually instead of deleting everything. The issue is clearly linked to a corrupted document in the index, which causes the JobScheduler error.

On the other hand, I still don’t understand what caused the index to be corrupted.

I had the exact same issue on my preprod cluster and found the root cause.

TL;DR: The user that had been used to create the Snapshot Management Policy no longer existed, which caused the issue.

The error was indeed caused by a corrupted document in the .opendistro-ism-config index, more specifically by the document describing the Snapshot Management Policy that I configured previously, say MY_SMP.

To find this document, search the index by policy name:

GET .opendistro-ism-config/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {
          "sm_policy.name": "MY_SMP"
        }}
      ]
    }
  }
}

The returned document contained a user field that referenced an old user I had previously deleted.
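If the search returns several hits, a small Python sketch like the one below can list which user each policy document refers to, so they can be compared against the users that still exist. The field layout is an assumption on my side (a user object with a name key, looked up under sm_policy first and then at the top level of _source), so check it against your own search output. The function name policy_users is hypothetical.

```python
def policy_users(search_response):
    """Map each hit's document ID to the user name stored with the policy.

    Takes the parsed JSON body of the _search call above. The location of the
    user object is an assumption; adjust the lookups to match your documents.
    """
    result = {}
    for hit in search_response.get("hits", {}).get("hits", []):
        source = hit.get("_source") or {}
        sm_policy = source.get("sm_policy") or {}
        # Assumption: user sits under "sm_policy"; fall back to the top level.
        user = sm_policy.get("user") or source.get("user") or {}
        result[hit.get("_id")] = user.get("name")
    return result
```

The keys of the returned dict are the document IDs, which is the value needed when deleting a document by ID. A value of None means no user name was found at either location.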

To solve the error, delete the document:

DELETE /.opendistro-ism-config/_doc/<MY_SMP_DOC_NAME>?timeout=5m

Then recreate the Snapshot Management Policy with a user that still exists.
