Snapshot Deletion Triggered S3 Rate Limiting and Subsequent Snapshot Failures

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): OpenSearch 2.19.0

Describe the issue:
We’re experiencing a recurring issue with OpenSearch snapshots getting stuck in the IN_PROGRESS state.

Context:

  • We use OpenSearch ISM (Snapshot Management) to create daily snapshots in an S3-backed repository.
  • The snapshot policy uses "*" to include all indices and runs once per day.
  • Snapshot deletion is hitting the S3 rate limit, causing the process to get stuck.

Configuration:

  "policies": [
    {
      "_id": "daily-policy-1-sm-policy",
      "_seq_no": 42620428,
      "_primary_term": 90,
      "sm_policy": {
        "name": "daily-policy-1",
        "description": "Daily snapshot policy at 1 AM PST",
        "schema_version": 21,
        "creation": {
          "schedule": {
            "cron": {
              "expression": "0 1 * * *",
              "timezone": "America/Los_Angeles"
            }
          },
          "time_limit": "1h"
        },
        "deletion": {
          "schedule": {
            "cron": {
              "expression": "0 0 * * *",
              "timezone": "America/Los_Angeles"
            }
          },
          "condition": {
            "min_count": 7,
            "max_count": 30
          }
        },
        "snapshot_config": {
          "indices": [
            "*"
          ],
          "ignore_unavailable": true,
          "include_global_state": false,
          "name": "daily-{now/d}",
          "repository": "daily_snapshot_1",
          "partial": false
        },
        "schedule": {
          "interval": {
            "start_time": 1745875413734,
            "period": 1,
            "unit": "Minutes"
          }
        },
        "enabled": true,
        "last_updated_time": 1751925866497,
        "enabled_time": 1751925866497
      }
    }
  ]
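
While a run is stuck in IN_PROGRESS, the state can be inspected via the Snapshot Management explain API and the snapshot status API. A minimal sketch of how we check this; the endpoint and credentials are placeholders for our environment, not part of the policy above:

  # Diagnostic sketch: inspect the SM policy state machine and any
  # in-progress snapshots. Endpoint and credentials are placeholders.
  import requests

  HOST = "https://localhost:9200"   # placeholder cluster endpoint
  AUTH = ("admin", "admin")         # placeholder basic-auth credentials

  # SM explain API: current creation/deletion state of the policy.
  r = requests.get(f"{HOST}/_plugins/_sm/policies/daily-policy-1/_explain",
                   auth=AUTH, verify=False)
  print(r.json())

  # Snapshots currently running against the repository.
  r = requests.get(f"{HOST}/_snapshot/daily_snapshot_1/_status",
                   auth=AUTH, verify=False)
  print(r.json())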

Relevant Logs or Screenshots:

deleting snapshots [daily-policy-1-2025-05-15t08:00:40-8vcvxmzn] from repository [daily_snapshot_1]
[2025-07-03T07:01:22,386][WARN ][o.o.r.b.BlobStoreRepository] [os-fileingest-master-3.prod.mw.int] [daily_snapshot_1] Exception during single stale index delete
java.lang.RuntimeException: java.util.concurrent.CompletionException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503,
        at org.opensearch.repositories.s3.S3BlobContainer.getFutureValue(S3BlobContainer.java:400) ~[?:?]
        at org.opensearch.repositories.s3.S3BlobContainer.delete(S3BlobContainer.java:380) ~[?:?]
        at org.opensearch.repositories.blobstore.BlobStoreRepository.deleteContainer(BlobStoreRepository.java:2280) ~[opensearch-2.19.0.jar:2.19.0]
        at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneStaleIndexDelete$45(BlobStoreRepository.java:2245) [opensearch-2.19.0.jar:2.19.0]
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74) [opensearch-2.19.0.jar:2.19.0]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-2.19.0.jar:2.19.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1014) [opensearch-2.19.0.jar:2.19.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.19.0.jar:2.19.0]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]

@Mai since S3 rate limiting is applied per key prefix (partition), have you tried splitting the snapshots into smaller lists of indices and creating separate partitions and ISM policies to manage them?
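
For example, a minimal sketch of what the split could look like, assuming the repository-s3 plugin; the bucket, repository names, endpoint, and credentials are all hypothetical. The point is that each repository gets its own base_path and therefore its own S3 key prefix:

  # Sketch: register two S3 repositories with distinct base_path values so
  # their delete/put traffic lands on different S3 key prefixes (partitions).
  # Bucket, repository names, endpoint, and credentials are hypothetical.
  import requests

  HOST = "https://localhost:9200"
  AUTH = ("admin", "admin")

  for repo, prefix in [("daily_snapshot_a", "snapshots/group-a"),
                       ("daily_snapshot_b", "snapshots/group-b")]:
      body = {
          "type": "s3",
          "settings": {
              "bucket": "my-snapshot-bucket",  # hypothetical bucket
              "base_path": prefix,             # one prefix per repository
          },
      }
      requests.put(f"{HOST}/_snapshot/{repo}", json=body,
                   auth=AUTH, verify=False).raise_for_status()

Each repository would then get its own snapshot policy covering a subset of the indices, so no single prefix has to absorb the whole delete burst.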

Also, snapshotting with "*" captures a lot of system indices that you will probably never attempt to restore. A better approach is to select the indices you actually want to snapshot (using wildcards to capture groups of indices where necessary, e.g. products-*).
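
A sketch of a policy restricted to explicit patterns; the patterns, policy name, repository, endpoint, and credentials here are illustrative only, not taken from your cluster:

  # Sketch: create an SM policy that snapshots selected index patterns
  # instead of "*". Patterns, policy name, and endpoint are illustrative.
  import requests

  HOST = "https://localhost:9200"
  AUTH = ("admin", "admin")

  policy = {
      "description": "Daily snapshot of application indices only",
      "creation": {
          "schedule": {"cron": {"expression": "0 1 * * *",
                                "timezone": "America/Los_Angeles"}}
      },
      "snapshot_config": {
          "indices": ["products-*", "orders-*"],  # explicit app indices
          "ignore_unavailable": True,
          "include_global_state": False,
          "repository": "daily_snapshot_a",       # hypothetical repository
      },
  }
  requests.post(f"{HOST}/_plugins/_sm/policies/daily-apps", json=policy,
                auth=AUTH, verify=False).raise_for_status()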

Yes, we are looking into that option as well. At present our indices don't follow a pattern that would let us group them.
Since the delete operation is blocking the subsequent snapshots, do you think we should remove the deletion section from the policy and have an external script delete the older snapshots instead? The script would hit the rate limit error as well, but we could add retries.
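
Roughly what we have in mind, as a sketch only; the endpoint, credentials, and retention count are assumptions (keep the newest seven to match min_count), and any 5xx from the delete call is treated as retryable:

  # Sketch of an external cleanup: keep the newest KEEP snapshots, delete
  # the rest one at a time, and back off when the delete fails with a 5xx
  # (e.g. S3 rate limiting). Endpoint, credentials, KEEP are assumptions.
  import time
  import requests

  HOST = "https://localhost:9200"
  AUTH = ("admin", "admin")
  REPO = "daily_snapshot_1"
  KEEP = 7   # matches the policy's min_count

  snaps = requests.get(f"{HOST}/_snapshot/{REPO}/_all",
                       auth=AUTH, verify=False).json()["snapshots"]
  snaps.sort(key=lambda s: s["start_time_in_millis"])  # oldest first

  for snap in snaps[:-KEEP]:
      for attempt in range(5):
          r = requests.delete(f"{HOST}/_snapshot/{REPO}/{snap['snapshot']}",
                              auth=AUTH, verify=False)
          if r.status_code < 500:
              r.raise_for_status()   # surface non-retryable errors
              break
          time.sleep(2 ** attempt)   # exponential backoff on 5xx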

@Mai this would work, but you would be applying a "patch", which could still have issues if the underlying problem with the snapshots is not resolved.