Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v1.3.2
3 master nodes (AWS t3a.medium) and 3 data nodes (AWS r6i.2xlarge), running on EC2 instances in AWS.
Standard setup with Beats sending docs to Redis, and Logstash ingesting docs into OpenSearch.
Ingesting approx 300,000 records a minute
Describe the issue:
When I do a snapshot restore (e.g. restoring 3 indices, approx 140 GB in size, 53m docs), I notice that while the snapshot is restoring the cluster slows down and a backlog of docs builds up in Redis.
Generally each index consists of 3 shards.
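For reference, the restores are kicked off with the standard snapshot restore API, roughly like this (repository, snapshot and index names here are just placeholders):

POST _snapshot/my-repo/my-snapshot/_restore
{
  "indices" : "index-1,index-2,index-3"
}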
Configuration:
Below are my cluster settings; everything else is default.
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      },
      "max_shards_per_node" : "2000"
    },
    "plugins" : {
      "index_state_management" : {
        "metadata_migration" : {
          "status" : "1"
        },
        "template_migration" : {
          "control" : "-1"
        }
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "rebalance" : {
          "enable" : "all"
        },
        "allocation" : {
          "node_concurrent_recoveries" : "10",
          "enable" : "all",
          "node_initial_primaries_recoveries" : "10"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "100mb"
      }
    }
  }
}
I am aware that I have raised max_bytes_per_sec from its default of 40mb to 100mb, but I can't imagine that is the main reason (I will try some restores with the default value).
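Dropping it back is just a transient cluster settings update along these lines (using null instead of "40mb" should also reset it to the default):

PUT _cluster/settings
{
  "transient" : {
    "indices.recovery.max_bytes_per_sec" : "40mb"
  }
}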
Relevant Logs or Screenshots:
Looking at the infra, I do see some interesting things.
I can see writes going to the EBS disks (FYI: we have one 2 TB EBS gp3 disk attached to each data node) at approx 100 MB/s, which lines up with the max_bytes_per_sec setting above. The graph below shows this: a restore done yesterday and one this morning.
The CPU on the data nodes is OK:
The CPU on the master nodes is OK:
The network traffic coming into the data nodes was just above 6 GB a minute (roughly 100 MB/s):
So I am not seeing any resource issues; the cluster looks OK.
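If it helps, during the next restore I can capture recovery progress and write thread pool queue/rejection stats with the standard _cat APIs, something like:

GET _cat/recovery?v&active_only=true&h=index,shard,stage,bytes_percent,time
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected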
NB - With additional testing, even a single-index restore (approx 40 GB, 8.5m docs) causes the cluster to slow down and Redis to start filling up.
I changed max_bytes_per_sec back to 40mb and tried a 40 GB index restore, and this one seemed OK. I then tried to restore 7 indices of a similar size; during this time the cluster appeared to stay relatively up to date (I think it fell behind by at most 2 minutes, but caught up relatively quickly).
Does changing max_bytes_per_sec really affect the cluster that much? Has anybody else had similar experiences?