Snapshots Tipping Over Data Nodes

Howdy!

We’ve noticed a peculiar issue where, rather sporadically, data nodes drop out from our cluster when our nightly snapshot policy executes. The interesting tidbit is that this is only happening for one out of a handful of clusters.

So far, we haven’t gleaned much from the OpenSearch logs. The only noticeable errors are "GC did not bring memory usage down" and "failed to list shard". However, if these are contributing to the issue, we’re not sure how to rectify the problem or why it’s only happening in one of our clusters.

From,
DH

We are seeing the same thing: the backup starts, data nodes drop out, and we are left with indexes that are unavailable.

@BurntBaboon,
I have also experienced this issue with snapshots. In my case it happened because the JVM heap ran out of memory while the snapshot was being taken. To confirm whether you are hitting the same issue, could you please provide the following information?

  1. Can you share the response body of the Get Snapshot API call (see the OpenSearch documentation)? We need to check the failures and shards fields, as they should provide a clue.

  2. Do you see java.lang.OutOfMemoryError: Java heap space errors in the OpenSearch data node logs? These should be printed after the GC did not bring memory usage down errors.
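
If it helps, here is a rough Python sketch of both checks. It is only an illustration, not an official tool: the function names are made up, the fetch helper assumes the cluster is reachable without authentication, and the field names (snapshots, state, shards, failures, index, shard_id, reason) are what I would expect in a Get Snapshot response body, so adjust them if yours looks different.

```python
import json
import urllib.request


def summarize_snapshots(body):
    """Pull state, shard counts, and failures out of a Get Snapshot response body."""
    summaries = []
    for snap in body.get("snapshots", []):
        shards = snap.get("shards", {})
        summaries.append({
            "snapshot": snap.get("snapshot"),
            "state": snap.get("state"),
            "shards_total": shards.get("total"),
            "shards_failed": shards.get("failed"),
            # One (index, shard_id, reason) tuple per reported shard failure.
            "failures": [
                (f.get("index"), f.get("shard_id"), f.get("reason"))
                for f in snap.get("failures", [])
            ],
        })
    return summaries


def oom_after_gc_warning(log_lines):
    """Return True if a heap-space OOM shows up after a GC warning in the log."""
    seen_gc_warning = False
    for line in log_lines:
        if "GC did not bring memory usage down" in line:
            seen_gc_warning = True
        elif seen_gc_warning and "java.lang.OutOfMemoryError: Java heap space" in line:
            return True
    return False


def fetch_snapshots(base_url, repository, snapshot="_all"):
    """GET <base_url>/_snapshot/<repository>/<snapshot>.

    Assumption: no authentication or custom TLS setup is needed.
    """
    url = f"{base_url}/_snapshot/{repository}/{snapshot}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

To use it, pass the result of fetch_snapshots (with your own cluster URL and repository name) to summarize_snapshots, and feed the lines of a data node log file to oom_after_gc_warning. A snapshot whose state is not SUCCESS, or any non-empty failures list, is what I would look at first.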