Snapshots Tipping Over Data Nodes

Howdy!

We’ve noticed a peculiar issue where, rather sporadically, data nodes drop out from our cluster when our nightly snapshot policy executes. The interesting tidbit is that this is only happening for one out of a handful of clusters.

So far, we haven’t gleaned much from the OpenSearch logs. The only noticeable errors are "GC did not bring memory usage down" and "failed to list shard". However, if these are contributing to the issue, we’re not sure how to rectify the problem, or why it’s only happening in one of our clusters.
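
Since the GC message points at heap pressure, one thing that could be checked is whether heap usage on the data nodes climbs while the snapshot runs. Below is a minimal sketch, not something we have run in production. It assumes the standard `_nodes/stats/jvm` endpoint and its `heap_used_percent` field; the base URL, the 85% threshold, and the function names are placeholders, and a secured cluster would also need auth and TLS settings that are left out here.

```python
import json
import urllib.request


def nodes_over_heap_threshold(stats, threshold=85):
    """Return (node name, heap_used_percent) pairs at or above the threshold.

    `stats` is the parsed JSON body of GET _nodes/stats/jvm.
    The 85% default is an arbitrary assumption, not a recommended value.
    """
    flagged = []
    for node in stats.get("nodes", {}).values():
        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= threshold:
            flagged.append((node.get("name", "unknown"), heap))
    # Highest heap usage first
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes. base_url is a hypothetical placeholder."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Run this repeatedly (for example once a minute) during the snapshot window.
    for name, heap in nodes_over_heap_threshold(fetch_jvm_stats()):
        print(f"{name}: heap at {heap}%")
```

Comparing this output between the affected cluster and a healthy one during the same snapshot window might show whether heap usage is what sets the problem cluster apart.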

From,
DH

We are seeing the same thing: the backup starts, data nodes drop out, and then we have indexes that are unavailable.