OpenSearch 3.1: performance very slow when taking a snapshot

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch/Dashboard 3.1.0

Describe the issue:
When taking a snapshot, the cluster performance becomes very slow.
During this time, heap.percent is around 50% to 60%;
however, ram.percent is always at 99% or 100%.

Checking the pod logs, I see many records related to insufficient memory.
This issue doesn’t exist in OpenSearch 2.19.

Configuration:

    snapshotRepositories:
        # ceph
        - name: CCEE_EUDE1_CEPH_S3_ISMPOLICY
          type: s3
          settings:
            bucket: ssdl-logging-opensearch-s3interface-snapshot-ismpolicy
            region: eu-de-1
            client: eude1ceph
            disable_chunked_encoding: "true"
            compress: "true"
            storage_class: "standard"

Relevant Logs or Screenshots:

[2025-07-15T03:45:21,689][WARN ][o.o.t.NativeMessageHandler] [ssdl-app-logging-opensearch-data-2] handling inbound transport message [InboundMessage{Header{NATIVE}{121388}{3.1.0}{860586}{true}{false}{false}{false}{indices:data/write/bulk[s]}}] took [15337ms] which is above the warn threshold of [5000ms]
[2025-07-15T03:45:48,699][WARN ][o.o.t.TransportService   ] [ssdl-app-logging-opensearch-data-2] Received response for a request that has timed out, sent [25013ms] ago, timed out [0ms] ago, action [internal:coordination/fault_detection/leader_check], node [{ssdl-app-logging-opensearch-manager-0}{tE7UkyCwTqqBihmjPC96zQ}{yI78U3BATJyXgMn8zSkM9w}{100.104.8.73}{100.104.8.73:9300}{m}{shard_indexing_pressure_enabled=true}], id [3640105]
[2025-07-15T03:45:48,720][INFO ][o.o.c.s.ClusterApplierService] [ssdl-app-logging-opensearch-data-2] removed {{ssdl-app-logging-opensearch-data-3}{F1Nk9N8BSqayno0AioJmRg}{Sj8eCMaCR1C70Dd75uXaFg}{100.104.10.129}{100.104.10.129:9300}{d}{shard_indexing_pressure_enabled=true}}, term: 99, version: 116005, reason: ApplyCommitRequest{term=99, version=116005, sourceNode={ssdl-app-logging-opensearch-manager-0}{tE7UkyCwTqqBihmjPC96zQ}{yI78U3BATJyXgMn8zSkM9w}{100.104.8.73}{100.104.8.73:9300}{m}}

[2025-07-15T03:46:46,719][WARN ][i.n.c.AbstractChannelHandlerContext] [ssdl-app-logging-opensearch-data-2] An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: SSLHandshakeException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: BadPaddingException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)];' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:

Hi @latituder,

A lot has been updated and changed between OpenSearch 2.19 and 3.1.0, so it would not be surprising for resource usage to differ in many ways along with those changes.

Looking at what you have provided, I can see that heap is hitting 60% and RAM is always maxed out. However, without knowing more, it is hard to say whether adding more RAM to each node is what is required to resolve the issues you’re seeing.
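To line up heap against RAM per node while a snapshot is running, something like the following might help. It is only a sketch: it assumes you have already fetched `GET _cat/nodes?h=name,heap.percent,ram.percent&format=json` and parsed it into a list of rows. The function name, the two thresholds, and the sample row are made up for illustration and are not OpenSearch defaults.

```python
import json


def flag_memory_pressure(cat_nodes_rows, ram_threshold=95, heap_threshold=75):
    """Classify each node from _cat/nodes JSON rows.

    Both thresholds are arbitrary illustrative values, not OpenSearch settings.
    """
    report = []
    for row in cat_nodes_rows:
        # The cat APIs return numbers as strings when format=json is used.
        heap = int(row["heap.percent"])
        ram = int(row["ram.percent"])
        if heap >= heap_threshold:
            note = "heap pressure"
        elif ram >= ram_threshold:
            note = "high RAM with moderate heap: look outside the JVM heap"
        else:
            note = "ok"
        report.append((row["name"], heap, ram, note))
    return report


# Made-up sample row shaped like the _cat/nodes JSON output.
sample = json.loads(
    '[{"name": "data-2", "heap.percent": "55", "ram.percent": "99"}]'
)
for entry in flag_memory_pressure(sample):
    print(entry)
```

Running this every few seconds before, during, and after a snapshot would show whether the RAM figure actually moves with the snapshot or sits at the same level all the time.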

So let’s start by gathering some more information:

  • Could you check garbage collection activity on the nodes and share what you’re seeing?

  • How big are the snapshots, and have they been growing in size?

  • Have you tested writing the snapshots to a local repository instead of S3 to see whether you get similar results?

  • When you say “the cluster performance becomes very slow”, could you elaborate further?

    • What is it that becomes slow?
    • Have the resources available to the cluster changed?
    • Is there anything else running there?
    • Where is OpenSearch running? (Docker, Kubernetes, VM)

Leeroy.