Performance Issues while Snapshot is taken

shs_tech · September 17, 2025, 5:06am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 3.2.0

Describe the issue: We recently switched to OpenSearch 3.2.0, which is deployed in a k8s cluster using the Helm chart. While the snapshot policy is taking a snapshot to Azure Blob storage, all dashboard operations become very slow and we see OpenSearch rejections in the Fluentd logs. We were unable to execute any search queries during that time. The snapshot then completed in a partial state after 32 minutes. A sample failure is given below:

{ "index": "xyz", "index_uuid": "xyz", "shard_id": 3, "reason": "node shutdown", "node_id": "xyz", "status": "INTERNAL_SERVER_ERROR" }
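When many shards fail, it can help to group the failure entries by reason and by node to see whether they point at the same nodes. The following is only a rough sketch: it assumes the failures are available as a JSON list of objects shaped like the sample above. The first entry mirrors that sample; the other entries, and the names `FAILURES_JSON` and `summarize_failures`, are hypothetical and only for illustration.

```python
import json
from collections import Counter

# Hypothetical input. The first entry mirrors the failure quoted above;
# the second and third entries are invented purely for illustration.
FAILURES_JSON = """
[
  {"index": "xyz", "index_uuid": "xyz", "shard_id": 3,
   "reason": "node shutdown", "node_id": "xyz",
   "status": "INTERNAL_SERVER_ERROR"},
  {"index": "abc", "index_uuid": "abc", "shard_id": 0,
   "reason": "node shutdown", "node_id": "xyz",
   "status": "INTERNAL_SERVER_ERROR"},
  {"index": "abc", "index_uuid": "abc", "shard_id": 1,
   "reason": "node shutdown", "node_id": "pqr",
   "status": "INTERNAL_SERVER_ERROR"}
]
"""


def summarize_failures(failures):
    """Count shard failures per reason and per node id."""
    by_reason = Counter(f["reason"] for f in failures)
    by_node = Counter(f["node_id"] for f in failures)
    return by_reason, by_node


failures = json.loads(FAILURES_JSON)
by_reason, by_node = summarize_failures(failures)

for reason, count in by_reason.most_common():
    print(f"reason={reason!r}: {count} shard(s)")
for node, count in by_node.most_common():
    print(f"node_id={node}: {count} shard(s)")
```

If nearly all failures share one reason across several nodes, as "node shutdown" does in this made-up input, that suggests a cluster-wide problem during the snapshot rather than an issue with a particular index.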

I’ve also checked _cat/thread_pool/snapshot?v, and the output is given below.

node_name name active queue rejected
opensearch-master-1 snapshot 0 0 0
opensearch-master-2 snapshot 0 0 0
opensearch-master-4 snapshot 0 0 0
opensearch-client-1 snapshot 0 0 0
opensearch-master-0 snapshot 0 0 0
opensearch-data-0 snapshot 5 392 0
opensearch-client-4 snapshot 0 0 0
opensearch-client-3 snapshot 0 0 0
opensearch-data-1 snapshot 5 258 0
opensearch-data-2 snapshot 5 280 0
opensearch-client-0 snapshot 0 0 0
opensearch-client-2 snapshot 0 0 0
opensearch-master-3 snapshot 0 0 0
opensearch-data-4 snapshot 5 294 0
opensearch-data-3 snapshot 5 269 0
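To make the backlog easier to read, the output above can be parsed and sorted by queue depth. This is a minimal sketch that works on the pasted text only. It does not call the cluster, and the helper names `parse_cat_thread_pool` and `nodes_with_backlog` are made up for this example.

```python
# Text copied from the thread pool output above.
CAT_OUTPUT = """\
node_name name active queue rejected
opensearch-master-1 snapshot 0 0 0
opensearch-master-2 snapshot 0 0 0
opensearch-master-4 snapshot 0 0 0
opensearch-client-1 snapshot 0 0 0
opensearch-master-0 snapshot 0 0 0
opensearch-data-0 snapshot 5 392 0
opensearch-client-4 snapshot 0 0 0
opensearch-client-3 snapshot 0 0 0
opensearch-data-1 snapshot 5 258 0
opensearch-data-2 snapshot 5 280 0
opensearch-client-0 snapshot 0 0 0
opensearch-client-2 snapshot 0 0 0
opensearch-master-3 snapshot 0 0 0
opensearch-data-4 snapshot 5 294 0
opensearch-data-3 snapshot 5 269 0
"""


def parse_cat_thread_pool(text):
    """Turn whitespace-separated cat output (with a header row) into dicts."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    records = []
    for row in body:
        record = dict(zip(header, row))
        for key in ("active", "queue", "rejected"):
            record[key] = int(record[key])
        records.append(record)
    return records


def nodes_with_backlog(records):
    """Return nodes that have queued tasks, deepest queue first."""
    queued = [r for r in records if r["queue"] > 0]
    return sorted(queued, key=lambda r: r["queue"], reverse=True)


records = parse_cat_thread_pool(CAT_OUTPUT)
backlog = nodes_with_backlog(records)
total_queued = sum(r["queue"] for r in records)

for r in backlog:
    print(f'{r["node_name"]}: active={r["active"]} queue={r["queue"]}')
print(f"total queued snapshot tasks: {total_queued}")
```

On this output, only the five data nodes have a backlog. Each has 5 active snapshot threads and a queue of several hundred tasks, while the rejected count is 0 everywhere.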

I tested the behaviour by switching back to v2.16.0, and there the snapshot completes successfully within 3-4 minutes. Has anybody faced this issue, or is there a solution for it?

pablo · September 17, 2025, 7:55pm

@shs_tech Was 3.2.0 the only version that you’ve tested? Could you share your values.yml file?

How many snapshots do you run at the same time?
Do you get this issue with any index?

shs_tech · September 18, 2025, 3:29am

@pablo I started observing this behaviour from v3.0.0. Our policy takes one snapshot every half an hour. I don’t think the issue is with a specific index, as the majority of them were in the failed state. The thread pool stats show a lot of tasks queued on the data nodes.

Our values.yml has multiple overrides. Could you tell me which specific settings you want, so that I can share the relevant details?

I suspect that a change in the defaults in v3.x is causing the problem, as it works fine with version 2.16. Any assistance on this is highly appreciated.

pablo · September 24, 2025, 7:48pm

Could you elaborate on this one? Which areas do these overrides relate to?

pablo · September 25, 2025, 12:53pm

@shs_tech Did the comment from the other case solve your performance issue?

shs_tech · September 25, 2025, 1:15pm

Hi @pablo

Yes. Setting the JVM flag related to Lucene fixed my issue. The IOPS seem to be stable now and snapshots are really quick. I think this change/feature should be well documented.
