Performance issues while a snapshot is being taken

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 3.2.0

Describe the issue: We recently switched to OpenSearch 3.2.0, deployed in a k8s cluster using the Helm chart. We observed that while the snapshot policy is taking a snapshot to Azure Blob Storage, all dashboard operations become very slow, and we see OpenSearch rejection logs from Fluentd. We were literally unable to execute any search queries during that time. The snapshot then completed in a Partial state after 32 minutes. One sample failure is given below:

{
  "index": "xyz",
  "index_uuid": "xyz",
  "shard_id": 3,
  "reason": "node shutdown",
  "node_id": "xyz",
  "status": "INTERNAL_SERVER_ERROR"
}
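(This is one entry from the failures list in the snapshot detail response; the full list can be pulled with a request like the one below, where the repository and snapshot names are placeholders for ours.)

GET _snapshot/my-azure-repo/my-snapshot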

I’ve also checked _cat/thread_pool/snapshot?v and the output is given below.

node_name            name      active  queue  rejected
opensearch-master-1  snapshot       0      0         0
opensearch-master-2  snapshot       0      0         0
opensearch-master-4  snapshot       0      0         0
opensearch-client-1  snapshot       0      0         0
opensearch-master-0  snapshot       0      0         0
opensearch-data-0    snapshot       5    392         0
opensearch-client-4  snapshot       0      0         0
opensearch-client-3  snapshot       0      0         0
opensearch-data-1    snapshot       5    258         0
opensearch-data-2    snapshot       5    280         0
opensearch-client-0  snapshot       0      0         0
opensearch-client-2  snapshot       0      0         0
opensearch-master-3  snapshot       0      0         0
opensearch-data-4    snapshot       5    294         0
opensearch-data-3    snapshot       5    269         0
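(For comparison on other clusters, the same view plus the search and write pools, where the Fluentd rejections would show up, can be requested like this:)

GET _cat/thread_pool/snapshot,search,write?v&h=node_name,name,active,queue,rejected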

I have tested the behaviour by switching back to v2.16.0, and there the snapshot completes successfully within 3-4 minutes. Has anybody faced this issue, or found a solution for it?

@shs_tech Was 3.2.0 the only version that you’ve tested? Could you share your values.yml file?

How many snapshots do you run at the same time?
Do you get this issue with any index?
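In the meantime, one mitigation that might be worth trying: blob store repositories accept a snapshot rate limit, so re-registering the Azure repository with a throttle along these lines could reduce the IOPS pressure (repository name, container, and rate here are just examples):

PUT _snapshot/azure-snapshots
{
  "type": "azure",
  "settings": {
    "container": "opensearch-snapshots",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}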

@pablo I started observing this behaviour from v3.0.0. Our policy takes one snapshot every half an hour. I don’t think the issue is with a specific index, as I see the majority of them in the failed state. The thread pool stats show hundreds of queued snapshot tasks on the data nodes.
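The policy is essentially of the following shape, where everything except the half-hourly cron is a placeholder rather than our real values:

POST _plugins/_sm/policies/half-hourly-policy
{
  "description": "Snapshot every 30 minutes",
  "creation": {
    "schedule": {
      "cron": {
        "expression": "*/30 * * * *",
        "timezone": "UTC"
      }
    }
  },
  "snapshot_config": {
    "repository": "azure-snapshots"
  }
}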

Our values.yml has multiple overrides. Could you let me know which specific settings you want, so that I can share the relevant details?

I suspect that a change in the defaults in v3.x is causing the problem, as everything works fine with version 2.16. Any assistance on this is highly appreciated.

Could you elaborate on this one? Which areas are these overrides related to?

@shs_tech Did the comment from the other case solve your performance issue?

Hi @pablo

Yes. Setting the JVM flag related to Lucene fixed my issue. IOPS seem stable now and snapshots are really quick. I think this change/feature should be better documented.
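For anyone else hitting this: with the Helm chart, JVM flags like this can be passed through the opensearchJavaOpts value in values.yml. A minimal excerpt follows; the -D entry is only a placeholder for the actual Lucene property from the case referenced above, and the heap sizes are examples, not recommendations:

# values.yml excerpt - replace the -D placeholder with the Lucene property from the referenced case
opensearchJavaOpts: "-Xms4g -Xmx4g -Dsome.lucene.property=value"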
