Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 3.2.0
Describe the issue: We recently switched to OpenSearch 3.2.0, which is deployed in a k8s cluster using the Helm chart. We observed that when the snapshot policy initiates a snapshot to Azure Blob, all dashboard operations become very slow and we see OpenSearch rejections in the fluentd logs. We were unable to execute any search queries during that time. The snapshot then completed in Partial state after 32 minutes. A sample failure is given below:
{ "index": "xyz", "index_uuid": "xyz", "shard_id": 3, "reason": "node shutdown", "node_id": "xyz", "status": "INTERNAL_SERVER_ERROR" }
I've also checked _cat/thread_pool/snapshot?v; the output is given below.
node_name name active queue rejected
opensearch-master-1 snapshot 0 0 0
opensearch-master-2 snapshot 0 0 0
opensearch-master-4 snapshot 0 0 0
opensearch-client-1 snapshot 0 0 0
opensearch-master-0 snapshot 0 0 0
opensearch-data-0 snapshot 5 392 0
opensearch-client-4 snapshot 0 0 0
opensearch-client-3 snapshot 0 0 0
opensearch-data-1 snapshot 5 258 0
opensearch-data-2 snapshot 5 280 0
opensearch-client-0 snapshot 0 0 0
opensearch-client-2 snapshot 0 0 0
opensearch-master-3 snapshot 0 0 0
opensearch-data-4 snapshot 5 294 0
opensearch-data-3 snapshot 5 269 0
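To make the backlog in the output above easier to read, the per-node numbers can be totalled with a small script. This is only a minimal sketch: it assumes the text is pasted in with the same five columns shown above, the function name `summarize_snapshot_pool` is hypothetical, and the sample includes the five data nodes plus two of the idle nodes.

```python
def summarize_snapshot_pool(cat_output: str) -> dict:
    """Total active and queued snapshot tasks from pasted thread pool text."""
    lines = [ln.split() for ln in cat_output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Map column names to positions so the column order does not matter.
    col = {name: i for i, name in enumerate(header)}
    busy = {}
    total_active = total_queue = 0
    for row in rows:
        active = int(row[col["active"]])
        queue = int(row[col["queue"]])
        total_active += active
        total_queue += queue
        if active or queue:
            # Record only nodes that are doing or waiting on snapshot work.
            busy[row[col["node_name"]]] = (active, queue)
    return {"active": total_active, "queue": total_queue, "busy_nodes": busy}


# Subset of the output above: all data nodes plus two idle nodes.
SAMPLE = """
node_name name active queue rejected
opensearch-master-1 snapshot 0 0 0
opensearch-client-1 snapshot 0 0 0
opensearch-data-0 snapshot 5 392 0
opensearch-data-1 snapshot 5 258 0
opensearch-data-2 snapshot 5 280 0
opensearch-data-4 snapshot 5 294 0
opensearch-data-3 snapshot 5 269 0
"""

summary = summarize_snapshot_pool(SAMPLE)
print(summary["active"], summary["queue"])  # 25 1493
print(sorted(summary["busy_nodes"]))
```

In this output only the data nodes have snapshot work, with every active slot in use and about 1,500 tasks queued in total, while the rejected count stays at 0 on every node.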
I tested the behaviour by switching back to v2.16.0, and there the snapshot completes successfully within 3-4 minutes. Has anyone faced this issue, or is there a known solution?