Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Describe the issue:
I have installed OpenSearch in a Kubernetes cluster using our own Helm chart.
I have two data pods, three master pods and one ingest pod.
I force-deleted all the pods using “kubectl delete pods -n sample --all --force --grace-period=0”. After this, the data pod is still not ready even after 20 minutes.
It fails with an error about the lock ID.
Can anyone help me find where this lock is stored, and is there any way to remove it without causing errors?
Relevant Logs or Screenshots:
“org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/opensearch/data]] with lock id ; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was )?”
It seems that multiple nodes are using the same mounted volume; you should check that configuration.
@YaswaniT Did you delete the PVC after deleting the pods?
No, I can't delete the PVC; it holds a few indices. As part of the reboot, I had to force-delete the pods. Is there any other way to remove the lock?
@YaswaniT Force-deleting the pod without letting the service stop gracefully can cause the reported issue. If you look at the data in the related PVC, you’ll find a leftover lock file.
This file is removed when the service stops gracefully. Because you forced the deletion, the file remained.
Do you have snapshots/backups of those indices?
I’ve reproduced your issue. When I removed the node.lock file, the OpenSearch pod started again.
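For reference, the manual fix could be scripted roughly as below. This is a minimal sketch, not an official OpenSearch tool: the helper name `remove_stale_node_lock` is hypothetical, and it assumes the `<data>/nodes/0/node.lock` layout mentioned later in this thread (with the data path, e.g. `/opt/opensearch/data` from the error above, mounted while the OpenSearch container is stopped, for instance from a temporary debug pod). As an extra safety check it tries a non-blocking POSIX lock first, on the assumption that a live JVM holding the node lock would make that attempt fail; treat that as a heuristic, not a guarantee, especially on network filesystems.

```python
import fcntl
import os


def remove_stale_node_lock(data_path: str, node_ordinal: int = 0) -> bool:
    """Hypothetical helper: delete <data_path>/nodes/<ordinal>/node.lock if it looks stale.

    Returns True if a lock file was removed, False if there was none.
    Raises RuntimeError if another process appears to hold the lock.
    """
    lock_file = os.path.join(data_path, "nodes", str(node_ordinal), "node.lock")
    if not os.path.exists(lock_file):
        return False
    with open(lock_file, "r+b") as fh:
        try:
            # Non-blocking exclusive POSIX record lock; fails if a running
            # process (assumed: the OpenSearch JVM) currently holds it.
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            raise RuntimeError(f"{lock_file} is held by a running process") from exc
        fcntl.lockf(fh, fcntl.LOCK_UN)
    # Nobody holds the lock, so the file is a leftover from an unclean shutdown.
    os.remove(lock_file)
    return True
```

Only run something like this while no OpenSearch node is using that data directory; deleting the lock under a live node could let two nodes write to the same data.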
@YaswaniT Why do you force-delete the pods? Graceful deletion also works and prevents node.lock issues.
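To make the difference with the command from the original question concrete, here is a small sketch that builds a graceful `kubectl delete pod` invocation (no `--force`, no `--grace-period=0`). The function name and the pod name in the usage note are hypothetical; the flags themselves are standard `kubectl` options. Without `--grace-period`, Kubernetes falls back to the pod's `terminationGracePeriodSeconds`, which should give OpenSearch time to shut down and release `node.lock`.

```python
from typing import List, Optional


def kubectl_delete_pod_args(pod: str, namespace: str,
                            grace_period: Optional[int] = None) -> List[str]:
    """Hypothetical helper: argv for a graceful (non-forced) pod deletion."""
    args = ["kubectl", "delete", "pod", pod, "-n", namespace]
    if grace_period is not None:
        # kubectl only accepts --grace-period=0 together with --force,
        # which is exactly the ungraceful path we want to avoid here.
        if grace_period < 1:
            raise ValueError("grace_period must be >= 1 for a graceful deletion")
        args.append(f"--grace-period={grace_period}")
    return args
```

Usage could look like `subprocess.run(kubectl_delete_pod_args("opensearch-data-0", "sample"), check=True)`, where `opensearch-data-0` stands in for the actual pod name.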
Hi @pablo,
We have a test case that force-deletes the pods to check that the cluster recovers without errors. Since this lock issue arises in resilience scenarios (force deletion, a node going down, or a power outage), we are trying to find a solution.
So if I remove the node.lock file from “<PV_data_folder>/nodes/0/”, will OpenSearch work?
Hi @pablo, I removed node.lock and restarted the failed pod manually, as it did not restart on its own. The issue no longer occurs.
Thanks for your help.