Unassigned shards after killed containers (blackout)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch 2.11, deployed with official Helm chart in Kubernetes

Describe the issue:

I have a three-node OpenSearch setup in Kubernetes. There is a single index that I write to at 2am every night; nothing else happens on the cluster. One day at 5pm we had a blackout and all pods went down immediately. When the Kubernetes cluster came back up, one container would not start:

{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]", 
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",

Usually I would delete the PVC/disk of that pod and restart it, and everything would run fine again (because my index has two replicas). This time I tried a gentler approach, which I eventually want to automate in the Helm chart: deleting the lock files before starting the container.
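For reference, the brutal approach is roughly this (the PVC and pod names are placeholders; the real PVC name depends on the chart's volume claim template):

# Hypothetical resource names -- substitute the broken node's PVC/pod
kubectl delete pvc data-os-mngr-1   # drop the corrupted disk (stays Terminating until the pod is gone)
kubectl delete pod os-mngr-1        # the StatefulSet recreates the pod and a fresh PVC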

So I deleted the following two files, after which the container started without any warnings/errors:


While the brutal approach of deleting the entire disk works like a charm, deleting only the lock files leaves some of my indices in a yellow state.
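The automation I have in mind is roughly the following in the chart values. This is only a sketch: the node.lock path is my assumption based on the default data layout, and the volume name has to match whatever the chart calls its data volume.

extraInitContainers:
  - name: clear-stale-node-lock
    image: busybox:1.36
    command:
      - sh
      - -c
      # assumed lock location under the default data path; removes only the lock, never index data
      - rm -f /usr/share/opensearch/data/nodes/0/node.lock
    volumeMounts:
      - name: opensearch-data   # placeholder -- must match the chart's data volume name
        mountPath: /usr/share/opensearch/data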

GET _cat/allocation?v shows me that there are 4 unassigned shards:

shards disk.indices disk.used disk.avail disk.total disk.percent node
    18       43.8mb    45.6mb      1.9gb        2gb            2 os-mngr-0
     1         208b    40.3mb      1.9gb        2gb            1 os-mngr-1
    18       40.1mb    41.9mb      1.9gb        2gb            2 os-mngr-2
     4                                                           UNASSIGNED
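In case it helps with diagnosing, this is what I intended to run next to see why the shards stay unassigned ("my-index" is a placeholder for my actual index; without a request body the API explains the first unassigned shard it finds):

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}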

But my cluster settings are pretty much all at their defaults (including cluster.routing.allocation.enable):

GET _cluster/settings

  "persistent": {
    "plugins": {
      "index_state_management": {
        "template_migration": {
          "control": "-1"
  "transient": {}

I would expect that recovering the node (by removing the lock files) would bring all my indices back to a green state, or, if the shards on that node turn out to be stale or broken, that they would be re-synced from the other two nodes.

Why is my approach not working? Is deleting the disk my only option in this case?
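One thing I have not tried yet is forcing an allocation retry, e.g.:

POST _cluster/reroute?retry_failed=true

but as far as I understand, retry_failed only helps when allocation gave up after repeated failures, which I have not verified here.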


Configuration:

Pretty vanilla configuration through Helm.

Relevant Logs or Screenshots: