Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11, deployed with the official Helm chart on Kubernetes
Describe the issue:
I have a three-node OpenSearch setup in Kubernetes. I have a single index that I write to at 2 am every night; nothing else happens on the cluster. One day at 5 pm we had a blackout and all pods went down immediately. When the Kubernetes cluster came back up, one container would not start:
{"type": "server", "timestamp": "2023-11-24T14:52:02,526Z", "level": "ERROR", "component": "o.o.b.OpenSearchUncaughtExceptionHandler", "cluster.name": "os", "node.name": "os-mngr-1", "message": "uncaught exception in thread [main]",
"stacktrace": ["org.opensearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/opensearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?",
...
Usually I would delete the PVC/disk of that pod, restart it, and everything would be running fine again (because my index has two replicas). This time I tried a gentler approach that I eventually want to automate in the Helm chart: deleting the lock files before starting the container.
So I deleted the following two files, and the container started without any warnings or errors:
/usr/share/opensearch/data/nodes/0/node.lock
/usr/share/opensearch/data/nodes/0/_state/write.lock
While the brute-force approach of deleting the entire disk works like a charm, deleting the lock files leaves some of my indices in a yellow state.
GET _cat/allocation?v
shows me that there are 4 unassigned shards:
shards disk.indices disk.used disk.avail disk.total disk.percent node
    18       43.8mb    45.6mb      1.9gb        2gb            2 os-mngr-0
     1         208b    40.3mb      1.9gb        2gb            1 os-mngr-1
    18       40.1mb    41.9mb      1.9gb        2gb            2 os-mngr-2
     4                                                           UNASSIGNED
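I have not run the allocation-explain API against these shards yet, but I assume a request along these lines (the index name below is just a placeholder for mine) would show why they stay unassigned:
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}
As far as I understand, the same request without a body should explain the first unassigned shard it finds, which would probably be enough here.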
But my cluster settings are pretty much all at their defaults (e.g. cluster.routing.allocation.enable):
GET _cluster/settings
{
  "persistent": {
    "plugins": {
      "index_state_management": {
        "template_migration": {
          "control": "-1"
        }
      }
    }
  },
  "transient": {}
}
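I have only checked the persistent and transient settings above; to confirm that cluster.routing.allocation.enable really is at its default, I assume the effective defaults could be included in the response with something like:
GET _cluster/settings?include_defaults=true&flat_settings=true
and then looking for cluster.routing.allocation.enable in the defaults section.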
I would expect that recovering the node this way (by removing the lock files) would bring all my indices back to a green state, or, if the node's copies turn out to be stale or broken, that they would be re-synced from the other two nodes.
Is my approach not working? Is deleting the disk my only option in this case?
Configuration:
Pretty vanilla configuration through the Helm chart
Relevant Logs or Screenshots: