Restarting OpenSearch Corrupts Cluster

Hello OpenSearch community,

There’s an issue that my team and I are coming across frequently and I wanted to run it by you here and see if there’s a known solution.

We’re running OpenSearch (v1.2.4) on a Kubernetes cluster that undergoes patching and upgrades very frequently (on a weekly basis).
The issue is that every time the cluster is patched, all the pods (similar to VMs) are forcefully shut down and then started back up again.
This leaves our OpenSearch cluster in a corrupted state; let me explain what I mean.

When the data and master nodes are started back up (after being forcefully shut down), they try to form a cluster using the existing data and cluster ID.
However, what ends up happening is one of the following:

- The data and master nodes end up forming an entirely new cluster (OpenSearch creates new folders under /data/nodes, such as folders “3”, “4” and “5”) and disregard any previously stored data.
- The cluster recovers fine with a green cluster state.
- The cluster recovers but stays in a red cluster state (and it appears there are unassigned shards).
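
To tell these outcomes apart after a restart, a small check along these lines might help. This is only a sketch: it assumes the REST API is reachable without authentication or TLS at whatever base URL you pass in, and the helper names `fetch_json` and `summarize_health` are made up for illustration. It reads the `status`, `number_of_nodes` and `unassigned_shards` fields from the `_cluster/health` response.

```python
import json
import urllib.request


def fetch_json(base_url, path):
    """GET base_url + path and decode the JSON response body."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_health(health):
    """Condense a _cluster/health response into a one-line summary."""
    status = health.get("status", "unknown")
    nodes = health.get("number_of_nodes", 0)
    unassigned = health.get("unassigned_shards", 0)
    return f"status={status} nodes={nodes} unassigned_shards={unassigned}"
```

For example, `summarize_health(fetch_json("http://localhost:9200", "/_cluster/health"))` (the URL is an assumption) gives the summary line, and comparing `fetch_json(base_url, "/")["cluster_uuid"]` before and after patching should show whether the nodes rejoined the old cluster or formed a new one.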

My guess is that OpenSearch operates like a database in that it needs a proper shutdown and startup procedure, rather than having its process forcefully killed.

Here’s what I found in the documentation for “Full-cluster restart and rolling restart”, and we’re considering following this procedure (Full-cluster restart and rolling restart | Elasticsearch Guide [8.6] | Elastic):
- disable shard allocation
- run a flush command (which I think stops indexing)
- shut down all nodes, using one of these methods:

  • sudo systemctl stop elasticsearch.service
  • kill $(cat pid)

- start master nodes, followed by data nodes
- re-enable shard allocation
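
The API side of that procedure could be sketched roughly as below. This is not a tested runbook: it assumes an unauthenticated HTTP endpoint, the helper names are hypothetical, and the `"primaries"` value follows the linked guide rather than anything verified against v1.2.4. Stopping and starting the nodes themselves happens outside this script.

```python
import json
import urllib.request

ALLOCATION_KEY = "cluster.routing.allocation.enable"


def pre_shutdown_requests():
    """Requests to send before stopping any node."""
    return [
        # Restrict allocation to primaries so replicas are not shuffled around
        ("PUT", "/_cluster/settings", {"persistent": {ALLOCATION_KEY: "primaries"}}),
        # Flush so that less translog has to be replayed on startup
        ("POST", "/_flush", None),
    ]


def post_startup_requests():
    """Request to send once all nodes have rejoined the cluster."""
    # Setting the key to null removes the override and restores the default
    return [("PUT", "/_cluster/settings", {"persistent": {ALLOCATION_KEY: None}})]


def send(base_url, method, path, body=None):
    """Send one request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The idea would be to loop over `pre_shutdown_requests()` with `send(...)` before the patching window, and over `post_startup_requests()` after the master and data nodes are back.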

Another question I had: does it make sense to give OpenSearch its own server, so that it isn’t deployed alongside the web applications on a Kubernetes cluster that undergoes regular upgrades?

Thank you for your input,
Firdaus

I am going to very strongly recommend migrating to the OpenSearch Operator. You can manage this manually, as you have pointed out, but there are a lot of steps. The operator does the heavy lifting for you.