Kubernetes OpenSearch operator scaling issue (what is banzaicloud.com/last-applied?)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser)

  • Operator: 2.8.0
  • OpenSearch: 3.2.0
  • Kubernetes Environment
    • EKS 1.30
    • AWS

Describe the issue

Hi. I'm running an OpenSearch cluster with the OpenSearch operator.

There are two issues when changing the node pool replicas.

**[Issue #1] When increasing the node pool replicas, the existing pods also restart.**

(Test: data node pool replicas 3 → 6)

When the number of OpenSearch node pool replicas changes, a new ControllerRevision is also created.


❯ k get pods | grep data
opensearch-data-0                        1/1     Running    0          16m
opensearch-data-1                        1/1     Running    0          14m
opensearch-data-2                        1/1     Running    0          5m25s
opensearch-data-3                        0/1     Init:0/2   0          65s
opensearch-data-4                        0/1     Init:0/2   0          65s
opensearch-data-5                        0/1     Init:0/2   0          65s

kubectl get controllerrevision | egrep -e 'statefulset.apps/opensearch-data|NAME'
NAME                           CONTROLLER                           REVISION   AGE
opensearch-data-58bc78989d     statefulset.apps/opensearch-data     28         88s
opensearch-data-84788b946b     statefulset.apps/opensearch-data     27         20m

After that, the existing pods are also restarted.

kubectl get pods -L controller-revision-hash | egrep -e 'data|NAME'

NAME                                     READY   STATUS     RESTARTS   AGE     CONTROLLER-REVISION-HASH

# Existing pods (why are these restarting?)
opensearch-data-0                        1/1     Running    0          5m31s   opensearch-data-5dd76687b6
opensearch-data-1                        1/1     Running    0          3m43s   opensearch-data-5dd76687b6
opensearch-data-2                        1/1     Running    0          118s    opensearch-data-5dd76687b6

# New pods (restarting because the controller revision hash changed)
opensearch-data-3                        0/1     Init:0/2   0          4s      opensearch-data-5dd76687b6
opensearch-data-4                        1/1     Running    0          8m39s   opensearch-data-58bc78989d
opensearch-data-5                        1/1     Running    0          8m39s   opensearch-data-58bc78989d
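
For reference, the StatefulSet status also shows which revision it considers current and which it is updating to (a sketch; the same namespace as the commands further down is assumed):

# Compare the revision the StatefulSet considers current vs. the one it is rolling out
kubectl get statefulset opensearch-data -n opensearch-prod \
  -o jsonpath='{.status.currentRevision}{"\n"}{.status.updateRevision}{"\n"}'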

If anyone knows, please help me. Is this normal behavior?

If not, is there a way to prevent it?

Or, how can I troubleshoot this?

kubectl get controllerrevision opensearch-data-84788b946b -n opensearch-prod -o yaml > old.yaml
kubectl get controllerrevision opensearch-data-5dd76687b6 -n opensearch-prod -o yaml > new.yaml

diff -u old.yaml new.yaml

The only difference is in the banzaicloud.com/last-applied annotation.
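
In case it helps anyone reproduce the comparison, here is a sketch of the same diff with that annotation stripped out first (this assumes mikefarah's yq v4 is installed and that the annotation sits under the pod template annotations inside the ControllerRevision data):

# Strip the operator-managed annotation before diffing, so only real pod template changes remain
kubectl get controllerrevision opensearch-data-84788b946b -n opensearch-prod -o yaml \
  | yq 'del(.data.spec.template.metadata.annotations."banzaicloud.com/last-applied")' > old-clean.yaml
kubectl get controllerrevision opensearch-data-5dd76687b6 -n opensearch-prod -o yaml \
  | yq 'del(.data.spec.template.metadata.annotations."banzaicloud.com/last-applied")' > new-clean.yaml
diff -u old-clean.yaml new-clean.yaml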


**[Issue #2] When reducing the OpenSearch node pool replicas, the SmartScaler does not work properly.**

(Test: data node pool replicas 6 → 3)

# kubectl get opensearchclusters opensearch -o jsonpath="{.status}"
{
  "availableNodes": 10,
  "componentsStatus": [
    {
      "component": "Restarter",
      "status": "InProgress"
    },
    {
      "component": "Scaler",
      "description": "data",
      "status": "Excluded"
    }
  ],
  "health": "yellow",
  "initialized": true,
  "phase": "RUNNING",
  "version": "3.2.0"
}

# kubectl get pods
opensearch-data-0                        1/1     Running   0          2m8s
opensearch-data-1                        1/1     Running   0          13m
opensearch-data-2                        1/1     Running   0          11m


# kubectl get opensearchclusters opensearch -o jsonpath="{.status}"
{
  "availableNodes": 9,
  "componentsStatus": [
    {
      "component": "Restarter",
      "status": "InProgress"
    }
  ],
  "health": "red",
  "initialized": true,
  "phase": "RUNNING",
  "version": "3.2.0"
}

Since SmartScaler is enabled, I expected it to properly migrate the shards from the data nodes being scaled down before reducing them, but it seems this is not happening correctly.
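
For comparison, this is roughly what a graceful drain looks like when done by hand with the standard allocation-exclude setting before a data node is removed (a sketch only; the node name is an example, and I expected the operator to do the equivalent automatically):

# Exclude the node that is about to be removed so its shards relocate first
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "opensearch-data-5"
  }
}

# Then wait until no shards are reported on that node before lowering the replicas
GET _cat/shards?v=true&h=index,shard,prirep,state,node&s=node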

[DEV Tools]

# GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
index                            shard prirep state      node              unassigned.reason

security-auditlog-2025.09.08     0     p      UNASSIGNED                   NODE_LEFT
security-auditlog-2025.09.08     0     r      UNASSIGNED                   NODE_LEFT
.kibana_1                        0     p      UNASSIGNED                   NODE_LEFT
.kibana_1                        0     r      UNASSIGNED                   NODE_LEFT
security-auditlog-2025.08.30     0     p      UNASSIGNED                   NODE_LEFT
security-auditlog-2025.08.30     0     r      UNASSIGNED                   NODE_LEFT
top_queries-2025.09.02-00378     0     p      STARTED    opensearch-data-1 
top_queries-2025.09.02-00378     0     r      STARTED    opensearch-data-2 


# GET _cluster/settings
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    }
  }
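
To dig into why the shards stay unassigned, the allocation explain API can be asked about one of them (index and shard taken from the output above):

# Ask the cluster why a specific unassigned primary cannot be allocated
GET _cluster/allocation/explain
{
  "index": ".kibana_1",
  "shard": 0,
  "primary": true
}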

Is this the intended behavior, or could there be something misconfigured on my side?

Configuration

  • The reason I set drainDataNodes: false is that the PVC volumes already exist, and I wanted to prevent shard relocation during a simple restart.
spec:
  ..
  ..
  confMgmt:
    smartScaler: true

  general:
    ..
    drainDataNodes: false
..
..
  nodePools:
  - additionalConfig:
      plugins.security.audit.config.enable_rest: "false"
      plugins.security.audit.config.enable_transport: "false"
      plugins.security.enable_snapshot_restore_privilege: "false"
      plugins.security.ssl_cert_reload_enabled: "true"
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              opster.io/opensearch-nodepool: data
          topologyKey: kubernetes.io/hostname
    annotations:
      ad.datadoghq.com/opensearch.checks: |
        {
          "elastic": {
            "init_config": {},
            "instances": [
              {
                "tls_verify": false,
                "url": "https://%%host%%:9200",
                "username": "ENC[k8s_secret@opensearch-prod/admin-credentials-secret/username]",
                "password": "ENC[k8s_secret@opensearch-prod/admin-credentials-secret/password]",
                "index_stats": "true",
                "pshard_stats": "true",
                "cat_allocation_stats": "true",
                "pending_task_stats": "true"
              }
            ]
          }
        }
    component: data
    diskSize: 1000Gi
    env:
    - name: DISABLE_INSTALL_DEMO_CONFIG
      value: "true"
    nodeSelector:
      karpenter.sh/nodepool: opensearch-nodepool
    pdb:
      enable: true
      maxUnavailable: 1
    persistence:
      pvc:
        accessModes:
        - ReadWriteOnce
        storageClass: ebs-gp3
    replicas: 3
    resources:
      limits:
        memory: 20Gi
      requests:
        cpu: 3000m
        memory: 20Gi
    roles:
    - data
    - ingest
    tolerations:
    - effect: NoSchedule
      key: karpenter.sh/nodepool
      operator: Equal
      value: opensearch-nodepool

Relevant Logs or Screenshots:

Thanks!

@jameskim The restart is a known issue; I would recommend raising a bug here. It happens because the configuration gets reshuffled and is therefore seen as a change that needs to be reapplied.

Regarding the graceful scale-down, both settings (smartScaler and drainDataNodes) need to be set to true; otherwise the Scaler component is skipped.
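
In other words, relative to the spec in the question, a graceful scale-down would need something like this (sketch of only the relevant fields):

spec:
  confMgmt:
    smartScaler: true
  general:
    drainDataNodes: true   # both flags enabled so the Scaler can drain data nodes before removing them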

Hi! Thanks for replying.

  1. The pods were restarting because of RecoveryMode, so I'm going to disable this setting if it's not necessary. (I'm going to look at the source code in more detail.)

  2. SmartScaler doesn't work properly because there is a bug in the code (when using the CatShards API),
    so I will modify the source code myself and use that.

  3. drainDataNodes is disabled. Since the PVCs preserve the data anyway,
    I turned it off so the shards don't have to relocate when a pod restarts.
