OpenSearch Operator scale-down issues

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Opensearch: 2.19.0

Opensearch-operator: latest

Describe the issue:
The OpenSearch cluster has been set up using the OpenSearch operator mainly to get capabilities like efficient scaling, especially draining, relocation, and graceful deletion of nodes during scale-down.

Further, smartScaler is also enabled in opensearch-cluster.yaml.
When testing the scale-down operation, the below issues were found:

  1. Intermittently the cluster goes to a red state because, when scaling down the data nodes, some indices' shards are not drained or re-allocated properly.

  2. This happens most often with system indices, ML-plugin indices, and any other index that has only 1 primary and 1 replica.

Is there any way to fix this so that the operator handles the scaling reliably? Or is this something currently being triaged?
Please help.
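
For clarity, these are the two settings in the same opensearch-cluster.yaml that we expect to handle the graceful scale-down (a trimmed sketch showing only the relevant fields; the comments reflect our understanding of their behaviour):

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
spec:
  confMgmt:
    smartScaler: true      # operator should remove scaled-down nodes from the cluster gracefully
  general:
    drainDataNodes: true   # data nodes should be drained (excluded from allocation) before deletion on scale-down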

Configuration:

The cluster is installed on GKE using the operator.
The data nodes are scaled up and down often, based on TPS.

@Nagpraveen Could you share your full opensearchclusters manifest?

@pablo : Good day.
Below is the OpenSearchCluster manifest used with the operator:

#Minimal configuration of a cluster with version 2.X of the operator.
#Note the replacement of 'master' role with 'cluster_manager' on line 49
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
  namespace: default
  resourceVersion: "14866689"
spec:
  confMgmt:
    smartScaler: true
  security:
    config:
      adminSecret:
        name: opensearch-admin-tls
      adminCredentialsSecret:
        name: os-admin-credentials
    tls:
      http:
        generate: false
        secret:
          name: opensearch-node-tls
        caSecret:
          name: opensearch-ca
      transport:
        generate: false
        perNode: false
        secret:
          name: opensearch-node-tls
        caSecret:
          name: opensearch-ca
        nodesDn: ["****************************"]
        adminDn: ["****************************"]
  general:
    httpPort: 9200
    serviceName: opensearch-cluster
    version: 2.19.0
    setVMMaxMapCount: true
    pluginsList:
      - analysis-phonetic
      - mapper-murmur3
    drainDataNodes: true
    # additionalVolumes:
    #   - name: fileshare-pv
    #     path: /usr/share/opensearch/config/custom
    #     persistentVolumeClaim:
    #       claimName: fileshare-pvc
  dashboards:
    opensearchCredentialsSecret:
      name: os-admin-credentials
    version: 2.19.0
    enable: true
    replicas: 1
    nodeSelector:
      node-role: "monitor"
    tolerations:
      - key: "opensearch"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "4Gi"
        cpu: "2"
  nodePools:
    - component: master
      replicas: 3
      jvm: -Xms4g -Xmx4g
      diskSize: 100Gi
      nodeSelector:
        node-role: "master"
      resources:
        requests:
          memory: "8Gi"
          cpu: "3"
        limits:
          memory: "8Gi"
          cpu: "3"
      roles:
        - "master"
        - "remote_cluster_client"
      # volumeMounts:
      #   - name: fileshare-pv
      #     mountPath: /usr/share/opensearch/config/custom
      # volumes:
      #   - name: fileshare-pv
      #     persistentVolumeClaim:
      #       claimName: fileshare-pvc
      persistence:
        pvc:
          accessModes:
            - ReadWriteOnce
          storageClass: "standard"
    - component: data
      replicas: 1
      jvm: -Xms60g -Xmx60g -XX:G1ReservePercent=10 -XX:InitiatingHeapOccupancyPercent=60 -XX:MaxGCPauseMillis=150 -XX:ConcGCThreads=14 -XX:ParallelGCThreads=24
      diskSize: 1500Gi
      nodeSelector:
        node-role: "data"
      tolerations:
        - key: "opensearch"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      resources:
        requests:
          memory: "230Gi"
          cpu: "29"
        limits:
          memory: "235Gi"
          cpu: "29"
      roles:
        - data
        - ingest
        - remote_cluster_client
      persistence:
        pvc:
          accessModes:
            - ReadWriteOnce
          storageClass: "pd-ssd"
    - component: ml
      replicas: 1
      jvm: -Xms12500m -Xmx12500m
      diskSize: 500Gi
      tolerations:
        - key: "opensearch"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        node-role: "ml"
      resources:
        requests:
          memory: "25Gi"
          cpu: "30"
        limits:
          memory: "25Gi"
          cpu: "30"
      roles:
        - "ml"
        - "remote_cluster_client"
      persistence:
        pvc:
          accessModes:
            - ReadWriteOnce
          storageClass: "pd-ssd"

Hi @Nagpraveen ,
According to the given manifest, the number of replicas for the cluster_manager, data, and ML roles each starts at one. If TPS increases, do you only scale the data nodes up/down, and not the cluster_managers along with them?
I believe that if an index's replica count is above 1, it is natural for the cluster to go red when data nodes are scaled down.

@yeonghyeonKo : Greetings!

  1. If you look at the manifest, the number of master nodes is 3, not 1, as we understand the operator requires a minimum of 3 nodes to perform the manager role reliably. The role is provided as master.

  2. Regarding the data nodes, the manifest starts them at 1, but we are extensively running scaling tests (scale-ups and scale-downs) to verify the stability of the cluster before releasing any TPS to it.

So we ran a few experiments scaling only the data nodes, since all the data/indices are persisted on the data nodes.

First we scaled the data nodes:
1 → 3 = cluster was green
then 3 → 10 = cluster was still green, scale-up successful
then 10 → 3 = scaling back down to 3 made the cluster status become red
then 3 → 25 = scaling up to 25, cluster became green
then 25 → 3 = scaling back from 25 to 3, cluster went to a red status

We ran this experiment multiple times like this with different combinations, but every time we scaled down to 3 and never below 3, both for HA purposes and to leave enough room for relocation.

Note that in each of the above cases we always kept a minimum of 3 data nodes, which should be enough to drain and re-allocate shards properly.

9 out of 10 times the cluster goes to a red state, meaning the draining functionality is very inconsistent.
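
For completeness, each scale step above is just an edit to the data nodePool's replicas field in the manifest posted earlier, which the operator is then expected to reconcile by draining and removing the extra data pods. A trimmed sketch of the 10 → 3 step (all other fields unchanged):

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: opensearch-cluster
  namespace: default
spec:
  nodePools:
    - component: data
      replicas: 3   # changed from 10; with drainDataNodes enabled, the 7 removed nodes should be drained first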