Deepa
February 25, 2025, 4:12am
1
Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Current version of OpenSearch nodes and Dashboards: 2.14.0
New version required for OpenSearch nodes and Dashboards: 2.19.0
Current version of OpenSearch Operator: 2.5.1
New version required for OpenSearch Operator: 2.7.0
Describe the issue:
I am trying to upgrade the OpenSearch cluster nodes and Dashboards to 2.19.0, and the OpenSearch Operator to 2.7.0.
The cluster is built using the Helm chart on a GKE cluster.
As per the current configuration, we build a Docker image for each of the components above.
The issue I am facing: while upgrading the version, there is downtime for the OpenSearch Operator (as I am doing the operator upgrade first).
After that, the node-by-node restart described in the documentation is not possible for me, because in my current configuration the Docker image is set in the general section, which applies to all nodes.
So the moment I change to the new image version, the change is synced by the ArgoCD application and all nodes, including the cluster-manager nodes, are pointed at the new image.
The OpenSearch Dashboards configuration is also included in the same values.yaml, so Dashboards become unavailable until all the nodes have been restarted and picked up the new version.
As a result there is a significant downtime window during this upgrade. The sketch below shows the shared fields involved.
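For reference, this is a minimal sketch of the shared keys from the values.yaml attached in the next post (repository paths elided as in the original); a single version/image pair under general is inherited by every node pool, and Dashboards sit in the same file:

opensearchCluster:
  general:
    version: 2.14.0            # one cluster-wide version; changing it rolls every node pool
    image: <path to hub repo>  # one image reference shared by all node pools
  dashboards:
    version: 2.14.0            # Dashboards are configured in the same values.yaml
    image: <path to hub repo>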
Configuration:
Configuration attached (see reply below).
Relevant Logs or Screenshots:
Deepa
February 25, 2025, 4:14am
2
OpenSearch Configuration
==========================================================================
opensearchCluster:
  enabled: true
  clusterName: opensearch
  confMgmt:
    smartScaler: true
  general:
    #########################################################################################################
    # GENERAL
    #--------------------------------------------------------------------------------------------------------
    # The general section defines the Docker image
    # and common settings applied to all nodes
    #--------------------------------------------------------------------------------------------------------
    httpPort: "9200"
    version: 2.14.0
    image: <path to hub repo>
    imagePullPolicy: Always
    imagePullSecrets: []
    serviceName: "opensearch"
    drainDataNodes: true
    setVMMaxMapCount: false
  dashboards:
    enable: true
    version: 2.14.0
    image: <path to hub repo>
    imagePullPolicy: Always
    imagePullSecrets: []
    replicas: 2
  nodePools:
    ### master node
    - component: master
      labels:
        lv_status: "BUILD"
        lv_application: "OPENSEARCH"
        lv_sla: "GOLD"
        lv_contact: ""
      replicas: 4
      pdb:
        enable: true
        minAvailable: 2
      diskSize: "30Gi"
      jvm: -Xmx1600M -Xms1600M
      roles:
        - "cluster_manager"
      resources:
        requests:
          memory: "3Gi"
          cpu: "800m"
        limits:
          memory: "3Gi"
          cpu: "800m"
      persistence:
        pvc:
          storageClass: standard-rwo
          accessModes:
            - ReadWriteOnce
    ### hot node
    - component: hot
      labels:
        lv_status: "BUILD"
        lv_application: "OPENSEARCH"
        lv_sla: "GOLD"
        lv_contact: ""
      replicas: 4
      pdb:
        enable: true
        minAvailable: 2
      diskSize: "600Gi"
      jvm: -Xmx4000M -Xms4000M
      nodeSelector:
      resources:
        requests:
          memory: "8Gi"
          cpu: 3000m
        limits:
          memory: "8Gi"
          cpu: 3000m
      roles:
        - "data"
        - "ingest"
      persistence:
        pvc:
          storageClass: premium-rwo
          accessModes:
            - ReadWriteOnce
    ### warm node
    - component: digital-warm
      labels:
        lv_status: "BUILD"
        lv_application: "OPENSEARCH"
        lv_sla: "GOLD"
        lv_contact: ""
      replicas: 4
      pdb:
        enable: true
        minAvailable: 2
      diskSize: "800Gi"
      jvm: -Xmx4000M -Xms4000M
      nodeSelector:
      resources:
        requests:
          memory: "8Gi"
          cpu: "2300m"
        limits:
          memory: "8Gi"
          cpu: "2300m"
      roles:
        - "data"
      persistence:
        pvc:
          storageClass: standard-rwo
          accessModes:
            - ReadWriteOnce
    ### cold node
    - component: digital-cold
      labels:
        lv_status: "BUILD"
        lv_application: "OPENSEARCH"
        lv_sla: "GOLD"
        lv_contact: ""
      replicas: 4
      pdb:
        enable: true
        minAvailable: 2
      diskSize: "800Gi"
      jvm: -Xmx4000M -Xms4000M
      nodeSelector:
      resources:
        requests:
          memory: "8Gi"
          cpu: "1600m"
        limits:
          memory: "8Gi"
          cpu: "1600m"
      roles:
        - "data"
      persistence:
        pvc:
          storageClass: standard-rwo
          accessModes:
            - ReadWriteOnce
### security settings/certs/serviceaccount settings follow.
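The upgrade itself is just this bump against the shared general and dashboards sections above; a minimal sketch of the intended change (once ArgoCD syncs it, the operator restarts all node pools at once):

opensearchCluster:
  general:
    version: 2.19.0            # was 2.14.0
    image: <path to hub repo>  # must point at the matching 2.19.0 build
  dashboards:
    version: 2.19.0            # was 2.14.0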
=======================================================================
OpenSearch Operator Configuration
nameOverride: ""
fullnameOverride: ""
nodeSelector: {}
tolerations: []
securityContext:
  #--------------------------------------------------------------------------------------------------------
  # Force running as a non-root account and skip the init container
  #--------------------------------------------------------------------------------------------------------
  runAsNonRoot: true
  runAsUser: 1000
  #fsGroup: 1000
manager:
  #--------------------------------------------------------------------------------------------------------
  # The pprof endpoint helps with diagnosing memory leak issues
  #--------------------------------------------------------------------------------------------------------
  pprofEndpointsEnabled: true
  securityContext:
    #--------------------------------------------------------------------------------------------------------
    # Force running as a non-root account and skip the init container
    #--------------------------------------------------------------------------------------------------------
    runAsNonRoot: true
    runAsUser: 1000
    allowPrivilegeEscalation: false
  extraEnv:
    - name: SKIP_INIT_CONTAINER
      value: "true"
  resources:
    limits:
      cpu: 200m
      memory: 500Mi
    requests:
      cpu: 100m
      memory: 350Mi
  livenessProbe:
    failureThreshold: 10
    httpGet:
      path: /healthz
      port: 8081
    periodSeconds: 30
    successThreshold: 1
    timeoutSeconds: 3
    initialDelaySeconds: 120
  readinessProbe:
    failureThreshold: 10
    httpGet:
      path: /readyz
      port: 8081
    periodSeconds: 30
    successThreshold: 1
    timeoutSeconds: 3
    initialDelaySeconds: 120
  # ---------------------------------------------------------------------------------------------------------------------
  # Needed to recover cluster-manager pods; without this setting they wait forever
  # with a "waiting for cluster to be available..." error message
  # ---------------------------------------------------------------------------------------------------------------------
  parallelRecoveryEnabled: true
  image:
    #--------------------------------------------------------------------------------------------------------
    # OpenSearch Operator image is pulled through Harbor
    #--------------------------------------------------------------------------------------------------------
    repository: <path to hub repo>
    #tag: "2.5.1"
    ### Upgrading version from 2.5.1 to 2.7.0
    tag: "2.7.0"
    pullPolicy: "Always"
  ## Optional array of imagePullSecrets containing private registry credentials
  imagePullSecrets: []
  # - name: secretName
  dnsBase: cluster.local
  #--------------------------------------------------------------------------------------------------------
  # Log level of the operator. Possible values: debug, info, warn, error.
  # As the pods run in GKE, logs go to the GKE log repository, which has a cost,
  # so the log level is set to warn to save cost.
  #--------------------------------------------------------------------------------------------------------
  loglevel: warn
  #--------------------------------------------------------------------------------------------------------
  # If a watchNamespace is specified, the manager's cache is restricted to
  # objects in that namespace. The default is to watch all namespaces.
  #--------------------------------------------------------------------------------------------------------
  watchNamespace: observability
# Install the Custom Resource Definitions with Helm
installCRDs: true
serviceAccount:
  #######################################################################################
  <ServiceAccount configurations>
  #######################################################################################
kubeRbacProxy:
  enable: false
  securityContext:
    allowPrivilegeEscalation: false
  resources:
    limits:
      cpu: 50m
      memory: 50Mi
    requests:
      cpu: 25m
      memory: 25Mi
  livenessProbe:
    failureThreshold: 10
    httpGet:
      path: /healthz
      port: 10443
      scheme: HTTPS
    periodSeconds: 30
    successThreshold: 1
    timeoutSeconds: 3
    initialDelaySeconds: 120
  readinessProbe:
    failureThreshold: 10
    httpGet:
      path: /healthz
      port: 10443
      scheme: HTTPS
    periodSeconds: 30
    successThreshold: 1
    timeoutSeconds: 30
    initialDelaySeconds: 120
  image:
    #--------------------------------------------------------------------------------------------------------
    # Using the image uploaded to our registry
    #--------------------------------------------------------------------------------------------------------
    repository: "<path>kube-rbac-proxy"
    tag: "v0.15.0"