Replicas: 3 Masters (currently trying to recover with 1)
The Issue: My transport and HTTP certificates expired. I attempted to rotate them by deleting the Kubernetes secrets and letting the Operator recreate them. While the secrets were recreated successfully, the cluster is now stuck in a deadlock:
Masters are not Ready: The master pods are running but not “Ready” because the Security Plugin is not initialized.
Quorum Blocked: With only 1 replica active for troubleshooting, the node refuses to elect itself as master because it remembers the old 3-node quorum (requires at least 2 nodes).
Security Initialization Loop: I cannot run securityadmin.sh because the REST API is blocked (“Security not initialized”), and the script times out because the cluster state is RED with no elected master.
Circular Dependency: I can’t initialize security because the cluster isn’t up, and the cluster won’t stay up/ready because security isn’t initialized.
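For reference, this is roughly how the state looks from inside one of the master pods (pod name and namespace are placeholders for my setup; yours will differ):

```
# REST API refuses everything until the security index is initialized:
kubectl exec my-cluster-masters-0 -n opensearch -- \
  curl -sk https://localhost:9200/_cluster/health
# -> "OpenSearch Security not initialized."

# The master log shows why no election happens:
kubectl logs my-cluster-masters-0 -n opensearch --tail=200 \
  | grep -iE "cluster_manager_not_discovered|master not discovered|have discovered"
```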
What I’ve tried:
Deleting secrets to force certificate regeneration.
Setting discovery.type: single-node (rejected by the Operator / caused configuration conflicts).
Running securityadmin.sh manually from within the pod (SocketTimeout/Connection refused).
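Roughly the invocation I used from inside a master pod; the certificate paths are placeholders for my deployment, and securityadmin.sh needs a certificate whose DN is listed in plugins.security.authcz.admin_dn:

```
/usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh \
  -cd /usr/share/opensearch/config/opensearch-security \
  -cacert /usr/share/opensearch/config/admin-cert/ca.crt \
  -cert   /usr/share/opensearch/config/admin-cert/tls.crt \
  -key    /usr/share/opensearch/config/admin-cert/tls.key \
  -h localhost -icl -nhnv
# Fails with SocketTimeoutException / connection refused because no cluster manager is elected.
```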
Request: How can I force the Security Plugin to initialize or bypass the quorum check to let securityadmin.sh apply the new certificates to the .opendistro_security index when the cluster is in this state?
Hi, we are about to release a new operator version. This is one of the issues we have fixed. Will you be able to upgrade your operator to the latest version?
Without seeing the actual logs, etc., we cannot say for sure whether the deadlock will be fixed. The new operator version due to be released soon (3.0) fixes many things that should prevent this from happening in the first place.
I understand that operator 3.0 mainly prevents this from happening in the future, but my issue is with the current production cluster, which is already stuck in a deadlock state.
Before taking any destructive action, could you please advise:
Is there any known recovery procedure to fix an already broken cluster like this without losing existing indexed data?
Can upgrading only the operator help recover a cluster stuck with a single master not ready, or is the fix purely preventive?
Just to confirm: upgrading the operator does not upgrade OpenSearch itself, correct?
What specific logs or outputs would you need from me to better analyze the current situation?
This is a production environment and preserving existing log data is critical.
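In the meantime, here is what I can collect right away if it helps (cluster name, pod names, and namespaces are placeholders for my environment):

```
kubectl get pods -n opensearch -o wide
kubectl logs my-cluster-masters-0 -n opensearch --tail=500
kubectl get opensearchcluster my-cluster -n opensearch -o yaml
kubectl get events -n opensearch --sort-by=.lastTimestamp
# Operator's own logs (deployment name/namespace depend on how it was installed):
kubectl logs deployment/opensearch-operator-controller-manager -n opensearch-operator --tail=500
```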
@v1k1ng0 have you looked at hot-reloading the certificates? See the docs.
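Roughly, the hot-reload flow looks like this, assuming plugins.security.ssl_cert_reload_enabled was already set to true before the rotation and an admin certificate (one whose DN is in plugins.security.authcz.admin_dn) is reachable from where you run curl; the paths are placeholders:

```
# Reload the transport-layer certificates on the node:
curl -sk -XPUT "https://localhost:9200/_plugins/_security/api/ssl/transport/reloadcerts" \
  --cert /path/to/admin.crt --key /path/to/admin.key

# Reload the HTTP-layer certificates on the node:
curl -sk -XPUT "https://localhost:9200/_plugins/_security/api/ssl/http/reloadcerts" \
  --cert /path/to/admin.crt --key /path/to/admin.key
```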
I would also recommend checking whether the new certificates were actually loaded into the pods. Did the masters restart, and is that when this happened? Did the CA also expire, or only the leaf certificates?
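For example, you can compare what is in the regenerated secret with what a master pod is actually running with (secret, pod, and path names below are placeholders for your deployment):

```
# Certificate currently stored in the Kubernetes secret:
kubectl get secret my-cluster-transport-cert -n opensearch -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -fingerprint

# Certificate mounted inside the pod:
kubectl exec my-cluster-masters-0 -n opensearch -- \
  openssl x509 -noout -fingerprint \
  -in /usr/share/opensearch/config/tls-transport/tls.crt
# If the fingerprints differ, the pods never picked up the regenerated certificates.
```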
Yes, I reviewed the hot-reloading documentation. As far as I understand, hot reload must be enabled before certificate rotation. I did not originally deploy this cluster, and I’m not sure whether hot reload was configured at the time.
What likely happened is:
Transport certificates expired
We deleted/regenerated certificates
The masters were restarted
After restart, the cluster entered the current deadlock state (single master not ready, no quorum)
At this point, hot reload no longer seems applicable since the cluster cannot form and security cannot be initialized.
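For completeness, one way to check whether the setting is present on the running nodes (the config path is the standard location in the official images):

```
kubectl exec my-cluster-masters-0 -n opensearch -- \
  grep -i "ssl_cert_reload_enabled" /usr/share/opensearch/config/opensearch.yml
# If there is no output, the setting is absent and hot reload is disabled (it defaults to false).
```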
To clarify:
Only the leaf certificates expired, not the CA
Masters did restart around the time the issue started
Certificates inside the pods are now valid and readable, but the cluster is still stuck
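For reference, this is roughly how I verified them (mount paths are from my deployment and may differ elsewhere):

```
# Leaf certificate: validity window, subject, and issuer:
kubectl exec my-cluster-masters-0 -n opensearch -- \
  openssl x509 -noout -dates -subject -issuer \
  -in /usr/share/opensearch/config/tls-transport/tls.crt

# CA certificate: validity window and subject:
kubectl exec my-cluster-masters-0 -n opensearch -- \
  openssl x509 -noout -dates -subject \
  -in /usr/share/opensearch/config/tls-transport/ca.crt
```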
Given the current state (cluster_manager_not_discovered, security not initialized), is there any supported recovery path to bring the cluster back without wiping data, or is this situation unrecoverable once reached?
If you need specific logs or config to assess this, please let me know exactly what to provide.