Seemingly unrecoverable error after repurposing Master+Data --> Master

I updated my cluster’s certificates with more meaningful names & updated expiries. I doubt this is related, but figured I’d mention it.

While I was at it, I noticed I had accidentally set my master node to be a data node as well, so I turned that off and ran ./elasticsearch-node repurpose. My cluster setup (it’s just a dev environment) is 5 data + 1 master + 1 coordinating.

When restarting the cluster, I couldn’t issue any commands due to: Open Distro Security not initialized

Logs showed: Not yet initialized (you may need to run securityadmin)

Security admin would not run because the cluster state was RED. Overriding this gave the error that the primary shard for .opendistro_security was unavailable.

But I couldn’t issue any routing commands to fix this, due to: Open Distro Security not initialized

So I’m stuck in a chicken-and-egg problem.

I have since wiped all my data and restarted OK, so this question is more of: what could I have done? What am I missing? Why was a missing primary shard a problem? There should have been at least 4 replica shards available as that index appears to have N-1 replicas, where N is the number of data nodes. Having a scenario that is unrecoverable is concerning.

Lastly, after only wiping /var/lib/elasticsearch (the data directory) on all data nodes to fix this, somehow, config.yml was overwritten with a default. Why and how could this even happen?

@lightbulb There are a number of things you can do to avoid this in the future.

First, a single master node is not going to work in this case. To run ./elasticsearch-node repurpose the node needs to be down, which means that at that moment you have no master, and therefore nothing in charge of promoting replicas to primary. I managed to reproduce your issue locally with 1 master node, but when I attempted the same with 3 master-eligible nodes everything recovered as expected, because the new master promoted the replica of the security index to primary.

In the future, I would suggest finding out which node holds the primary of the security index and moving it off that node before taking the node down for any reason. That said, there should not be any issues if you have at least 3 master-eligible nodes.

I would also recommend taking backups of the security index using securityadmin.sh with the --retrieve option, which dumps the contents into the relevant yml files. These can later be used to recreate the security index after running the same tool with the -dci option to delete the index. I hope this helps.
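
To make the "find the primary and move it first" suggestion a bit more concrete, here is a rough Python sketch. It assumes you have already fetched the rows from GET _cat/shards/.opendistro_security?format=json while the cluster is still healthy, and it only builds the request body for POST _cluster/reroute; it does not talk to the cluster. The function names and the node names (master-1, data-1, ...) are made up for illustration, so treat this as a starting point rather than a tested procedure.

```python
import json

SECURITY_INDEX = ".opendistro_security"


def primary_row(shard_rows, index=SECURITY_INDEX):
    """Return the _cat/shards row of the started primary for `index`, or None."""
    for row in shard_rows:
        if (row["index"] == index and row["prirep"] == "p"
                and row["state"] == "STARTED"):
            return row
    return None


def build_move_body(shard_rows, node_going_down, data_nodes, index=SECURITY_INDEX):
    """Build a _cluster/reroute body moving the primary off `node_going_down`.

    Returns None if the primary is not on that node, or if every candidate
    data node already holds a copy of the shard (such a move would be rejected).
    """
    row = primary_row(shard_rows, index)
    if row is None or row["node"] != node_going_down:
        return None
    holders = {r["node"] for r in shard_rows if r["index"] == index}
    targets = [n for n in data_nodes if n not in holders]
    if not targets:
        return None
    return {"commands": [{"move": {
        "index": index,
        "shard": int(row["shard"]),  # _cat/shards returns numbers as strings
        "from_node": node_going_down,
        "to_node": targets[0],
    }}]}


# Hypothetical rows, shaped like the JSON output of _cat/shards
rows = [
    {"index": SECURITY_INDEX, "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "master-1"},
    {"index": SECURITY_INDEX, "shard": "0", "prirep": "r",
     "state": "STARTED", "node": "data-1"},
]
body = build_move_body(rows, "master-1", ["data-1", "data-2"])
print(json.dumps(body, indent=2))
```

If the function returns None because every data node already holds a copy, there is nothing to move. In that situation the recovery depends on a master being available to promote one of the replicas, which is the reason for having 3 master-eligible nodes.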