I noticed that there are two lines of versions available for OpenSearch – 1.x and 2.x, both of which were updated on Dec 13, 2022. Can someone please explain the logic behind this?
I ask because my team is currently using version 1.2.4 and we're getting ready to upgrade to the latest version. However, it appears there are two choices:
either upgrade to version 1.3.7, or upgrade to version 2.4.1, and I'm not quite sure which is the right approach.
Hey @Firdaus1, the way we do development is that new feature work goes into the most recent major version, while the previous major release still receives maintenance. So at the moment new feature work happens in the 2.x line, and the 1.3.x line still gets bugfixes and patches.
Also, I would recommend first doing a rolling upgrade to the most recent 1.3.x version, then upgrading to the 2.x line. I believe the common patterns for upgrading across major versions are either a blue/green deployment, where you spin up a new cluster and migrate traffic and data to it, or an in-place upgrade, which involves downtime.
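For what it's worth, the node-by-node part of a rolling upgrade usually looks something like this (a sketch using the cluster settings API; adjust the host/port, and add credentials/TLS if the security plugin is enabled):

```shell
# Before taking a node down: stop replicas from being reallocated
# while the node is offline (standard rolling-upgrade practice).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

# Optionally flush so recovery after restart is faster.
curl -X POST "localhost:9200/_flush"

# ...stop the node, upgrade the OpenSearch package, start the node...

# Once the node has rejoined, restore full shard allocation
# (setting the value to null reverts it to the default, "all").
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'
```

Then repeat for the next node once the cluster is green again.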
We recently took a few clusters from 1.3.1 and 1.3.2 to 2.4.0 and ran into two issues.
There seems to be a bug in 1.3.x releases prior to 1.3.7 where, if the cluster was ever in OpenDistro for Elasticsearch (ODFE) compatibility mode, the cluster gets stuck in that mode even if you manually disable compatibility. This is a problem because, while in ODFE mode, OpenSearch serializes certain inter-node messages using Java class names that match what ODFE uses. OpenSearch 1.x releases have logic to translate the ODFE class names to OpenSearch class names, but 2.x releases do not.
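If you want to see whether a compatibility override is still explicitly set before attempting the jump to 2.x, you can inspect the cluster settings. A sketch; `compatibility.override_main_response_version` is the compatibility setting I'm aware of, though the stuck serialization mode described above may be tracked separately from this setting:

```shell
# Show any compatibility settings, including defaults.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.compatibility*"

# Clear the override if it is set (null removes the explicit setting).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "compatibility.override_main_response_version": null
  }
}'
```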
We could join an OpenSearch 2.x node to the cluster, but it would throw exceptions about unknown class names in user and security messages and fail to take on any shards. We had to wait for 1.3.7 to be released, then update all 1.3.x nodes to 1.3.7, before we could perform a rolling upgrade to 2.4.0. Updating to 1.3.6 (the latest at the time) did not help.
This was more of a self-inflicted issue from not double-checking existing shard allocation awareness settings and from keeping replica allocation enabled while performing a rolling upgrade (internal policy, don't ask). For context, we handle upgrades by removing one node from the cluster, upgrading it, returning it to the cluster, waiting for all shards to reallocate (including replicas), then moving on to the next node. Repeat until all nodes are updated.
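Our "wait for all shards to reallocate" step is essentially a poll on cluster health, roughly like this (a sketch; host/port and auth are placeholders for your setup):

```shell
# Block until the cluster reports green (all primaries and replicas
# assigned), or give up after 10 minutes.
curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=10m"

# Or inspect the shard counters directly:
curl -s "localhost:9200/_cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards"
```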
When upgraded nodes rejoin the cluster and take on a primary shard, that shard (along with any replicas of it) can no longer be allocated to nodes running the previous OpenSearch version. In our case, the first data node was updated successfully and rejoined the cluster. It had several replica shards assigned to it, including shards from the next node to be upgraded. When we shut down that next node, all of those replica shards on node 1 were promoted to primaries, and so their replicas could not be allocated to any of the other nodes in the cluster (all still running the previous version).
Node 2 was upgraded and re-added to the cluster, but it was in the same availability zone as node 1, so the replicas would not allocate to node 2. Since our process requires 100% shard allocation before continuing to the next node, the process stalled. We had to temporarily disable shard allocation awareness, complete the rolling upgrade, then re-enable shard allocation awareness.
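For anyone who hits the same stall, temporarily clearing the awareness attributes looked roughly like this. A sketch: `zone` is just an example attribute name, and if awareness was configured under `persistent` settings (or in `opensearch.yml`), you would need to clear it there instead of `transient`:

```shell
# Temporarily disable allocation awareness so replicas can be
# allocated regardless of zone (null clears the transient setting).
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.awareness.attributes": null
  }
}'

# ...finish the rolling upgrade, then restore awareness...
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}'
```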