Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.14.1 and 2.16.1
Describe the issue:
While upgrading the cluster, it is expected that replicas cannot be assigned to nodes running the older binary (2.14.1) if the primary shard is allocated to a node running the newer binary (2.16.1). During a recent upgrade, however, we observed that this was not the case.
As can be seen in the screenshots below, the replica for shard 6 is successfully running on a node with the older binary (2.14.1), whereas the primary for shard 6 is on a node with the newer binary (2.16.1).
Unfortunately, I could not find this behavior documented in the OpenSearch documentation, but the Elasticsearch documentation states that this should not happen:
During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas assigned to a node with the old version. The new version might have a different data format that is not understood by the old version. If it is not possible to assign the replica shards to another node (there is only one upgraded node in the cluster), the replica shards remain unassigned and status stays yellow.
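For context, the cluster's own allocation reasoning for such a replica can be inspected with the cluster allocation explain API. Below is a minimal sketch using Python and the requests library; the endpoint (localhost:9200) and index name (my-index) are assumptions for illustration, not the real values:

```python
import json

import requests

CLUSTER = "http://localhost:9200"  # hypothetical endpoint

# Ask the cluster to explain the allocation of a replica copy of shard 6.
# "my-index" is a placeholder for the real index name.
resp = requests.get(
    f"{CLUSTER}/_cluster/allocation/explain",
    json={"index": "my-index", "shard": 6, "primary": False},
)
resp.raise_for_status()

# In a mixed-version cluster we would expect the node-version decider to
# reject nodes that are older than the node holding the primary.
print(json.dumps(resp.json(), indent=2))
```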
In the upper part of the attached screenshot, we can see the configuration of the different nodes in the cluster, along with the IP they're hosted on and the OpenSearch version they're running. All nodes in the screenshot are data nodes.
In the lower part of the attached screenshot, we see the shard allocation across the different nodes. Taking shard 6 as the example: the primary for shard 6 is on the node with IP 10.74.31.155, which is running 2.16.1, while one of the replicas for shard 6 is on the node with IP 10.68.221.141, which is running 2.14.1.
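Since the screenshots may not reproduce well, the same information can be pulled from the _cat APIs. Here is a small sketch (same assumptions as above: Python, requests, cluster on localhost:9200) that joins node versions with shard placement:

```python
import requests

CLUSTER = "http://localhost:9200"  # hypothetical endpoint

# Map node name -> OpenSearch version.
nodes = requests.get(
    f"{CLUSTER}/_cat/nodes",
    params={"format": "json", "h": "name,ip,version,node.role"},
).json()
version_by_node = {n["name"]: n["version"] for n in nodes}

# List every shard copy together with the node it is assigned to.
shards = requests.get(
    f"{CLUSTER}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,node"},
).json()

for s in shards:
    node = s.get("node")
    print(
        s["index"], s["shard"], s["prirep"], s["state"],
        node, version_by_node.get(node, "unassigned"),
    )
```

For shard 6, such a listing should show the replica started on the 2.14.1 node while its primary sits on the 2.16.1 node, matching the screenshot.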
Observations
We observed that if we try to manually reallocate the replica shard, the reallocation fails with the following message:
[NO(cannot allocate replica shard to a node with version [2.14.1] since this is older than the primary version [2.16.1])]
This rejection is expected.
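If the manual reallocation was attempted through the cluster reroute API, the call would look roughly like the sketch below (index name and node names are placeholders, not the real values):

```python
import requests

CLUSTER = "http://localhost:9200"  # hypothetical endpoint

# Attempt to move the replica of shard 6 onto the 2.14.1 node.
# "my-index", "node-2161", and "node-2141" are placeholder names.
resp = requests.post(
    f"{CLUSTER}/_cluster/reroute",
    params={"explain": "true"},
    json={
        "commands": [
            {
                "move": {
                    "index": "my-index",
                    "shard": 6,
                    "from_node": "node-2161",
                    "to_node": "node-2141",
                }
            }
        ]
    },
)

# The "explanations" section of the response carries decider output of the
# form: NO(cannot allocate replica shard to a node with version [2.14.1] ...)
print(resp.json())
```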
Configuration:
The cluster is running a total of 7 nodes: 3 master nodes and 4 data nodes. Two data nodes are running version 2.14.1, whereas the other two data nodes are running version 2.16.1.
The two upgraded nodes were upgraded sequentially. Before the upgrade of each node, the following setting was applied:
cluster.routing.allocation.enable: primaries
After the upgrade of each node, the setting was changed back to:
cluster.routing.allocation.enable: all
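For completeness, these toggles can be applied through the cluster settings API. The following is a minimal sketch with the same assumptions as above (Python, requests, cluster on localhost:9200; the use of a transient setting is an assumption, not necessarily the exact commands we ran):

```python
import requests

CLUSTER = "http://localhost:9200"  # hypothetical endpoint

def set_allocation(value: str) -> None:
    """Update cluster.routing.allocation.enable as a transient setting."""
    resp = requests.put(
        f"{CLUSTER}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.enable": value}},
    )
    resp.raise_for_status()

# Before taking a node down for upgrade, restrict allocation to primaries.
set_allocation("primaries")

# ... upgrade and restart the node, wait for it to rejoin the cluster ...

# Once the node is back, re-enable full shard allocation.
set_allocation("all")
```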
Relevant Logs or Screenshots: