Describe the issue:
We have an OpenSearch cluster with 3 master nodes and 30 data nodes. We have a problem whereby 6 of our data nodes see roughly double the search rate of the other 24 nodes, which is causing much higher CPU usage on those 6 nodes. The cluster has ~7 indexes on it, and our standard configuration is 6 primaries and 4 replicas per index.
Things we’ve checked:
The shards are all equal in size and look balanced across the cluster.
Indexing load seems similar across nodes
Client configuration shouldn’t be favouring any one node
Additionally, the 6 affected data nodes:
Are distributed evenly across our availability zones
Do not all hold a primary shard
Any thoughts would be appreciated as we are a bit stumped.
If shards are balanced between nodes and zones, you can check the adaptive selection stats with GET _nodes/stats/adaptive_selection; nodes with a lower rank value are preferred when routing search requests.
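For example, here is a rough sketch of pulling those rank values out of the stats response (this assumes an unauthenticated cluster reachable at http://localhost:9200 and the Python requests package; the exact response layout can vary by version):

```python
import requests

# Query the adaptive selection stats for every node in the cluster.
resp = requests.get("http://localhost:9200/_nodes/stats/adaptive_selection")
resp.raise_for_status()

for node_id, node in resp.json().get("nodes", {}).items():
    name = node.get("name", node_id)
    # Each coordinating node reports its view of the other nodes;
    # a lower rank means that target node is preferred for searches.
    for target_id, stats in node.get("adaptive_selection", {}).items():
        print(f"{name} -> {target_id}: rank={stats.get('rank')}")
```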
Thanks for the suggestion. We had a look at the segments and couldn’t see a problem.
We eventually solved our issue.
We found that the cluster needs to satisfy both of these conditions for each index:
primary count * (replica count + 1) == number of data nodes
(replica count + 1) % number of Availability Zones (AZs) == 0
i.e. every data node holds exactly one shard copy of the index, and each AZ holds the same number of shard copies.
As mentioned, our indexes have 6 primaries and 4 replicas and we are running 30 nodes across 3 AZs.
So: 6 × (4 + 1) = 30 shard copies, which spread evenly across the 30 data nodes, satisfying the first condition.
But the second condition fails: (4 + 1) % 3 = 2, not 0, so the shard copies cannot be split evenly across the 3 AZs.
Search traffic is split roughly evenly across AZs, so an AZ with fewer shard copies still serves the same share of requests; each node in that AZ therefore handles more traffic than the others. This explains our imbalanced search rate/CPU.
We increased our cluster to 36 data nodes and changed the primaries/replicas to 6/5, which solved our issue.
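For anyone hitting the same thing, here is a small standalone Python sketch (not an OpenSearch API, just the arithmetic from the two conditions above) that checks a proposed layout:

```python
def is_balanced(primaries: int, replicas: int, data_nodes: int, azs: int) -> bool:
    """Check the two layout conditions described above."""
    total_copies = primaries * (replicas + 1)
    one_copy_per_node = total_copies == data_nodes
    even_copies_per_az = (replicas + 1) % azs == 0
    return one_copy_per_node and even_copies_per_az

# Original layout: 6 primaries, 4 replicas, 30 data nodes, 3 AZs
print(is_balanced(6, 4, 30, 3))  # False: (4 + 1) % 3 == 2, uneven copies per AZ

# Fixed layout: 6 primaries, 5 replicas, 36 data nodes, 3 AZs
print(is_balanced(6, 5, 36, 3))  # True: 6 * (5 + 1) == 36 and (5 + 1) % 3 == 0
```

Running it shows the original 6/4 layout on 30 nodes fails the second condition, while the 6/5 layout on 36 nodes passes both.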