Search request rate imbalance

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.11

Describe the issue:
We have an OpenSearch cluster with 3 master nodes and 30 data nodes. We have a problem whereby 6 of our data nodes see ~double the search rate of the other 24 nodes, which is causing much higher CPU usage on those 6 nodes. The cluster has ~7 indexes. Our standard configuration is 6 primaries and 4 replicas per index.

Things we’ve checked:

  • The shards are all equal in size and look balanced across the cluster
  • Indexing load seems similar across nodes
  • Client configuration shouldn’t be favouring any one node

Additionally, the 6 data nodes are:

  • Distributed over our availability zones equally
  • Not all hosting a primary shard (so it isn’t a primary-only effect)

Any thoughts would be appreciated as we are a bit stumped.

If shards are balanced between nodes and zones, you can check the adaptive selection stats with GET _nodes/stats/adaptive_selection; nodes with a lower rank value are preferred for executing search requests.

"adaptive_selection": {
        "_OkAyzOCTQ23Ez3tuPlICw": {
          "outgoing_searches": 0,
          "avg_queue_size": 0,
          "avg_service_time_ns": 31645992,
          "avg_response_time_ns": 51428164,
          "rank": "51.4"
        },
        "fJ32Cx6FTDOgiKMEnT44-w": {
          "outgoing_searches": 0,
          "avg_queue_size": 0,
          "avg_service_time_ns": 332002,
          "avg_response_time_ns": 735173,
          "rank": "0.7"
        }
      }

Thanks for the suggestion. We had a look at the segments and couldn’t see a problem.

We eventually solved our issue.

We found that the cluster needs to satisfy these conditions:

  1. primaries * (replicas + 1) == number of data nodes
  2. (replicas + 1) % number of Availability Zones (AZs) == 0

i.e. each node holds exactly one shard copy, and every AZ holds the same number of shard copies.

As mentioned, our indexes have 6 primaries and 4 replicas and we are running 30 nodes across 3 AZs.

So: 6 × (4 + 1) = 30 shard copies, which are evenly distributed across the 30 nodes, and condition 1 holds.

But condition 2 fails: (4 + 1) % 3 = 2, not 0, so the shard copies cannot be spread evenly across the 3 AZs.

An AZ with fewer shard copies than the others still serves the same share of requests, so each node in that AZ handles more traffic than nodes elsewhere. This explains our imbalanced search rate and CPU.
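The two conditions and the arithmetic above can be checked with a small helper. This is just a sketch to make the rule concrete; the function name is ours, not an OpenSearch API:

```python
def shard_layout_is_balanced(primaries: int, replicas: int,
                             data_nodes: int, azs: int) -> bool:
    """Check the two balance conditions described above."""
    copies = replicas + 1  # total copies of each shard (primary + replicas)
    one_shard_per_node = primaries * copies == data_nodes  # condition 1
    even_per_az = copies % azs == 0                        # condition 2
    return one_shard_per_node and even_per_az

# Original cluster: 6 primaries, 4 replicas, 30 nodes, 3 AZs
print(shard_layout_is_balanced(6, 4, 30, 3))  # False: (4 + 1) % 3 == 2

# After the fix: 6 primaries, 5 replicas, 36 nodes, 3 AZs
print(shard_layout_is_balanced(6, 5, 36, 3))  # True: 6 * 6 == 36 and 6 % 3 == 0
```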

We increased our cluster to 36 data nodes and changed the primaries/replicas to 6/5, which solved our issue.