Data node and shard sizing recommendation

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

  • AWS self-managed OpenSearch 2.19

Describe the issue:

We recently started indexing more fields, which pushed the index size up to about 500 GB per index, configured with 20 shards per index.

One day after that increase, the search rate per minute spiked from an average of 100k to 120k, driving all data nodes to ~97% CPU usage; 2 data nodes crashed and then auto-recovered.

With a total of 570 shards (primary and replica), does this mean 6 data nodes with 16 vCPUs each are actually not enough based on the recommended sizing?

(The recommended sizing is ~1.5 vCPUs per shard.) AOSPERF01-BP02 Check shard-to-CPU ratio - Amazon OpenSearch Service Lens

I temporarily increased to 10 data nodes to stop the issue and am still monitoring, but my questions are:

  • What data node sizing should I configure? More data nodes with more CPU would follow the recommendation, but that is costly, and the recommended sizing seems far too large to be the right allocation.
  • Is a smaller shard size, e.g. 15 GB per shard, better than 25 GB per shard? My assumption is that smaller shards mean smaller, cheaper per-shard searches and less CPU usage, but with requests spread across more shards.
  • Should I make each index smaller, which would also mean smaller shards? For example, weekly indices rather than monthly indices?

Configuration:

  • 6 data nodes - r7g.4xlarge.search
  • 3 master nodes
  • the 4 most recent monthly indices are ~500 GB each with 20 shards to balance them out
  • the older monthly indices (last 3 years) are ~80 GB each with 3 shards (see the verification sketch below)
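For reference, the actual shard layout can be double-checked with the _cat/shards API. A minimal sketch in Python, assuming the requests library; the endpoint URL and credentials are placeholders, not real values:

```python
import requests
from collections import Counter

OPENSEARCH_URL = "https://your-domain-endpoint"  # placeholder endpoint
AUTH = ("master-user", "master-password")        # placeholder credentials

# One row per shard copy: index, shard number, primary/replica, size, node.
resp = requests.get(
    f"{OPENSEARCH_URL}/_cat/shards",
    params={"format": "json", "bytes": "gb", "h": "index,shard,prirep,store,node"},
    auth=AUTH,
)
resp.raise_for_status()
shards = resp.json()

print(f"total shard copies (primary + replica): {len(shards)}")
per_node = Counter(s.get("node") or "UNASSIGNED" for s in shards)
for node, count in per_node.most_common():
    print(f"{node}: {count} shards")
```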

Relevant Logs or Screenshots:

@kitkit Since this is a search-heavy use case, I would recommend keeping primaries at ~25–30 GiB (don't shrink to 15 GiB), as shrinking them would increase the CPU load.
I would also tie the rollover to shard size rather than to a fixed period.
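A minimal sketch of what tying rollover to shard size could look like as an ISM policy, assuming the requests library; the policy id, index pattern, 25 GiB threshold, endpoint, and credentials are placeholders to adapt, and min_primary_shard_size needs a reasonably recent ISM version:

```python
import requests

OPENSEARCH_URL = "https://your-domain-endpoint"  # placeholder endpoint
AUTH = ("master-user", "master-password")        # placeholder credentials

policy = {
    "policy": {
        "description": "Roll over when any primary shard reaches ~25 GiB",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    # Older ISM versions only support min_size / min_index_age / min_doc_count.
                    {"rollover": {"min_primary_shard_size": "25gb"}}
                ],
                "transitions": [],
            }
        ],
        # Auto-attach the policy to new indices matching this (assumed) pattern.
        "ism_template": [{"index_patterns": ["logs-*"], "priority": 100}],
    }
}

resp = requests.put(
    f"{OPENSEARCH_URL}/_plugins/_ism/policies/rollover-by-shard-size",
    json=policy,
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```

This also assumes writes go through an alias with is_write_index set on the current index, which rollover requires.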

AWS Well-Architected calls out ~1.5 vCPUs per active shard as a conservative check. With ~570 shards total on 6× r7g.4xlarge.search (16 vCPU each ⇒ 96 vCPU), you are far below that rule of thumb, which explains the ~97% CPU load. In these situations it is advisable to either add vCPUs (scale out / use a more CPU-dense family) or reduce the number of active shards (larger shards or fewer replicas on hot data). It is not recommended to go over a 40-50 GB shard size, however.
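As a back-of-the-envelope version of that check (treating all 570 shards as active, which overstates the need if older indices are rarely queried):

```python
# Rough shard-to-vCPU check based on the numbers in this thread.
active_shards = 570        # primaries + replicas, all treated as active
vcpus_per_shard = 1.5      # conservative Well-Architected rule of thumb

required_vcpus = active_shards * vcpus_per_shard   # 855
available_vcpus = 6 * 16                           # 6x r7g.4xlarge.search = 96 vCPUs

print(f"needed ~{required_vcpus:.0f} vCPUs, available {available_vcpus}")
print(f"shortfall factor: ~{required_vcpus / available_vcpus:.1f}x")
```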

@Anthony

I increased to 10 data nodes and CPU is now hovering around 50-60%.

I understand that the CPU allocation needs to increase, but 10 data nodes with 160 total vCPUs currently serve us well, so I'm not planning to match the AWS 1.5 vCPU per active shard guideline, because increasing CPU to that level is expensive.

You suggested that I should not shrink the shard size to 15 GB, and that the reason this would increase the CPU load is:

  • Higher CPU load: Each query must hit more shards = more CPU work

  • More coordination overhead: OpenSearch has to merge results from more shards

Is that correct?

My 2nd question, which I would like to understand better: currently 1 index = 1 month of data, which is 500 GB with 20 shards (25 GB per shard).

I am thinking of breaking this into daily/weekly indices rather than monthly indices to keep each index smaller, so that searches hit a smaller index, use less CPU, and respond faster.

What would be the best size per index to aim for?

The use case is that each user will query a smaller index, and the features that query all indices should also not be impacted by having a lot of indices to search.

@kitkit Your understanding is correct: more shards → more per-shard threads + more merging, therefore more CPU for search.
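One way to see that fan-out directly is the _shards section of any search response. A minimal sketch, assuming the requests library; the index pattern, endpoint, and credentials are placeholders:

```python
import requests

OPENSEARCH_URL = "https://your-domain-endpoint"  # placeholder endpoint
AUTH = ("master-user", "master-password")        # placeholder credentials

resp = requests.post(
    f"{OPENSEARCH_URL}/logs-*/_search",           # placeholder index pattern
    json={"size": 0, "query": {"match_all": {}}},
    auth=AUTH,
)
resp.raise_for_status()
shards = resp.json()["_shards"]

# Every shard in "total" runs the query phase on a data node, and the
# coordinating node merges all of those per-shard results.
print(f"query fanned out to {shards['total']} shards "
      f"({shards['successful']} successful, {shards['failed']} failed)")
```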

Regarding switching to daily/weekly indices: this depends on the queries that users will be running. If they span days, then moving to weekly indices should improve performance. In your case you could try 5 primary shards per weekly index, which would be approximately 25 GB per primary.
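If you try that, a minimal sketch of a composable index template for the weekly indices with 5 primaries; the template name, index pattern, and replica count are assumptions, not your actual settings:

```python
import requests

OPENSEARCH_URL = "https://your-domain-endpoint"  # placeholder endpoint
AUTH = ("master-user", "master-password")        # placeholder credentials

template = {
    "index_patterns": ["logs-weekly-*"],          # assumed naming scheme
    "template": {
        "settings": {
            "index.number_of_shards": 5,          # ~25 GB per primary at ~125 GB/week
            "index.number_of_replicas": 1,
        }
    },
    "priority": 200,
}

resp = requests.put(
    f"{OPENSEARCH_URL}/_index_template/logs-weekly",
    json=template,
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```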
