Performance best practices / performance issues

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 2.17.1

Describe the issue: We currently manage a 600 GB OpenSearch index across 3 shards on 3 pods (6 vCPUs each, 14 GB Xmx heap, 18 GB RAM per pod) and face performance issues with high-intensity searches during peak hours, including high latencies.

I’m seeking best practices to optimize for low-latency search workloads while ensuring scalability.

We are considering scaling to pods with 16 vCPUs, 24 GB Xmx heap, and 34 GB RAM total, supporting 40 primary shards + 1 replica (80 shards total).

Key questions:

  • Is horizontal scaling (more pods) better than or complementary to vertical scaling?

  • How do I calculate what is best in my case? My plan is to go with 40 shards, but I do not know what I need in terms of vertical or horizontal scaling. Are there any benchmarks, formulas, or tools like OpenSearch Rally?

Configuration:
600 GB index, 3 shards, 14 GB Xmx heap (18 GB RAM per pod), 6 vCPUs

Relevant Logs or Screenshots:

Hi @miha12345 ,

When building for low latency, I would consider the following points.

A shard is typically sized at 10-50 GB; when low search latency is the priority, this shifts to more like 10-30 GB. See https://opensearch.org/blog/optimize-opensearch-index-shard-size/ .
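As a rough sketch of that guideline applied to your numbers (the 600 GB index size is from your post; the 10-30 GB targets are from the blog linked above):

```python
import math

# Shard-count arithmetic from the 10-30 GB low-latency guideline.
# 600 GB is the index size from the question; targets are per-shard sizes.
index_size_gb = 600
target_min_gb, target_max_gb = 10, 30

max_shards = index_size_gb // target_min_gb            # smaller shards -> more of them
min_shards = math.ceil(index_size_gb / target_max_gb)  # larger shards -> fewer of them
print(f"primary shards in range: {min_shards}-{max_shards}")
```

So your planned 40 primaries would land at about 15 GB per shard, comfortably inside the low-latency range.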

For CPU, it is recommended to allow roughly 1.5 vCPUs per active shard; see "Operational best practices for Amazon OpenSearch Service" in the Amazon OpenSearch Service documentation.
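A hedged sizing sketch combining that rule with the plan from your post (the 40-primary / 1-replica layout and 16-vCPU pod size are your numbers, not recommendations, and the 1.5 vCPU/shard figure assumes every shard is actively searched at peak):

```python
import math

# Sketch: pods needed if every shard is active, at 1.5 vCPUs per shard.
primary_shards = 40   # the plan from the question
replicas = 1
vcpus_per_shard = 1.5 # AWS rule of thumb for active shards
pod_vcpus = 16        # proposed pod size from the question

total_shards = primary_shards * (1 + replicas)     # primaries + replicas
vcpus_needed = total_shards * vcpus_per_shard
pods_needed = math.ceil(vcpus_needed / pod_vcpus)
print(f"{total_shards} shards -> {vcpus_needed} vCPUs -> {pods_needed} pods")
```

If only a fraction of shards are hot at any moment, you can scale that down accordingly, but it gives an upper bound to sanity-check against.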

Then for RAM, see "Choosing the number of shards" in the Amazon OpenSearch Service documentation.

I would review these, then plan your changes while making sure you tick the boxes you're after for low latency.

Leeroy.