Timeouts during the search load test

OpenSearch 2.3

Describe the issue
We implemented a load test of search queries against an OpenSearch cluster with the k-NN plugin installed.
During the load test we get a lot of timeouts shortly after it starts.
An index with ~22M items is pre-loaded (144 GB storage, 32 shards, 890 segments). The vectors are 512-dimensional, and the Lucene HNSW engine is used.
ef_construction: 32
M: 64
load test concurrency: 25
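For reference, the settings above correspond roughly to an index body like the one in this sketch. The field name `embedding` and the helper `knn_index_body` are placeholders rather than names from our setup, and the space type is left at its default because it is not part of this description.

```python
import json


def knn_index_body(dimension, m, ef_construction, shards, field="embedding"):
    """Build a create-index body for a Lucene HNSW knn_vector field.

    `field` is a hypothetical field name used only for illustration.
    """
    return {
        "settings": {"index": {"knn": True, "number_of_shards": shards}},
        "mappings": {
            "properties": {
                field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "parameters": {
                            "m": m,
                            "ef_construction": ef_construction,
                        },
                    },
                }
            }
        },
    }


# Values taken from the description above.
body = knn_index_body(dimension=512, m=64, ef_construction=32, shards=32)
print(json.dumps(body, indent=2))
```

Passing different values for `m` and `ef_construction` gives the body for an alternative configuration of the same index.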
What is the expected behavior?
Either query times are low, or one of the monitored metrics clearly shows the cause of the problem (something to be scaled or reconfigured).
What is your host/environment?
The cluster runs on 16 m6g.xlarge.search data nodes and 3 r6g.large.search master nodes.
Do you have any additional context?
We’re trying to find the source of the problem by monitoring CPUUtilization, JVMMemoryPressure, and free storage.
None of them gets close to its limit during the test.
KNNGraphMemoryUsage is always 0, unlike in our faiss and nmslib HNSW tests.
Can you please give me some guidance on which metrics or potential problems I should look at?


What is your timeout setting (I assume the timeouts occur on the search queries), the value of k, the amount of RAM allocated to the JVM on the data nodes, and the number of replicas?

I can give a few general recommendations based on the description provided:
Try values of m and ef_construction that are closer to the Lucene defaults: m = 16, ef_construction = 100.
I’m not sure whether you merge segments, but with your settings 890 segments / 32 shards ≈ 28 segments per shard, which can be a lot. Try reducing the number of segments to, say, 10 per shard: you can call the force merge API (_forcemerge) with max_num_segments = 10 (the limit applies per shard). More segments give you better recall, but the tradeoff is higher latency: the per-segment search times add up, and their sum is greater than the search time for a single big segment.
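To make the segment arithmetic and the merge call concrete, here is a minimal sketch. It only computes the average segments per shard and builds the request path for the force merge endpoint (`_forcemerge` with the `max_num_segments` query parameter); it does not send anything. The index name `my-knn-index` and both helper functions are hypothetical.

```python
from urllib.parse import quote, urlencode


def segments_per_shard(total_segments, shards):
    """Average number of segments in each shard."""
    return total_segments / shards


def force_merge_path(index, max_num_segments):
    """Build the path for a POST to the force merge endpoint.

    max_num_segments is applied per shard, not per index.
    """
    query = urlencode({"max_num_segments": max_num_segments})
    return f"/{quote(index, safe='')}/_forcemerge?{query}"


# Numbers from the question: 890 segments spread over 32 shards.
print(round(segments_per_shard(890, 32), 1))  # 27.8

# "my-knn-index" is a placeholder index name.
print(force_merge_path("my-knn-index", 10))
```

The resulting path would be sent as a POST request with whatever client you already use. A force merge on an index of this size can take a long time, so it is probably best run before the load test rather than during it.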