Problems with kNN searches

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.8.0
OS Ubuntu 20.04.6 LTS

Describe the issue:
Hello! I am having trouble with approximate kNN search. We have a cluster with ~300 kNN indices; each index holds ~9 million vectors (8 shards, cosine metric, dimension 512, m=16, ef=1024), and every index is force-merged to one segment per shard. All indices are added to a single alias, with two indices per date and 5 months of data (2 indices * 30 days * 5 months = 300 indices).
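For reference, a mapping with these parameters would look roughly like the sketch below (the index name, field name, and engine are illustrative rather than copied from our actual setup, and ef is shown as ef_construction):

PUT /knn-index-2023-06-01
{
  "settings": {
    "index": {
      "number_of_shards": 8,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 512,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "m": 16,
            "ef_construction": 1024
          }
        }
      }
    }
  }
}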
Several problems were found during testing:

  1. Time to search for a single vector:
  • 60 indices (1 month) = ~120 s
  • 120 indices (2 months) = ~240 s
  • 180 indices (3 months) = ~360 s
  • 240 indices (4 months) = 480-540 s (strange)
  • 300 indices (5 months) = the search never completes: no errors, and I waited more than an hour

To get these results, I increased the following parameters:
"search.default_keep_alive": "30m",
"search.max_open_scroll_context": 2000
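These were applied as persistent cluster settings, roughly like this (the exact request is a reconstruction; the values also appear in the cluster parameters dump below):

PUT _cluster/settings
{
  "persistent": {
    "search.default_keep_alive": "30m",
    "search.max_open_scroll_context": 2000
  }
}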

Before that, when I ran the search across 3 months of indices, I sometimes got the error "No context found for id XXXX".

The question is: why does the search across 4 months complete in 480 seconds, while the search across 5 months does not complete at all?
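For context, each search is a single approximate kNN query against the alias, shaped roughly like this (the alias name, field name, and k are illustrative, and the 512-dimensional query vector is truncated here):

GET /knn-alias/_search
{
  "size": 10,
  "query": {
    "knn": {
      "vector": {
        "vector": [0.12, -0.03, 0.41],
        "k": 10
      }
    }
  }
}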

  2. The OS kills the OpenSearch process due to lack of memory.
    I tried setting the kNN memory circuit breaker limit to 70% and changing the JVM heap size: at first the heap was 30 GB, then I lowered it to 8 GB.
    If I run one search at a time (across 180 indices), the virtual machine's memory utilization stays at 75-90%. But if I run another search in parallel (against other indices), memory utilization reaches 100% and the OS kills the process. Why is this happening?
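For reference, per-node graph memory usage and cache evictions can be checked with the k-NN stats API:

GET /_plugins/_knn/stats?pretty

Fields such as graph_memory_usage, cache_capacity_reached, and eviction_count show whether the native HNSW graphs fit under the knn.memory.circuit_breaker.limit of 70%.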

  3. CPU utilization during the search does not rise above 60% for some reason. Which parameter controls this?
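One thing worth checking here is whether the search thread pool is saturated:

GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected

With 8 CPUs per node the search thread pool has a fixed size, and as far as I understand, in 2.8 each shard is searched on a single thread, so CPU may stay well below 100% if queries spend most of their time loading graphs from disk.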

P.S. We have another cluster for "hot" searches (all data stored in RAM), with more powerful hardware and more data, and it does not have these problems. Now we are trying to build a cluster for "cold" searches.

Configuration:
8 data nodes, each with 8 CPUs, 64 GB RAM, 5 TB SSD
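As a rough sanity check (my own back-of-the-envelope arithmetic, using the k-NN plugin's documented estimate of 1.1 * (4 * dimension + 8 * m) bytes per vector):

per vector: 1.1 * (4 * 512 + 8 * 16) ≈ 2.4 KB
per index:  9,000,000 vectors * 2.4 KB ≈ 21.5 GB of native graphs
per month:  60 indices * 21.5 GB ≈ 1.3 TB, or roughly 160 GB per data node

which is well above the 64 GB of RAM per node, so even one month of graphs cannot stay fully cached in memory.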

cluster parameters:

{
  "persistent": {
    "action": {
      "destructive_requires_name": "true"
    },
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "none"
        },
        "allocation": {
          "allow_rebalance": "indices_all_active",
          "cluster_concurrent_rebalance": "15",
          "node_concurrent_recoveries": "2",
          "disk": {
            "threshold_enabled": "true",
            "watermark": {
              "low": "99.97%",
              "flood_stage": "99.99%",
              "high": "99.98%"
            }
          },
          "enable": "all",
          "node_concurrent_outgoing_recoveries": "2"
        }
      },
      "metadata": {
        "perf_analyzer": {
          "state": "0"
        }
      }
    },
    "knn": {
      "algo_param": {
        "index_thread_qty": "7"
      },
      "cache": {
        "item": {
          "expiry": {
            "enabled": "false",
            "minutes": "1s"
          }
        }
      },
      "circuit_breaker": {
        "triggered": "true"
      },
      "memory": {
        "circuit_breaker": {
          "limit": "70%",
          "enabled": "true"
        }
      }
    },
    "search": {
      "default_keep_alive": "30m",
      "max_open_scroll_context": "2000"
    },
    "plugins": {
      "replication": {
        "follower": {
          "metadata_sync_interval": "15s"
        }
      },
      "index_state_management": {
        "template_migration": {
          "control": "-1"
        }
      }
    }
  },
  "transient": {}
}

Relevant Logs or Screenshots:

@jmazane

@krishna_ggk

@vamshin

Can you help me, please?