Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.15 (AWS OpenSearch)
Describe the issue:
I’d like to double-check how ANN graph memory sizing works. I’ve found a few different equations for how much space a graph takes in memory.
I have 8 data nodes, each with 64 GB RAM. My understanding of OpenSearch and AWS is that half the RAM goes to the OpenSearch heap (32 GB) and half of what remains (32 GB) goes to the graph memory cache (16 GB).
Meanwhile, my index has a knn_vector field with 384 dimensions and an m of 16 (the default). There are 37 million documents and 1 replica.
Here’s a readout from a data node in GET /_plugins/_knn/*/stats. It seems to show that the node is maxed out on graph memory, with roughly 15 GB allocated, and is unable to fit the graph in memory, judging from cache_capacity_reached:
"qj0SM3L1RQmwF8yvift2ug": {
"max_distance_query_with_filter_requests": 0,
"graph_memory_usage_percentage": 99.36626,
"graph_query_requests": 149283,
"graph_memory_usage": 15283920,
"cache_capacity_reached": true,
"load_success_count": 39663,
"training_memory_usage": 0,
"indices_in_cache": {
"candidates-896bdaf8-1b82-4379-baa9-f47eb3b6f7d8": {
"graph_memory_usage": 15283920,
"graph_memory_usage_percentage": 99.36626,
"graph_count": 210
}
},
"script_query_errors": 0,
"hit_count": 109609,
"knn_query_requests": 0,
"total_load_time": 2357202456074,
"miss_count": 39674,
"min_score_query_requests": 80,
"knn_query_with_filter_requests": 0,
"training_memory_usage_percentage": 0,
"max_distance_query_requests": 0,
"lucene_initialized": false,
"graph_index_requests": 0,
"faiss_initialized": true,
"load_exception_count": 0,
"training_errors": 0,
"min_score_query_with_filter_requests": 80,
"eviction_count": 39453,
"nmslib_initialized": false,
"script_compilations": 0,
"script_query_requests": 0,
"graph_stats": {
"refresh": {
"total_time_in_millis": 0,
"total": 0
},
"merge": {
"current": 0,
"total": 0,
"total_time_in_millis": 0,
"current_docs": 0,
"total_docs": 0,
"total_size_in_bytes": 0,
"current_size_in_bytes": 0
}
},
"graph_query_errors": 0,
"indexing_from_model_degraded": false,
"graph_index_errors": 0,
"training_requests": 0,
"script_compilation_errors": 0
},
My understanding from this readout is that the graph doesn’t fit into memory, which means ANN queries will be slower because graphs keep getting evicted and reloaded.
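For reference, here’s how I’m reading the cache behavior from those stats (a quick Python sketch; the field names and numbers are just the ones from the readout above, and I’m assuming hit_count/miss_count are cache lookups for graph files):

# Sanity check on cache behavior, using the counters from the stats readout above.
stats = {
    "hit_count": 109609,
    "miss_count": 39674,
    "eviction_count": 39453,
    "load_success_count": 39663,
}

total_lookups = stats["hit_count"] + stats["miss_count"]
hit_rate = stats["hit_count"] / total_lookups

print(f"cache hit rate: {hit_rate:.1%}")        # ~73% -- every miss means a graph has to be reloaded
print(f"evictions:      {stats['eviction_count']}")  # almost every loaded graph eventually gets evicted again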
And here’s the index config for the vector field:
"my_vector": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "innerproduct",
"engine": "faiss",
},
"doc_values": False,
},
The closest equation I’ve found for the graph memory size, in GB per data node, is:

1.1 * (4 * dimension + 8 * M) * (num_documents * 2) / 1024^3 / num_data_nodes

where the factor of 2 accounts for the 1 replica. With my numbers that works out to about 15.7 GB, assuming the equation is right.
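Here’s the back-of-the-envelope script I used to get that number, in case my arithmetic is off. The 1.1 * (4 * dimension + 8 * M) bytes-per-vector term is the HNSW estimate I found in the k-NN docs; the even split across data nodes is my own assumption:

# Rough HNSW graph memory estimate per data node.
def estimate_graph_memory_gb(dimension, m, num_documents, replicas, num_data_nodes):
    vectors_total = num_documents * (1 + replicas)            # primaries + replicas
    bytes_per_vector = 1.1 * (4 * dimension + 8 * m)          # float32 vector + HNSW links, ~10% overhead
    total_bytes = bytes_per_vector * vectors_total
    return total_bytes / (1024 ** 3) / num_data_nodes         # assumes graphs spread evenly across nodes

print(estimate_graph_memory_gb(dimension=384, m=16,
                               num_documents=37_000_000, replicas=1,
                               num_data_nodes=8))             # ~15.77 GB per node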
My core question here, though, is: how do I predict the size of the graph memory? Right now I know the graph doesn’t fit into the ~15.2 GB available. Am I really only about 0.5 GB short of what I need (based on the 15.7 GB estimate)? Is 15.7 GB of graph memory per data node what’s required for a vector field of this type?
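One more bit of arithmetic I did to compare the estimate with reality: if I’m reading the stats docs right and graph_memory_usage is reported in kilobytes, the usage and percentage above imply the following cache capacity on that node (so the limit may be a bit lower than the 16 GB I assumed, if that unit is correct):

# Back out the node's graph cache capacity from the stats readout above.
# Assumes graph_memory_usage is in kilobytes, which is my reading of the k-NN stats docs.
graph_memory_usage_kb = 15_283_920
graph_memory_usage_pct = 99.36626

used_gib = graph_memory_usage_kb * 1024 / (1024 ** 3)
capacity_gib = used_gib / (graph_memory_usage_pct / 100)

print(f"used:     {used_gib:.2f} GiB")      # ~14.6 GiB in use
print(f"capacity: {capacity_gib:.2f} GiB")  # ~14.7 GiB implied cache limit on this node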
Configuration:
Relevant Logs or Screenshots: