Question about ANN graph memory size

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.15 (AWS OpenSearch)

Describe the issue:
I’d like to double-check how ANN graph memory size works. I’ve found a few different equations for how much space a graph takes in memory.

I have 8 data nodes, each with 64 GB RAM. My understanding of OpenSearch on AWS is that half the RAM goes to the OpenSearch JVM heap (32 GB), and half of what remains (32 GB) goes to graph memory caching (16 GB).

Meanwhile, my index has a knn_vector field with 384 dimensions and an m of 16 (the default). There are 37 million documents and 1 replica.

Here’s the readout for one data node from GET /_plugins/_knn/*/stats. It seems to show that the node is maxed out on graph memory, with roughly 15 GB allocated (graph_memory_usage is reported in KB), and unable to fit the whole graph in memory, judging from cache_capacity_reached.

"qj0SM3L1RQmwF8yvift2ug": {
      "max_distance_query_with_filter_requests": 0,
      "graph_memory_usage_percentage": 99.36626,
      "graph_query_requests": 149283,
      "graph_memory_usage": 15283920,
      "cache_capacity_reached": true,
      "load_success_count": 39663,
      "training_memory_usage": 0,
      "indices_in_cache": {
        "candidates-896bdaf8-1b82-4379-baa9-f47eb3b6f7d8": {
          "graph_memory_usage": 15283920,
          "graph_memory_usage_percentage": 99.36626,
          "graph_count": 210
        }
      },
      "script_query_errors": 0,
      "hit_count": 109609,
      "knn_query_requests": 0,
      "total_load_time": 2357202456074,
      "miss_count": 39674,
      "min_score_query_requests": 80,
      "knn_query_with_filter_requests": 0,
      "training_memory_usage_percentage": 0,
      "max_distance_query_requests": 0,
      "lucene_initialized": false,
      "graph_index_requests": 0,
      "faiss_initialized": true,
      "load_exception_count": 0,
      "training_errors": 0,
      "min_score_query_with_filter_requests": 80,
      "eviction_count": 39453,
      "nmslib_initialized": false,
      "script_compilations": 0,
      "script_query_requests": 0,
      "graph_stats": {
        "refresh": {
          "total_time_in_millis": 0,
          "total": 0
        },
        "merge": {
          "current": 0,
          "total": 0,
          "total_time_in_millis": 0,
          "current_docs": 0,
          "total_docs": 0,
          "total_size_in_bytes": 0,
          "current_size_in_bytes": 0
        }
      },
      "graph_query_errors": 0,
      "indexing_from_model_degraded": false,
      "graph_index_errors": 0,
      "training_requests": 0,
      "script_compilation_errors": 0
    },

My understanding from this readout is that the graph doesn’t fit into memory meaning that ANN will be slower.

And here’s the index config for the vector field:


"my_vector": {
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "space_type": "innerproduct",
        "engine": "faiss"
    },
    "doc_values": false
}

The closest equation I’ve found for that graph memory size, in GB per node, is 1.1 * (dimension * 4 + 8 * m) * (num_documents * 2) / 1024 / 1024 / 1024 / num_nodes, where the * 2 accounts for the 1 replica. That works out to 15.7 GB, assuming it’s right.

My core question, though, is: how do I predict the size of the graph memory? Right now I know it doesn’t fit into the ~15.2 GB of cache. Am I really only 0.5 GB off from what I need (based on the 15.7 GB number)? Is 15.7 GB of RAM per data node what’s required for a vector field of this type?

Configuration:

Relevant Logs or Screenshots:

From my understanding, graph memory resides in Java's heap memory (the "OpenSearch heap (32 GB)") and not in the OS independently, with one partial exception: Expanding k-NN with Lucene approximate nearest neighbor search · OpenSearch

I also have a similar question to yours and haven’t heard an answer yet :frowning: Where can I find the total vector count per index?

Hi Ken,

The formula for calculating the graph size looks correct:

Total Graph Size (GB) = 1.1 * (dimension * 4 + 8 * M) * (num_documents * (1 + Number of Replicas)) / 1024 / 1024 / 1024
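Plugging in the numbers from this thread (384 dimensions, m = 16, 37M documents, 1 replica, 8 data nodes), a quick sanity check of that formula:

```python
# Estimated HNSW graph memory for the setup described in this thread.
dimension = 384              # vector dimension
m = 16                       # HNSW m parameter
num_documents = 37_000_000
replicas = 1
num_nodes = 8                # data nodes sharing the graphs

# 1.1 * (4 bytes per float * dimension + 8 bytes per neighbor link * m)
bytes_per_vector = 1.1 * (dimension * 4 + 8 * m)

total_gb = bytes_per_vector * num_documents * (1 + replicas) / 1024**3
per_node_gb = total_gb / num_nodes

print(round(total_gb, 1))     # ≈ 126.1 GB across the cluster
print(round(per_node_gb, 2))  # ≈ 15.77 GB per node, matching the ~15.7 GB estimate
```

Assuming primaries and replicas spread evenly across the 8 nodes, each node needs roughly 15.77 GB of graph cache, just above the ~15.2 GB currently available, which is consistent with the cache_capacity_reached flag.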

Regarding the cache capacity being reached and the eviction count of 39,453, here’s what’s happening:

By default, the k-NN circuit breaker limit is set to 50% of the memory left over after the JVM heap (as outlined in this documentation). Since 32 GB is allocated to the JVM, 16 GB of the remaining 32 GB is available for graph caching. As memory usage approaches that threshold, the circuit breaker triggers frequent evictions, resulting in constant loading and unloading of graphs — which is what the eviction count of 39,453 reflects.

To mitigate this, one approach is to increase the circuit breaker limit to 60%. That would allow approximately 19.2 GB for caching, so a 15.7 GB graph can be loaded into memory without triggering the circuit breaker.
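For reference, the limit can be changed dynamically through the cluster settings API. A sketch of the request (assuming you want the change to persist across restarts):

```json
PUT /_cluster/settings
{
  "persistent": {
    "knn.memory.circuit_breaker.limit": "60%"
  }
}
```

Note that on AWS-managed OpenSearch this setting may not be directly modifiable; check what your service tier exposes before relying on it.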

Please let me know if you have any further questions.

Thanks
