We currently need to store 10,000 768-dimensional vectors per day.
If each data node loads all of the vector data into memory, it will consume a lot of RAM.
I would like to know whether the vector cache can be spread across different machines to support horizontal scaling.
Looking forward to your reply. Thank you!
What do you mean? OpenSearch is for indexing data and then searching it as fast as possible xD
I’m planning to use the k-NN plugin: I need to put a large number of 768-dimensional vectors into OpenDistro and search them with Approximate k-NN. While running an experiment, I found that OpenDistro preloads the vectors into RAM, and I learned from the documentation how the cache footprint is calculated. As the data grows, the cache footprint grows with it, which will cause trouble for RAM usage. Does OpenDistro support spreading the vector cache across different machines?
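For context, here is the rough estimate I get from that formula, as a quick Python sketch (the M value of 16 is just an assumed default, not something I've tuned):

```python
# Back-of-the-envelope HNSW graph memory, per the k-NN docs:
# roughly 1.1 * (4 * dimension + 8 * M) bytes per vector.
# M = 16 is an assumed default here.
def hnsw_memory_bytes(num_vectors: int, dimension: int, m: int = 16) -> float:
    return 1.1 * (4 * dimension + 8 * m) * num_vectors

daily = hnsw_memory_bytes(num_vectors=10_000, dimension=768)
print(f"~{daily / 2**20:.1f} MiB of graph memory per day")  # ~33.6 MiB
```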
Hi @lin, so the vectors are stored in indices. When a search goes out, the graphs for each shard being searched are first loaded into memory if they are not already in the cache, and then the graphs are searched. Horizontal scaling is controlled by the shard count. For instance, if you have 5 shards and a 5-node cluster, each shard would get assigned to a single node. If you have 2 shards and a 5-node cluster, only 2 nodes would have a shard on them.
So, to achieve horizontal scalability, you have to come up with a good sharding strategy and roll over/reindex when necessary.
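As a concrete sketch, an index definition along these lines would spread the graphs across 10 shards (the host, index name, and field name below are placeholders, not anything from your setup):

```python
import requests

# Hypothetical k-NN index creation: the number_of_shards setting controls
# how the graphs, and hence the cache memory, are spread across nodes.
body = {
    "settings": {
        "index": {
            "knn": True,
            "number_of_shards": 10,   # e.g. one shard per node in a 10-node cluster
            "number_of_replicas": 0,
        }
    },
    "mappings": {
        "properties": {
            "my_vector": {"type": "knn_vector", "dimension": 768}
        }
    },
}
resp = requests.put("http://localhost:9200/my-knn-index", json=body)
print(resp.json())
```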
Does it mean that if I have more nodes and shards processing the same amount of data, the size of a single shard will be smaller, so the preloaded memory on each node will also be smaller? Let’s say I have 10 shards and a 10-node cluster. Each node will load and use less memory than with 5 shards and a 5-node cluster. Is that correct? Thank you!
Correct. If you have 1,000,000 documents and 5 shards and 5 nodes, each shard will have roughly 200,000 documents and each node will have 1 shard. If you have 10 nodes and 10 shards, each shard will roughly have 100,000 documents and each node will have 1 shard.
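Running the documentation's memory estimate over those two layouts makes the difference concrete (this assumes 768 dimensions, the default M of 16, and an even distribution of one shard per node):

```python
# Same per-vector estimate as above, split across nodes.
total_vectors = 1_000_000
bytes_per_vector = 1.1 * (4 * 768 + 8 * 16)

for shards in (5, 10):
    per_node = bytes_per_vector * total_vectors / shards
    print(f"{shards} shards on {shards} nodes -> {per_node / 2**30:.2f} GiB per node")
# 5 shards on 5 nodes   -> 0.66 GiB per node
# 10 shards on 10 nodes -> 0.33 GiB per node
```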