Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Using the latest Docker image: opensearchproject/opensearch:latest
Describe the issue:
We’re testing vector search and so far are not happy with the results. We’re wondering what we have done wrong.
I have a test document with the following content:
“Ebola Virus Disease (EVD) and encourage U.S. hospitals to prepare for managing patients with\r\nEbola and other infectious diseases. Every hospital should ensure that it can detect a patient with\r\nEbola, protect healthcare workers so they can safely care for the patient, and respond in a coordinated fashion. Many of the signs and symptoms of Ebola are non-specific and similar to those of many common”
I use the OpenAI embedding model “text-embedding-3-large” to generate the embedding vectors.
The problem is that if I search for an unrelated term, I get a score very similar to the score for a word that actually appears in the document:
“dog Rex”: score 0.5197706
“virus”: score 0.5711079
When I compute the cosine similarity manually in C# code, I get quite different values:
“dog Rex”: score 0.07607428956253202
“virus”: score 0.24901766327076158
While not perfect, that is roughly a 3x difference between the two queries, compared with a ~10% difference in OpenSearch.
My manual cosine similarity function looks like this:
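To make the comparison concrete, the gap between the two queries can be computed directly from the four scores quoted above. The sketch below also tries one possible explanation, which is an assumption and not something confirmed in this post: that the k-NN plugin reports a score of 1 / (1 + d), where d = 1 - cosine similarity, for this engine and space type. The variable names and the `AssumedScore` helper are illustrative only.

```csharp
using System;

// Scores copied from the post.
double osDog = 0.5197706, osVirus = 0.5711079;   // scores reported by OpenSearch
double cosDog = 0.07607428956253202;             // manual cosine similarity
double cosVirus = 0.24901766327076158;

// Relative gap between the two queries on each scale.
Console.WriteLine(osVirus / osDog);    // about 1.10
Console.WriteLine(cosVirus / cosDog);  // about 3.27

// ASSUMPTION (not confirmed in the post): score = 1 / (1 + d), d = 1 - cosine.
double AssumedScore(double cosine) => 1.0 / (1.0 + (1.0 - cosine));

Console.WriteLine(AssumedScore(cosDog));   // compare with 0.5197706
Console.WriteLine(AssumedScore(cosVirus)); // compare with 0.5711079
```

With the values above, the assumed mapping gives approximately 0.5198 and 0.5711. If that mapping is what OpenSearch applies, the two sets of numbers would describe the same similarity on different scales: cosine values in [-1, 1] map to scores in [1/3, 1], so differences between scores look compressed.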
using System;
using System.Linq;

// Cosine similarity: dot(v1, v2) / (|v1| * |v2|), in the range [-1, 1].
double CalculateCosineSimilarity(float[] vector1, float[] vector2)
{
    if (vector1.Length != vector2.Length)
    {
        throw new ArgumentException("Vectors must be of equal length.");
    }
    // Element-wise products summed to get the dot product.
    double dotProduct = vector1.Zip(vector2, (a, b) => a * b).Sum();
    // Euclidean norms of both vectors.
    double magnitude1 = Math.Sqrt(vector1.Sum(a => a * a));
    double magnitude2 = Math.Sqrt(vector2.Sum(b => b * b));
    return dotProduct / (magnitude1 * magnitude2);
}
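As a sanity check that the function itself behaves as expected, it can be run against a few hand-made vectors whose cosine similarity is known in advance. This is only a sketch: the toy vectors are hypothetical and are not real embeddings, and the function is repeated as a local function so the snippet runs on its own as a top-level program.

```csharp
using System;
using System.Linq;

// Same logic as the function in the post, declared as a local function so
// the snippet runs as a standalone top-level program.
double CalculateCosineSimilarity(float[] vector1, float[] vector2)
{
    if (vector1.Length != vector2.Length)
    {
        throw new ArgumentException("Vectors must be of equal length.");
    }
    double dotProduct = vector1.Zip(vector2, (a, b) => a * b).Sum();
    double magnitude1 = Math.Sqrt(vector1.Sum(a => a * a));
    double magnitude2 = Math.Sqrt(vector2.Sum(b => b * b));
    return dotProduct / (magnitude1 * magnitude2);
}

// Hypothetical toy vectors, not real embeddings.
float[] x = { 1f, 0f };
float[] sameDirection = { 2f, 0f };
float[] orthogonal = { 0f, 3f };
float[] opposite = { -1f, 0f };

Console.WriteLine(CalculateCosineSimilarity(x, sameDirection)); // expected: 1
Console.WriteLine(CalculateCosineSimilarity(x, orthogonal));    // expected: 0
Console.WriteLine(CalculateCosineSimilarity(x, opposite));      // expected: -1
```

Cosine similarity ranges from -1 to 1 and ignores vector length, which is why the vector scaled by 2 still scores 1.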
Configuration:
I created the index as follows (initially with ef_construction 128, later increased to 500):
PUT /my-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "content_vector": {
        "type": "knn_vector",
        "dimension": 3072,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 500,
            "m": 16
          }
        }
      }
    }
  }
}
Post document:
POST /my-index/_doc/1
{
  "content": "Ebola Virus Disease (EVD) and encourage U.S. hospitals to prepare for managing patients with\r\nEbola and other infectious diseases. Every hospital should ensure that it can detect a patient with\r\nEbola, protect healthcare workers so they can safely care for the patient, and respond in a coordinated fashion. Many of the signs and symptoms of Ebola are non-specific and similar to those of many common",
  "content_vector": [ ...omitted for brevity... ]
}
Search:
POST /my-index/_search
{
  "size": 10,
  "query": {
    "knn": {
      "content_vector": {
        "vector": [ ...omitted for brevity... ],
        "k": 10
      }
    }
  }
}
Relevant Logs or Screenshots: