Need Neural Search Plugin to support Nested Field Type (Array of objects)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch 2.8/2.9 (Neural Search Plugin 2.8/2.9)
Server OS: Linux

Describe the issue:
We need the Neural Search Plugin to support generating vector values for the nested field type. When we tried the plugin, we found that it does not support this.

In our case, we split a large document into multiple chunks and save them in an array as elements of a nested field. We expect the Neural Search Plugin to generate a vector value for each element of the nested field and to save each generated vector alongside the original element.

The Neural Search Plugin currently seems to support generating vector values only for an array of strings, and it saves the generated vectors in a separate array.

Configuration:

In the following example, “question_embeddings” is a nested field with two subfields: “text” and “text_embedding”. “text” contains the value to be vectorized, and “text_embedding” will contain the vector generated for that value.

In “demo_pipeline”, “text” is mapped to “text_embedding” in the “field_map”:

PUT _ingest/pipeline/demo_pipeline
{
  "description": "ML search test pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "8-S0XYsBFQTip4T18k-U",
        "field_map": {
          "question_embeddings": {
            "text": "text_embedding"
          }
        }
      }
    }
  ]
}

The following demo_index uses demo_pipeline:

PUT demo_index
{
  "settings": {
    "number_of_replicas": "0",
    "index.knn": true,
    "default_pipeline": "demo_pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "question_embeddings": {
        "type": "nested",
        "properties": {
          "text": {
            "type": "text"
          },
          "text_embedding": {
            "type": "knn_vector",
            "dimension": 384,
            "method": {
              "name": "hnsw",
              "space_type": "l2",
              "engine": "nmslib"
            }
          }
        }
      }
    }
  }
}

Data to be ingested:

POST demo_index/_doc
{"id": "1", "question_embeddings": [{"text": "Eric is an engineer"}, {"text": "and he uses OpenSearch"}]}

Expected results:

The following document should be ingested, with a vector value generated in the embedding field of each element:

{"id": "1", "question_embeddings": [{"text": "Eric is an engineer", "text_embedding": "<Neural Search Plugin generated vector value for the text field>"}, {"text": "and he uses OpenSearch", "text_embedding": "<Neural Search Plugin generated vector value for the text field>"}]}
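
To make the expected behavior concrete, here is a minimal Python sketch of the transformation we would like the processor to perform. It is not the plugin's implementation: `fake_embed` is a hypothetical stand-in for model inference that returns a toy 4-dimensional vector (the real field is 384-dimensional), and `embed_nested` is an illustrative helper name.

```python
from typing import Any, Callable, Dict, List


def fake_embed(text: str, dimension: int = 4) -> List[float]:
    """Hypothetical stand-in for model inference (NOT the plugin's model)."""
    # Derive a deterministic toy vector from the character codes.
    total = sum(ord(ch) for ch in text)
    return [((total + i) % 100) / 100.0 for i in range(dimension)]


def embed_nested(
    doc: Dict[str, Any],
    nested_field: str,
    source_key: str,
    target_key: str,
    embed: Callable[[str], List[float]] = fake_embed,
) -> Dict[str, Any]:
    """Return a copy of doc where each nested element carries its own vector."""
    if nested_field not in doc:
        return dict(doc)
    enriched = dict(doc)
    enriched[nested_field] = [
        # Keep the original element and store the vector next to its text.
        {**item, target_key: embed(item[source_key])} if source_key in item else dict(item)
        for item in doc[nested_field]
    ]
    return enriched


doc = {
    "id": "1",
    "question_embeddings": [
        {"text": "Eric is an engineer"},
        {"text": "and he uses OpenSearch"},
    ],
}
enriched = embed_nested(doc, "question_embeddings", "text", "text_embedding")
for item in enriched["question_embeddings"]:
    print(item["text"], len(item["text_embedding"]))
```

The point of the sketch is the output shape: one vector per array element, stored inside the same element as the text it was generated from.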

Relevant Logs or Screenshots:

The following is the error returned by the ingest request:

{"error":{"root_cause":[{"type":"class_cast_exception","reason":"class java.util.ArrayList cannot be cast to class java.util.Map (java.util.ArrayList and java.util.Map are in module java.base of loader 'bootstrap')"}],"type":"class_cast_exception","reason":"class java.util.ArrayList cannot be cast to class java.util.Map (java.util.ArrayList and java.util.Map are in module java.base of loader 'bootstrap')"},"status":500}
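
One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: the nested map inside “field_map” tells the processor to treat “question_embeddings” as an object (a map), while the document supplies a list. The Python analogy below reproduces that kind of mismatch. `walk_field_map` is a hypothetical helper written for illustration and does not mirror the plugin's actual code.

```python
from typing import Any, Dict, Iterator, Tuple


def walk_field_map(
    field_map: Dict[str, Any], source: Dict[str, Any], prefix: str = ""
) -> Iterator[Tuple[str, str, Any]]:
    """Yield (source_path, target_field, value) for each leaf of field_map."""
    for key, target in field_map.items():
        value = source.get(key)
        path = f"{prefix}{key}"
        if isinstance(target, dict):
            # A nested map in field_map assumes the document value is a map too.
            if not isinstance(value, dict):
                raise TypeError(
                    f"{type(value).__name__} cannot be treated as a map for field '{path}'"
                )
            yield from walk_field_map(target, value, prefix=path + ".")
        else:
            yield path, target, value


field_map = {"question_embeddings": {"text": "text_embedding"}}
doc = {
    "id": "1",
    "question_embeddings": [
        {"text": "Eric is an engineer"},
        {"text": "and he uses OpenSearch"},
    ],
}

try:
    list(walk_field_map(field_map, doc))
except TypeError as err:
    # The list-valued field is rejected, analogous to the ClassCastException.
    print("failed:", err)
```

Under this assumption, the walk would succeed if “question_embeddings” held a single object instead of an array of objects.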

Hi @erwang,
We are working on fixing the issue for nested fields. However, the ingestion processor as you have defined it will not work. Please see the updated definition below, which should work once the fix is released.

PR: Treat . as a nested field in field_map of text embedding processor by Sanjana679 · Pull Request #488 · opensearch-project/neural-search · GitHub

{
  "description": "ML search test pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "8-S0XYsBFQTip4T18k-U",
        "field_map": {
          "question_embeddings.text": "question_embeddings.text_embedding"
        }
      }
    }
  ]
}

Thanks @Navneet,

Do you know when a fix will be available on GitHub, and which Neural Search Plugin branch it will be merged into?

Also, in our case, the nested field can be a child of another field. For example:

{
  "settings": {
    "number_of_replicas": "0",
    "index.knn": true,
    "default_pipeline": "demo_pipeline2"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "top-level-field": {
        "properties": {
          "question_embeddings": {
            "type": "nested",
            "properties": {
              "text": {
                "type": "text"
              },
              "text_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                  "name": "hnsw",
                  "space_type": "l2",
                  "engine": "nmslib"
                }
              }
            }
          }
        }
      }
    }
  }
}

I tried the pipeline syntax you provided. The pipeline is created and the data is ingested without errors, but no vector values are generated.

cat pipeline2.json

{
  "description": "ML search test pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "8-S0XYsBFQTip4T18k-U",
        "field_map": {
           "question_embeddings.text" : "question_embeddings.text_embedding"
        }
      }
    }
  ]
}

cat demo_index2.json

{
  "settings": {
    "number_of_replicas": "0",
    "index.knn": true,
    "default_pipeline": "demo_pipeline2"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "question_embeddings": {
        "type" : "nested",
        "properties": {
          "text": {
            "type": "text"
          },
          "text_embedding": {
             "type": "knn_vector",
             "dimension": 384,
             "method": {
               "name": "hnsw",
               "space_type": "l2",
               "engine": "nmslib"
            }
          }
        }
      }
    }
  }
}

cat ingest_nested_data.json

{
  "id": "1",
  "question_embeddings": [
    {
      "text": "Eric is an engineer"
    },
    {
      "text": "and he uses OpenSearch"
    }
  ]
}
curl -k -utest:welcome1 -H "content-type: application/x-ndjson" -X POST https://localhost:9200/demo_index2/_doc -d @ingest_nested_data.json 

{"_index":"demo_index2","_id":"EHQ9E4wBtx7Z2dS_3K-c","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":0,"_primary_term":

No vector values are ingested:

 curl -k -utest:welcome1 -H "content-type: application/x-ndjson" -X POST "https://localhost:9200/demo_index2/_search?size=10" | json_pp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   335  100   335    0     0  13958      0 --:--:-- --:--:-- --:--:-- 13958
{
   "took" : 0,
   "hits" : {
      "total" : {
         "value" : 1,
         "relation" : "eq"
      },
      "max_score" : 1,
      "hits" : [
         {
            "_id" : "EHQ9E4wBtx7Z2dS_3K-c",
            "_score" : 1,
            "_index" : "demo_index2",
            "_source" : {
               "question_embeddings" : [
                  {
                     "text" : "Eric is an engineer"
                  },
                  {
                     "text" : "and he uses OpenSearch"
                  }
               ],
               "id" : "1"
            }
         }
      ]
   },
   "_shards" : {
      "successful" : 1,
      "failed" : 0,
      "total" : 1,
      "skipped" : 0
   },
   "timed_out" : false
}
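
As a quick way to confirm this from a script, the following Python sketch scans a search response and reports which nested elements lack a vector. `find_missing_embeddings` is a hypothetical helper name, and the `response` dictionary is a trimmed copy of the output above rather than a live query result.

```python
from typing import Any, Dict, List, Tuple


def find_missing_embeddings(
    response: Dict[str, Any],
    nested_field: str = "question_embeddings",
    vector_key: str = "text_embedding",
) -> List[Tuple[str, int]]:
    """Return (document id, element position) for elements without a vector."""
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        elements = hit.get("_source", {}).get(nested_field, [])
        for position, element in enumerate(elements):
            if vector_key not in element:
                missing.append((hit["_id"], position))
    return missing


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "hits": [
            {
                "_id": "EHQ9E4wBtx7Z2dS_3K-c",
                "_index": "demo_index2",
                "_source": {
                    "question_embeddings": [
                        {"text": "Eric is an engineer"},
                        {"text": "and he uses OpenSearch"},
                    ],
                    "id": "1",
                },
            }
        ]
    }
}

missing = find_missing_embeddings(response)
print(missing)
```

Both elements of the document are reported, which matches what the raw output shows: neither element has a “text_embedding” value.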

Hi @Navneet. I had a look at the GitHub PR you posted (Treat . as a nested field in field_map of text embedding processor by Sanjana679 · Pull Request #488 · opensearch-project/neural-search · GitHub). I think that case is different from the use case I raised. In that test case, the vector is created for a child element “text”, and the generated vector is saved in the top-level field “message_embedding”. In other words, it handles the situation where the original attribute and the generated embedding attribute do not belong to the same parent attribute.

The following is another use case, similar to the one in the GitHub PR you mentioned. However, in our case, the field is a nested field and it needs to be mapped to an array of embeddings (vector values). The index mappings and pipeline definition are as follows:

{
  "settings": {
    "index.knn": true,
    "default_pipeline": "vector-search-pipeline"
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "reference": {
        "type": "nested",
        "properties": {
          "reference_question": {
            "type": "text"
          },
          "reference_query": {
            "type": "text"
          }
        }
      }
    }
  }
}

Ingestion pipeline:

{
  "description": "A text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "boOSM4wBeS0TqO_bDUAQ",
        "field_map": {
          "reference.reference_question": "passage_embedding"
        }
      }
    }
  ]
}