Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): Latest, running in Docker
Describe the issue:
I am trying to follow the example to create a RAG chatbot with a conversational flow agent in Dev Tools. I have the following model:
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}
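For completeness, after registration I retrieve the model ID from the task API and deploy the model, roughly as below (the `<task_id>` and `<model_id>` here are placeholders for the IDs returned in my run):

```
# _register returns a task_id; poll the task to get the model_id
GET /_plugins/_ml/tasks/<task_id>

# once the task reports COMPLETED, deploy the model
POST /_plugins/_ml/models/<model_id>/_deploy
```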
With the following pipeline:
PUT /_ingest/pipeline/test_population_data_pipeline
{
  "description": "text embedding pipeline",
  "processors": [
    {
      "html_strip": {
        "field": "SubmissionStoryContent",
        "target_field": "SubmissionStoryContent_clean"
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 100,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "SubmissionSubject": "SubmissionSubject_chunk",
          "SubmissionStoryContent_clean": "SubmissionStoryContent_chunk"
        }
      },
      "text_embedding": {
        "model_id": "9W7z4pgBK5X7Z9B4zPri",
        "field_map": {
          "SubmissionSubject_chunk": "SubmissionSubject_embedding",
          "SubmissionStoryContent_chunk": "SubmissionStoryContent_embedding"
        }
      }
    }
  ]
}
Basically I have two fields, SubmissionSubject and SubmissionStoryContent. I strip HTML from SubmissionStoryContent -> SubmissionStoryContent_clean, then use the chunking processor to create chunks for both fields, and finally vectorize them using the model deployed earlier. When I simulate the pipeline, all looks good.
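The simulate call I use is along these lines (the sample field values below are made up for illustration):

```
POST /_ingest/pipeline/test_population_data_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "SubmissionSubject": "Example subject",
        "SubmissionStoryContent": "<p>Example story content.</p>"
      }
    }
  ]
}
```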
The following is the index I want to use:
PUT test_population_data1
{
  "settings": {
    "default_pipeline": "test_population_data_pipeline",
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "SubmissionSubject": {
        "type": "text"
      },
      "SubmissionSubject_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "SubmissionStoryContent": {
        "type": "text"
      },
      "SubmissionStoryContent_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      }
    }
  }
}
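For context, I send documents with a plain index request like the one below; the field values shown are placeholders, and the ingest pipeline is applied automatically via the index's `default_pipeline` setting:

```
POST test_population_data1/_doc
{
  "SubmissionSubject": "Example subject",
  "SubmissionStoryContent": "<p>Example story content.</p>"
}
```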
However, when I try to send a document to the index, I receive the following error:
....."index": {
  "_index": "test_population_data1",
  "_id": "JG6Z45gBK5X7Z9B4lPtk",
  "status": 400,
  "error": {
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [SubmissionSubject_embedding] of type [knn_vector] in document with id 'JG6Z45gBK5X7Z9B4lPtk'. Preview of field's value: '{knn=[-0.08113378, 0.0...............caused_by": {
      "type": "json_parse_exception",
      "reason": """Current token (START_OBJECT) not numeric, can not use numeric value accessors
 at [Source: REDACTED (StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION disabled); line: 1, column: 248]
I could not figure out where the issue is. Any ideas would be appreciated.