Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 2.13
Describe the issue:
We have 500 TB of customer events that we perform k-NN search on. All of the data lives in indexes with an ingest pipeline set up like this:
"processors": [{
"text_embedding": {
"model_id": "some-model-id",
"field_map": {
"searchText": "searchTextEmbedding"
}
}
}]
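For context, the full pipeline definition is registered roughly along these lines (pipeline-knn-models is the name I reference in the reindex call further down; the description is just illustrative):

PUT _ingest/pipeline/pipeline-knn-models
{
  "description": "Embed searchText for k-NN search",
  "processors": [
    {
      "text_embedding": {
        "model_id": "some-model-id",
        "field_map": {
          "searchText": "searchTextEmbedding"
        }
      }
    }
  ]
}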
Later on, we might want to experiment with a new model and run the two side by side. I’m thinking the pipeline could be updated to:
"processors": [{
"text_embedding": {
"model_id": "some-model-id",
"field_map": {
"searchText": "searchTextEmbedding_Original"
}
}
}, {
"text_embedding": {
"model_id": "new-model-id",
"field_map": {
"searchText": "searchTextEmbedding_New"
}
}
}]
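For the side-by-side comparison, I assume we could then run the same search as a neural query against each embedding field, something like the following (index name, query text, and k are just placeholders), and likewise against searchTextEmbedding_Original with some-model-id:

GET customer-events/_search
{
  "query": {
    "neural": {
      "searchTextEmbedding_New": {
        "query_text": "example customer query",
        "model_id": "new-model-id",
        "k": 10
      }
    }
  }
}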
Now this ingest pipeline would add the new embedding field to incoming data, but the existing 500 TB would not be backfilled. To re-embed all the data with the new model, I found the reindex operation with a pipeline option (Reindex data - OpenSearch Documentation), which might look like:
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "destination",
    "pipeline": "pipeline-knn-models"
  }
}
and then I could delete the source indexes.
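Given the volume, I assume we’d want to run the reindex asynchronously and sliced, roughly like this (the slicing and batch-size values are just my guesses at sensible settings), and then monitor progress through the tasks API:

POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {
    "index": "source",
    "size": 500
  },
  "dest": {
    "index": "destination",
    "pipeline": "pipeline-knn-models"
  }
}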
Is there another way to approach introducing additional models (and backfilling them) across the entire dataset?