Adding and using custom embedding models from Hugging Face

Versions:
OpenSearch 2.11
opensearch-py 2.4.2
opensearch-py-ml 1.1.0
transformers 4.34.1
pytorch 2.1

The issue:
I want to add the intfloat/multilingual-e5-small model to OpenSearch, and I have two problems:

  1. The main one is that the multilingual-e5-small model is trained to expect a "query: " or "passage: " prefix on the text before it is passed to the model. Does OpenSearch support this? I'm looking for something that lets us store the text field as normal (without the prefix) but adds the "passage: " prefix just before the field is embedded. Is there an ingest processor that I can add to the ingest pipeline to do this?

  2. The second problem is in uploading and deploying the model to OpenSearch using the opensearch-py-ml library. This is my code where I load the model and save it in ONNX format following this example:

from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

os_client = OpenSearch(...)
ml_client = MLCommonClient(os_client)

model_hf_id = "intfloat/multilingual-e5-small"
folder_path = "./models"
model_group_id = "..."  # ID of an existing model group

# Download the model from Hugging Face, export it to ONNX, and zip it
embedding_model = SentenceTransformerModel(model_id=model_hf_id, folder_path=folder_path, overwrite=True)
model_path_onnx = embedding_model.save_as_onnx(model_id=model_hf_id)
# Generate the model config JSON that goes with the zip
model_config_path_onnx = embedding_model.make_model_config_json(model_format="ONNX")

# Upload and register the model without deploying it, then deploy it explicitly
model_id = ml_client.register_model(model_path_onnx, model_config_path_onnx, isVerbose=True, deploy_model=False, model_group_id=model_group_id)
os_client.http.post(f"/_plugins/_ml/models/{model_id}/_deploy")

This is the output of model_path_onnx = embedding_model.save_as_onnx(model_id=model_hf_id):

ONNX opset version set to: 15
Loading pipeline (model: intfloat/multilingual-e5-small, tokenizer: intfloat/multilingual-e5-small)
Using framework PyTorch: 2.1.0
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
token_type_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask']
zip file is saved to  ./models/multilingual-e5-small.zip 

The model is uploaded and registered successfully (50 chunks). But when I deploy it, the task fails after a few minutes with this error:
Input mismatch, looking for: [input_ids, attention_mask]
But if I don't export the ONNX model file myself and instead take the ONNX file published on Hugging Face and zip it together with the config JSON generated by my code, I get a model that is 28 chunks and deploys successfully. I wonder whether I'm doing something wrong when I save the model in ONNX format myself, or whether this is a bug.
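One way to narrow down the difference between the two archives (50 chunks vs. 28 chunks) is to compare what each zip actually contains. The following is only a rough diagnostic sketch using the Python standard library. The helper names summarize_zip and diff_zips are made up for illustration, and the path of the hand-built zip is a placeholder.

```python
import zipfile


def summarize_zip(zip_path):
    """Return {member name: uncompressed size in bytes} for a model zip."""
    with zipfile.ZipFile(zip_path) as zf:
        return {info.filename: info.file_size for info in zf.infolist()}


def diff_zips(path_a, path_b):
    """Return {name: (size_in_a, size_in_b)} for members that differ.

    A size of None means the member is missing from that archive.
    """
    a = summarize_zip(path_a)
    b = summarize_zip(path_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Example usage (paths are assumptions; adjust to your own files):
# generated = "./models/multilingual-e5-small.zip"   # from save_as_onnx
# hand_built = "./models/hand_built.zip"             # hypothetical path
# for name, sizes in diff_zips(generated, hand_built).items():
#     print(name, sizes)
```

A large size gap in the .onnx member, or a tokenizer file that is present in only one archive, would be useful detail to include in a bug report.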

I would appreciate your help and suggestions regarding these two problems.

Hi @Alireza ,

  1. Currently we don’t support this. But here’s a PR which is going to address this issue.

  2. Seems like a bug. Could you please open an issue in the opensearch-project/opensearch-py-ml repository on GitHub?

Thanks
Dhrubo
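
Until built-in support for prefixes is available, a possible workaround for problem 1 is to build the prefixed text in a temporary field inside the ingest pipeline, embed that field, and then remove it. The sketch below only builds the request bodies. The function names, the field names (text, text_embedding), and the pipeline ID in the usage comment are assumptions for illustration. It relies on the standard set and remove ingest processors and the neural-search text_embedding processor, and it assumes the set processor resolves the {{{field}}} template as expected on your version, so please verify it on a test index first.

```python
def build_prefixed_embedding_pipeline(model_id, text_field="text",
                                      embedding_field="text_embedding",
                                      prefix="passage: "):
    """Build an ingest pipeline body that embeds prefix + text.

    The stored text field is left untouched; the prefixed copy lives only
    in a temporary field that is removed at the end of the pipeline.
    """
    tmp_field = text_field + "_prefixed"  # hypothetical temporary field name
    return {
        "description": "Embed text with a model-specific prefix",
        "processors": [
            # 1. Copy the text into a temporary field with the prefix added.
            {"set": {"field": tmp_field,
                     "value": prefix + "{{{" + text_field + "}}}"}},
            # 2. Embed the prefixed copy into the vector field.
            {"text_embedding": {"model_id": model_id,
                                "field_map": {tmp_field: embedding_field}}},
            # 3. Drop the temporary field so only the original text is stored.
            {"remove": {"field": tmp_field}},
        ],
    }


def build_neural_query(query_text, model_id, embedding_field="text_embedding",
                       k=10, prefix="query: "):
    """Build a neural search body with the query prefix added client-side."""
    return {
        "query": {
            "neural": {
                embedding_field: {
                    "query_text": prefix + query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        }
    }


# Example usage against a cluster (IDs and index name are placeholders):
# pipeline = build_prefixed_embedding_pipeline(model_id)
# os_client.ingest.put_pipeline(id="e5-passage-pipeline", body=pipeline)
# os_client.search(index="my-index", body=build_neural_query("hello", model_id))
```

On the search side there is no pipeline involved in this sketch, so the "query: " prefix is simply prepended to the query text before it is sent.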