Adding and using custom embedding models from Hugging Face

Alireza · December 25, 2023, 10:30am

Versions:
OpenSearch 2.11
opensearch-py 2.4.2
opensearch-py-ml 1.1.0
transformers 4.34.1
pytorch 2.1

The issue:
I want to add the intfloat/multilingual-e5-small model to OpenSearch. I have two problems:

1. The main one is that the multilingual-e5-small model is trained such that it needs a "query: " or "passage: " prefix added to the text before the text is passed to the model. Does OpenSearch support this? I am looking for something that lets us store the text field as normal (without the prefix) but adds the prefix "passage: " just before the field is embedded. Is there an ingest processor that I can add to the ingest pipeline to do this before embedding?
2. The second problem is with uploading and deploying the model to OpenSearch using the opensearch-py-ml library. This is my code, where I load the model and save it in ONNX format following this example:

from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

os_client = OpenSearch(...)
ml_client = MLCommonClient(os_client)

model_hf_id = "intfloat/multilingual-e5-small"
folder_path = "./models"
# model_group_id was not defined in the original snippet; it must hold the ID of an existing model group.
model_group_id = "..."

# Download the model from Hugging Face, export it to ONNX, and write the matching config JSON.
embedding_model = SentenceTransformerModel(model_id=model_hf_id, folder_path=folder_path, overwrite=True)
model_path_onnx = embedding_model.save_as_onnx(model_id=model_hf_id)
model_config_path_onnx = embedding_model.make_model_config_json(model_format="ONNX")

# Register (upload) the model without deploying it, then deploy it in a separate request.
model_id = ml_client.register_model(model_path_onnx, model_config_path_onnx, isVerbose=True, deploy_model=False, model_group_id=model_group_id)
os_client.http.post(f"/_plugins/_ml/models/{model_id}/_deploy")

This is the output of model_path_onnx = embedding_model.save_as_onnx(model_id=model_hf_id):

ONNX opset version set to: 15
Loading pipeline (model: intfloat/multilingual-e5-small, tokenizer: intfloat/multilingual-e5-small)
Using framework PyTorch: 2.1.0
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
token_type_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask']
zip file is saved to  ./models/multilingual-e5-small.zip
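
The log above lists the input names the exporter generated. One way to double-check what a saved .onnx file actually declares is to read the inputs back and compare them with the names you expect. The following is a minimal sketch, not part of opensearch-py-ml: the helper names are made up for illustration, and it assumes onnxruntime is installed and that you have extracted the .onnx file from the zip. It only reports what the file declares; it does not by itself explain any deployment behaviour.

```python
from typing import Dict, List, Sequence


def compare_input_names(expected: Sequence[str], actual: Sequence[str]) -> Dict[str, List[str]]:
    """Return the names that appear on only one side, keeping their order."""
    return {
        "missing": [name for name in expected if name not in actual],
        "unexpected": [name for name in actual if name not in expected],
    }


def read_onnx_input_names(onnx_path: str) -> List[str]:
    """Read the declared graph inputs of an ONNX file (hypothetical helper)."""
    # Imported lazily so compare_input_names works without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    return [model_input.name for model_input in session.get_inputs()]


if __name__ == "__main__":
    # Names taken from the export log above.
    exported = ["input_ids", "attention_mask"]
    # A BERT-style input list, used here only as an example of a different expectation.
    bert_style = ["input_ids", "attention_mask", "token_type_ids"]
    print(compare_input_names(bert_style, exported))
```

With a real file, `read_onnx_input_names("path/to/model.onnx")` (path is a placeholder) can replace the hard-coded `exported` list.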

The model gets uploaded and registered successfully (50 chunks). But when I deploy the model, after a few minutes the task fails with this error:
Input mismatch, looking for: [input_ids, attention_mask]
But if I don't export the ONNX model file myself, and instead take the ONNX file published on Hugging Face together with the config JSON generated by the code above and zip them myself, I get a model that is 28 chunks and deploys successfully. I wonder whether I'm doing something wrong when I save the model in ONNX format myself, or whether this is a bug.
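
For anyone who wants to reproduce the manual packaging step, here is a rough sketch using only the Python standard library. The function name and file names are hypothetical, and which files the archive must contain is an assumption (here: the ONNX file and tokenizer.json at the top level of the zip, with the config JSON passed separately to register_model as in the code above). Adjust the file list to whatever your setup requires.

```python
import zipfile
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def bundle_model_zip(files: Iterable[PathLike], zip_path: PathLike) -> Path:
    """Write the given files into a zip archive with flat (top-level) names."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            file = Path(file)
            if not file.is_file():
                raise FileNotFoundError(f"Missing file: {file}")
            # arcname=file.name drops the directory part so entries sit at the top level.
            archive.write(file, arcname=file.name)
    return zip_path


# Example call (file names are placeholders, not taken from the thread):
# bundle_model_zip(["./hf_download/model.onnx", "./hf_download/tokenizer.json"],
#                  "./models/multilingual-e5-small-manual.zip")
```

The resulting path can then be passed to register_model in place of model_path_onnx.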

I would appreciate your help and suggestions regarding these two problems.

dhrubo · February 20, 2024, 10:17pm

Hi @Alireza ,

1. Currently we don't support this, but there is a PR that is going to address it.
2. This looks like a bug. Could you please open an issue in the opensearch-project/opensearch-py-ml repository on GitHub?

Thanks
Dhrubo
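
Until built-in prefix support is available, one possible workaround for the first problem is to build the prefixed text in a temporary field inside the ingest pipeline, embed that field, and then drop it, so the stored text stays unprefixed. The sketch below only builds the request bodies as Python dicts. The function names, field names, and pipeline layout are illustrative assumptions, and it has not been tested against a cluster. It relies on the set, text_embedding, and remove ingest processors and on the neural query type.

```python
from typing import Any, Dict


def build_prefixed_embedding_pipeline(
    model_id: str,
    text_field: str = "text",
    embedding_field: str = "text_embedding",
    prefix: str = "passage: ",
) -> Dict[str, Any]:
    """Build an ingest pipeline body that embeds prefix + text without storing the prefix."""
    tmp_field = f"{text_field}_prefixed"
    return {
        "description": "Embed text with an E5-style prefix (illustrative sketch)",
        "processors": [
            # Copy the text into a temporary field with the prefix prepended.
            {"set": {"field": tmp_field, "value": prefix + "{{{" + text_field + "}}}"}},
            # Embed the temporary field into the target vector field.
            {"text_embedding": {"model_id": model_id, "field_map": {tmp_field: embedding_field}}},
            # Remove the temporary field so only the original text is stored.
            {"remove": {"field": tmp_field}},
        ],
    }


def build_neural_query(
    model_id: str,
    query_text: str,
    embedding_field: str = "text_embedding",
    k: int = 10,
    prefix: str = "query: ",
) -> Dict[str, Any]:
    """Build a neural query body with the query-side prefix added on the client."""
    return {
        "query": {
            "neural": {
                embedding_field: {
                    "query_text": prefix + query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        }
    }


# Example use with an opensearch-py client (pipeline ID and index name are placeholders):
# os_client.ingest.put_pipeline(id="e5-prefix-pipeline", body=build_prefixed_embedding_pipeline(model_id))
# os_client.search(index="my-index", body=build_neural_query(model_id, "how do I deploy a model?"))
```

The query-side prefix is added on the client here because the search request carries the query text directly. The index also needs a knn_vector mapping for the embedding field, which is not shown.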
