Error When Loading Embedding Model Into Memory

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

  • AWS OpenSearch Service 2.7 Dashboard
  • Google Chrome

Describe the issue:

I’m attempting to load a pretrained embedding model into memory. First, I register it with:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
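The register call itself is asynchronous and responds with a task ID (a sketch; the status value is illustrative, and the task ID matches the one polled below):

{
  "task_id": "k4w9r4kBBiUmBL-z11AC",
  "status": "CREATED"
}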

The registration completes successfully, and I’m provided with a model ID when I check the task:

GET /_plugins/_ml/tasks/k4w9r4kBBiUmBL-z11AC

Response:

{
  "model_id": "lIw9r4kBBiUmBL-z3FC-",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "l8xgkWdfQNGhuez3MPMAqw"
  ],
  "create_time": 1690862212827,
  "last_update_time": 1690862310966,
  "is_async": true
}

However, when I try to load this model into memory using this:

POST /_plugins/_ml/models/lIw9r4kBBiUmBL-z3FC-/_load

I’m given this response (from polling the resulting task):

{
  "model_id": "lIw9r4kBBiUmBL-z3FC-",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "l8xgkWdfQNGhuez3MPMAqw",
    "NNPgI_0_QUaENi_weYIkNQ"
  ],
  "create_time": 1690863161315,
  "last_update_time": 1690863312176,
  "error": "{\"l8xgkWdfQNGhuez3MPMAqw\":\"model content changed\",\"NNPgI_0_QUaENi_weYIkNQ\":\"model content changed\"}",
  "is_async": true
}

I cannot find any information about what this error message means. Also, I was previously able to load this model successfully and use it to create vector embeddings, which I still have saved in an index.

Are you running on macOS? There is a known issue: [BUG] Model content hash can't match original hash value · Issue #844 · opensearch-project/ml-commons · GitHub

No, this is being run on AWS. I saw the post you referenced as well, but I think it concerns something different.

Got it. Can you share your cluster settings, such as how many data nodes and which EC2 instance types? We need to reproduce the error to dive deep.

NOTE: I just changed this to 4 nodes but it was 2 nodes during the issue outlined.

Did you use dedicated master node?

Did changing to 4 nodes solve the problem?

The problem seems intermittent: sometimes the model loads successfully, other times it doesn’t. I’ve also tried using the _deploy API rather than _load, which I read somewhere is an alternative approach, but it doesn’t seem to matter.
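For reference, the _deploy call mirrors _load (same model ID as above) and likewise returns a task ID to poll:

POST /_plugins/_ml/models/lIw9r4kBBiUmBL-z3FC-/_deploy

GET /_plugins/_ml/tasks/<task_id returned by _deploy>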

I was able to reproduce the issue on my end. In my case, the first time I invoked the _load API the model was partially loaded and I was able to generate embeddings. But when I invoked the _unload API and then invoked _load again, I saw the issue.

I’ll try to deep dive more into this issue.

In the meantime, I tried to reproduce this issue with bigger instances, but couldn’t reproduce it there.

We are transitioning from _load to _deploy. In the long run, _load will be deprecated.

On a related note, another thing I’ve noticed is that a model loaded into memory (verified using GET /_plugins/_ml/profile/models) will sometimes seemingly unload on its own after a given amount of time (again verified using GET /_plugins/_ml/profile/models, which then returns an empty response of {}). The exact calls are sketched after the questions below.

  • Is this behavior expected?
  • Is this due to idle time?
  • Is it best practice to _unload a model after use?
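For reference, here are the calls in question, using the model ID from earlier in the thread; the explicit _unload at the end is what I mean by unloading after use:

GET /_plugins/_ml/profile/models

POST /_plugins/_ml/models/lIw9r4kBBiUmBL-z3FC-/_unload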

@dhrubo may help reproduce and dive deep into this problem too.

I haven’t done any research yet, but I guess it may be caused by the small EC2 instance type. t3.small.search has just 2 vCPUs and 2 GB of memory, which looks too constrained to run a model. That could cause some unexpected error that unloads the model.
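If memory pressure is the cause, one thing worth checking is the ML Commons native memory circuit breaker, plugins.ml_commons.native_memory_threshold (a documented ML Commons cluster setting, default 90; note that managed AWS domains may not allow changing it). A minimal sketch of inspecting and raising it:

GET /_cluster/settings?include_defaults=true&filter_path=defaults.plugins.ml_commons.native_memory_threshold

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.native_memory_threshold": 95
  }
}

That said, raising the threshold only masks memory pressure; a bigger instance is the more direct fix.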

@ylwu Yes, I upgraded the instance and it seems to be performing better without throwing the error this time.
