Error while loading ML model in OpenSearch

Hi, I have a question. When I try to load a simple built-in model through the API in OpenSearch 2.6:

POST /_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

I get a response with this error:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "yBvD1Q_vSLi8si1SKUvm5Q"
  ],
  "create_time": 1681260277521,
  "last_update_time": 1681260277551,
  "error": "Native Memory Circuit Breaker is open, please check your resources!",
  "is_async": true
}

Tasks are always executed on data nodes with 32 GB RAM (28 GB for the JVM), a 4 TB HDD, and 16 CPUs.
The data nodes are under almost no load.
I think these resources should be enough to load the model.
Does anyone know what the reason for this error is?

Hey @TonyStark

I just had this happen. My fix was:

First, I checked my stats:

GET _nodes/stats/breaker

Then I found that changing this setting helped:

From

plugins.ml_commons.native_memory_threshold: 90

To

plugins.ml_commons.native_memory_threshold: 100

Restart service.
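
If you would rather not restart, I believe the same setting can also be applied dynamically through the cluster settings API (double-check the ml-commons settings docs, but something along these lines should work):

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.native_memory_threshold": 100
  }
}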

Take note, there is a reason why the docs say to set it to 90.

Shown here

Thank you, this helped solve the problem.
I set plugins.ml_commons.native_memory_threshold to 100 without restarting the nodes, since I assumed that this setting is dynamic, and tried loading the model again.

But when I execute:
GET /_plugins/_ml/tasks/W2ikc4cB2GhI_wXsfqRg

I get this response:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "CREATED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw"
  ],
  "create_time": 1681272372831,
  "last_update_time": 1681272372831,
  "is_async": true
}

The response should eventually contain a model_id. After some time, I execute the request again:
GET /_plugins/_ml/tasks/W2ikc4cB2GhI_wXsfqRg

And I get a response with the following error:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw"
  ],
  "create_time": 1681272372831,
  "last_update_time": 1681272503894,
  "error": "Connection timed out",
  "is_async": true
}

So "ml_task_timeout_in_seconds": "600" must be changed :))
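
For reference, I believe the timeout can be raised with a dynamic cluster settings update along these lines (1200 is just an example value):

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.ml_task_timeout_in_seconds": 1200
  }
}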


Yeah @TonyStark

I really just did this yesterday; it took about 8 hours to understand what was happening and resolve it.
I’m fairly new at this.

However, the timeout is triggered before 600 seconds have elapsed.

Let me check my configs.

Do you have something like this?

######## Start OpenSearch Machine learning ########
plugins.ml_commons.task_dispatch_policy: round_robin
plugins.ml_commons.max_ml_task_per_node: 10
plugins.ml_commons.max_model_on_node: 10
plugins.ml_commons.sync_up_job_interval_in_seconds: 5
plugins.ml_commons.monitoring_request_count: 100
plugins.ml_commons.max_upload_model_tasks_per_node: 10
plugins.ml_commons.max_load_model_tasks_per_node: 10
plugins.ml_commons.ml_task_timeout_in_seconds: 600
plugins.ml_commons.native_memory_threshold: 100
plugins.ml_commons.only_run_on_ml_node: false

My config:

"ml_commons": {
  "task_dispatch_policy": "round_robin",
  "monitoring_request_count": "100",
  "max_model_on_node": "10",
  "sync_up_job_interval_in_seconds": "3",
  "max_ml_task_per_node": "10",
  "max_load_model_tasks_per_node": "10",
  "ml_task_timeout_in_seconds": "600",
  "max_upload_model_tasks_per_node": "10",
  "only_run_on_ml_node": "false",
  "native_memory_threshold": "100",
  "trusted_url_regex": "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~|!:,.;]*[-a-zA-Z0-9+&@#/%=~|]"
}
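
(That block is the ml_commons section of my cluster settings; pulling it with something like GET _cluster/settings?include_defaults=true should show the same values.)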

I only see a difference in "sync_up_job_interval_in_seconds".

Not sure why it's timing out. I assume you restarted OpenSearch after making the configuration changes. You have way more heap than I do. I kind of wonder if there is a cache or something.

Ok. Thank you for your help in solving this error.


Hey @TonyStark

Not sure if this will help, but here is what I did after making the config changes shown above.

The following example request uploads version 1.0.0 of a natural language processing (NLP) sentence transformation model named all-MiniLM-L6-v2:

POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

Then

OpenSearch responds with the task_id and task status:

{
  "task_id" : "ew8I44MBhyWuIwnfvDIH", 
  "status" : "CREATED"
}

This example request uses the task_id from the upload example.

GET /_plugins/_ml/tasks/ew8I44MBhyWuIwnfvDIH

OpenSearch responds with the model_id:

{
  "model_id" : "WWQI44MBbzI2oUKAvNUt", 
  "task_type" : "UPLOAD_MODEL",
  "function_name" : "TEXT_EMBEDDING",
  "state" : "COMPLETED",
  "worker_node" : "KzONM8c8T4Od-NoUANQNGg",
  "create_time" : 3455961564003,
  "last_update_time" : 3216361373241,
  "is_async" : true
}

Add the model_id to the load API:

POST /_plugins/_ml/models/<model_id>/_load
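
The load call responds with its own task_id, which you can poll the same way as the upload task until the state shows COMPLETED (placeholder ID below):

GET /_plugins/_ml/tasks/<load_task_id>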

Results: the only thing different is that I'm only running 4 GB of memory, 4 CPUs, and a 200 GB drive.

I have uploaded the model.
Have you encountered this error?

{
  "model_id": "bpNldIcBJRDoDhYPizuI",
  "task_type": "LOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw",
    "yBvD1Q_vSLi8si1SKUvm5Q"
  ],
  "create_time": 1681285508047,
  "last_update_time": 1681285642547,
  "error": """{"G2_UQ118RcyJEbydg2HWtw":"Connection timed out","yBvD1Q_vSLi8si1SKUvm5Q":"Connection timed out"}""",
  "is_async": true
}

Do you have a node with the ml role in your cluster?
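
You can check which roles each node reports with something like:

GET _cat/nodes?v

and look at the node.role column.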

Hey @TonyStark

I do

Yes. I have the same configuration for these roles :slight_smile:

Hey,

This is odd, I'm not sure what's going on. It looks like you have the model uploaded already, judging from the screenshot. Have you checked all your logs to find a clue?

EDIT: Maybe better yet, just an idea: remove those models and start over?

Yes, I have already tried deleting the model and reloading it.
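
For anyone following along, deleting an uploaded model should be possible with something like this (placeholder ID below), followed by the _upload and _load calls again:

DELETE /_plugins/_ml/models/<model_id>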

Logs from a node (roles: ml, data):

Hey,

Correct me if I'm wrong here, but do you have a two-node cluster? If so, does the second node have the same settings/configuration as the first one?