Error while loading ML model in OpenSearch

Hi, I have a question. When I try to load a simple built-in model through the API in OpenSearch 2.6:

POST /_plugins/_ml/models/_upload
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

I get a response with this error:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "yBvD1Q_vSLi8si1SKUvm5Q"
  ],
  "create_time": 1681260277521,
  "last_update_time": 1681260277551,
  "error": "Native Memory Circuit Breaker is open, please check your resources!",
  "is_async": true
}

Tasks are always executed on data nodes with 32 GB RAM (28 GB for the JVM), a 4 TB HDD, and 16 CPUs.
The data nodes are under almost no load.
I think these resources should be enough to load the model.
Does anyone know what the reason for this error is?

Hey @TonyStark

I just had this happen. My fix was:

First, I checked my stats:

GET _nodes/stats/breaker

Then I found that changing this setting helped:

From

plugins.ml_commons.native_memory_threshold: 90

To

plugins.ml_commons.native_memory_threshold: 100

Restart service.
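
If you would rather not restart, I believe the same setting can also be applied dynamically through the cluster settings API (double-check the ml-commons settings docs, but something along these lines should work):

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.native_memory_threshold": 100
  }
}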

Take note, there is a reason why the docs say to set it to 90.

Shown here

Thank you, this helped solve the problem.
I set plugins.ml_commons.native_memory_threshold to 100 without restarting the nodes, since I assumed that this setting is dynamic, and tried loading the model again.

But when I execute:
GET /_plugins/_ml/tasks/W2ikc4cB2GhI_wXsfqRg

I get this response:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "CREATED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw"
  ],
  "create_time": 1681272372831,
  "last_update_time": 1681272372831,
  "is_async": true
}

The response should eventually contain a model_id. After some time, I execute the request again:
GET /_plugins/_ml/tasks/W2ikc4cB2GhI_wXsfqRg

And I get a response with the following error:
{
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw"
  ],
  "create_time": 1681272372831,
  "last_update_time": 1681272503894,
  "error": "Connection timed out",
  "is_async": true
}

So "ml_task_timeout_in_seconds": "600" must be changed :))
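
For reference, I believe the timeout can be raised with a dynamic cluster settings update along these lines (1200 is just an example value):

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.ml_task_timeout_in_seconds": 1200
  }
}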


Yeah @TonyStark

I really just did this yesterday; it took about 8 hours to understand what was happening and resolve it.
I’m fairly new at this.

However, the timeout is triggered before 600 seconds have elapsed.

Let me check my configs.

Do you have something like this?

######## Start OpenSearch Machine learning ########
plugins.ml_commons.task_dispatch_policy: round_robin
plugins.ml_commons.max_ml_task_per_node: 10
plugins.ml_commons.max_model_on_node: 10
plugins.ml_commons.sync_up_job_interval_in_seconds: 5
plugins.ml_commons.monitoring_request_count: 100
plugins.ml_commons.max_upload_model_tasks_per_node: 10
plugins.ml_commons.max_load_model_tasks_per_node: 10
plugins.ml_commons.ml_task_timeout_in_seconds: 600
plugins.ml_commons.native_memory_threshold: 100
plugins.ml_commons.only_run_on_ml_node: false

My config:

"ml_commons": {
  "task_dispatch_policy": "round_robin",
  "monitoring_request_count": "100",
  "max_model_on_node": "10",
  "sync_up_job_interval_in_seconds": "3",
  "max_ml_task_per_node": "10",
  "max_load_model_tasks_per_node": "10",
  "ml_task_timeout_in_seconds": "600",
  "max_upload_model_tasks_per_node": "10",
  "only_run_on_ml_node": "false",
  "native_memory_threshold": "100",
  "trusted_url_regex": "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~|!:,.;]*[-a-zA-Z0-9+&@#/%=~|]"
}
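
(That block is the ml_commons section of my cluster settings; pulling it with something like GET _cluster/settings?include_defaults=true should show the same values.)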

I only see a difference in "sync_up_job_interval_in_seconds".

Not sure why it's timing out. I assume you restarted OpenSearch after making the configuration changes. You have way more heap than I do. I kind of wonder if there is a cache or something.

Ok. Thank you for your help in solving this error.


Hey @TonyStark

Not sure if this will help, but here is what I did after making the config changes shown above.

The following example request uploads version 1.0.0 of a natural language processing (NLP) sentence transformation model named all-MiniLM-L6-v2:

POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

Then

OpenSearch responds with the task_id and task status:

{
  "task_id" : "ew8I44MBhyWuIwnfvDIH", 
  "status" : "CREATED"
}

This example request uses the task_id from the upload example.

GET /_plugins/_ml/tasks/ew8I44MBhyWuIwnfvDIH

OpenSearch responds with the model_id:

{
  "model_id" : "WWQI44MBbzI2oUKAvNUt", 
  "task_type" : "UPLOAD_MODEL",
  "function_name" : "TEXT_EMBEDDING",
  "state" : "COMPLETED",
  "worker_node" : "KzONM8c8T4Od-NoUANQNGg",
  "create_time" : 3455961564003,
  "last_update_time" : 3216361373241,
  "is_async" : true
}

Add the model_id to the load API:

POST /_plugins/_ml/models/<model_id>/_load
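
The load call responds with its own task_id, which you can poll the same way as the upload task until the state shows COMPLETED (placeholder ID below):

GET /_plugins/_ml/tasks/<load_task_id>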

Results: the only thing different is that I'm only running 4 GB of memory, 4 CPUs, and a 200 GB drive.

I have uploaded the model.
Have you encountered this error?

{
  "model_id": "bpNldIcBJRDoDhYPizuI",
  "task_type": "LOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "G2_UQ118RcyJEbydg2HWtw",
    "yBvD1Q_vSLi8si1SKUvm5Q"
  ],
  "create_time": 1681285508047,
  "last_update_time": 1681285642547,
  "error": """{"G2_UQ118RcyJEbydg2HWtw":"Connection timed out","yBvD1Q_vSLi8si1SKUvm5Q":"Connection timed out"}""",
  "is_async": true
}

Do you have a node with the ml role in your cluster?
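
You can check which roles each node reports with something like:

GET _cat/nodes?v

and look at the node.role column.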

Hey @TonyStark

I do

Yes. I have the same configuration for these roles :slight_smile:

Hey,

This is odd, I'm not sure what's going on. It looks like you have the model uploaded already, judging from the screenshot. Have you checked all your logs to find a clue?

EDIT: Maybe better yet, just an idea: remove those models and start over?

Yes, I have already tried deleting the model and reloading it.
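
For anyone following along, deleting an uploaded model should be possible with something like this (placeholder ID below), followed by the _upload and _load calls again:

DELETE /_plugins/_ml/models/<model_id>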

Logs from a node (roles: ml, data):

Hey,

Correct me if I'm wrong here, but do you have a two-node cluster? If so, does the second node have the same settings/configuration as the first one?