Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Opensearch Version: 2.16 On-premise
OS: Ubuntu 22.04
Cluster Nodes: 3 Master, 5 Data, 2 Ingest, 1 ML
Model: huggingface/sentence-transformers/all-mpnet-base-v2
Describe the issue:
Anytime I restart my Opensearch ML Node I have to re-deploy all my models. I have configured my cluster so ML only runs on one node.
Configuration:
plugins.ml_commons.only_run_on_ml_node: true
plugins.ml_commons.task_dispatch_policy: round_robin
plugins.ml_commons.max_ml_task_per_node: 10
plugins.ml_commons.max_model_on_node: 10
plugins.ml_commons.sync_up_job_interval_in_seconds: 3
plugins.ml_commons.monitoring_request_count: 100
plugins.ml_commons.max_register_model_tasks_per_node: 10
plugins.ml_commons.max_deploy_model_tasks_per_node: 10
plugins.ml_commons.allow_registering_model_via_url: false
plugins.ml_commons.allow_registering_model_via_local_file: false
plugins.ml_commons.ml_task_timeout_in_seconds: 600
plugins.ml_commons.native_memory_threshold: 90
plugins.ml_commons.allow_custom_deployment_plan: false
plugins.ml_commons.model_auto_redeploy.enable: true
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 10
plugins.ml_commons.model_auto_redeploy_success_ratio: 1
plugins.ml_commons.enable_inhouse_python_model: false
plugins.ml_commons.connector_access_control_enabled: true
Relevant Logs or Screenshots:
Run the following:
POST /_plugins/_ml/_predict/text_embedding/Rt3xuJEBma2YeRGkgGm2
{
"text_docs":[ "today is sunny"],
"return_number": true,
"target_response": ["sentence_embedding"]
}
Response:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Model not ready yet. Please deploy the model first."
}
],
"type": "illegal_argument_exception",
"reason": "Model not ready yet. Please deploy the model first."
},
"status": 400
}
In Opensearch Dashboard it shows the model is not responding. This query:
POST /_plugins/_ml/models/_search
{
"query": {
"bool": {
"must": [
{"term": {"model_group_id": "defTuJEBqJ_hscxSw4UL"}}
]
}
}
}
Shows my model_state is DEPLOYED_FAILED
If we run the following: POST _plugins/_ml/models/Rt3xuJEBma2YeRGkgGm2/_deploy
The model shows as responding in the UI and the above _pedict call returns some sentence embeddings. Every time the node is restarted, this step needs to be performed. The time it takes to _deploy the model is quite quick(< 60 secs) usually.
In my logs I see:
[2024-09-03T18:13:01,567][INFO ][o.o.m.a.s.TransportSyncUpOnNodeAction] [ml01] ML model not in cache. Remove all of its cache files. model id: Rt3xuJEBma2YeRGkgGm2
I’m not sure where the cache is suppose to be.