ML model has to be re-deployed each time the ML node is restarted

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch Version: 2.16 (on-premise)
OS: Ubuntu 22.04
Cluster Nodes: 3 Master, 5 Data, 2 Ingest, 1 ML
Model: huggingface/sentence-transformers/all-mpnet-base-v2

Describe the issue:

Any time I restart my OpenSearch ML node, I have to re-deploy all of my models. I have configured the cluster so that ML runs only on one node.

Configuration:

plugins.ml_commons.only_run_on_ml_node: true
plugins.ml_commons.task_dispatch_policy: round_robin
plugins.ml_commons.max_ml_task_per_node: 10
plugins.ml_commons.max_model_on_node: 10
plugins.ml_commons.sync_up_job_interval_in_seconds: 3
plugins.ml_commons.monitoring_request_count: 100
plugins.ml_commons.max_register_model_tasks_per_node: 10
plugins.ml_commons.max_deploy_model_tasks_per_node: 10
plugins.ml_commons.allow_registering_model_via_url: false
plugins.ml_commons.allow_registering_model_via_local_file: false
plugins.ml_commons.ml_task_timeout_in_seconds: 600
plugins.ml_commons.native_memory_threshold: 90
plugins.ml_commons.allow_custom_deployment_plan: false
plugins.ml_commons.model_auto_redeploy.enable: true
plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 10
plugins.ml_commons.model_auto_redeploy_success_ratio: 1
plugins.ml_commons.enable_inhouse_python_model: false
plugins.ml_commons.connector_access_control_enabled: true
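
One thing worth ruling out: values in opensearch.yml can be overridden by persistent or transient cluster settings set through the API, so the effective ml_commons settings may differ from the file above. A quick way to check (the filter pattern is just my guess at a convenient filter, any standard filter_path works):

GET /_cluster/settings?include_defaults=true&flat_settings=true&filter_path=**.plugins.ml_commons*

Anything returned under "persistent" or "transient" takes precedence over the yml values.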

Relevant Logs or Screenshots:

Run the following:

POST /_plugins/_ml/_predict/text_embedding/Rt3xuJEBma2YeRGkgGm2
{
  "text_docs":[ "today is sunny"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Model not ready yet. Please deploy the model first."
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Model not ready yet. Please deploy the model first."
  },
  "status": 400
}
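
For reference, before re-deploying, the model's current state and assigned worker nodes can also be checked directly with the standard get-model API (same model id as in the _predict call above):

GET /_plugins/_ml/models/Rt3xuJEBma2YeRGkgGm2

When the model is healthy, the response should show model_state as DEPLOYED along with the worker node assignment; after a restart it instead reflects the failed state described below.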

In OpenSearch Dashboards the model shows as not responding. This query:

POST /_plugins/_ml/models/_search
{
  "query": {
        "bool": {
            "must": [
              {"term": {"model_group_id": "defTuJEBqJ_hscxSw4UL"}}
            ]
        }
  }
}

shows that my model_state is DEPLOYED_FAILED.

If we run the following: POST _plugins/_ml/models/Rt3xuJEBma2YeRGkgGm2/_deploy

The model then shows as responding in the UI, and the above _predict call returns sentence embeddings. Every time the node is restarted, this step needs to be performed. Deploying the model is usually quite quick (< 60 secs).
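
As a stopgap until auto-redeploy works, I'm considering triggering that same _deploy call from a systemd drop-in on the ML node. This is just a sketch for my setup (the unit name, delay, and endpoint are assumptions, not anything official):

# /etc/systemd/system/opensearch.service.d/redeploy.conf
[Service]
ExecStartPost=/bin/sh -c 'sleep 60; curl -s -XPOST http://localhost:9200/_plugins/_ml/models/Rt3xuJEBma2YeRGkgGm2/_deploy'

It simply re-runs the manual fix after the service comes up, so it doesn't explain why auto-redeploy isn't kicking in.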

In my logs I see:

[2024-09-03T18:13:01,567][INFO ][o.o.m.a.s.TransportSyncUpOnNodeAction] [ml01] ML model not in cache. Remove all of its cache files. model id: Rt3xuJEBma2YeRGkgGm2

I’m not sure where the cache is supposed to be.

Can you check whether the path opensearch/data/ml_cache contains files like the ones below?

├── pytorch
│   ├── 1.13.1-cpu-precxx11-linux-x86_64
│   │   ├── 0.28.0-libdjl_torch.so
│   │   ├── libc10.so
│   │   ├── libgomp-a34b3233.so.1
│   │   ├── libstdc++.so.6
│   │   ├── libtorch.so
│   │   └── libtorch_cpu.so
│   └── 1.13.1.txt
└── tokenizers
    └── 0.19.1-0.28.0-linux-x86_64
        └── libtokenizers.so

The folder does exist. I stopped the service, renamed the folder to ml_cache.bak, and restarted the service. After I re-deployed the model, the following was logged:

[2024-09-04T13:49:22,023][ERROR][o.o.m.m.MLModelManager   ] [osml] No controller is deployed because the model Rt3xuJEBma2YeRGkgGm2 is expected not having an enabled model controller. Please use the create model controller api to create one if this is unexpected.
[2024-09-04T13:49:45,421][WARN ][a.d.u.Ec2Utils           ] [osml] Security manager doesn't allow access file
[2024-09-04T13:49:45,432][WARN ][a.d.u.c.CudaUtils        ] [osml] Access denied during loading cudart library.
[2024-09-04T13:49:45,433][WARN ][a.d.p.j.LibUtils         ] [osml] Override PyTorch version: 1.13.1.
[2024-09-04T13:49:45,556][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading https://publish.djl.ai/pytorch/1.13.1/cpu-precxx11/linux-x86_64/native/lib/libgomp-a34b3233.so.1.gz ...
[2024-09-04T13:49:45,612][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading https://publish.djl.ai/pytorch/1.13.1/cpu-precxx11/linux-x86_64/native/lib/libc10.so.gz ...
[2024-09-04T13:49:45,652][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading https://publish.djl.ai/pytorch/1.13.1/cpu-precxx11/linux-x86_64/native/lib/libtorch_cpu.so.gz ...
[2024-09-04T13:49:51,564][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading https://publish.djl.ai/pytorch/1.13.1/cpu-precxx11/linux-x86_64/native/lib/libtorch.so.gz ...
[2024-09-04T13:49:51,593][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading https://publish.djl.ai/pytorch/1.13.1/cpu-precxx11/linux-x86_64/native/lib/libstdc%2B%2B.so.6.gz ...
[2024-09-04T13:49:52,863][INFO ][a.d.p.j.LibUtils         ] [osml] Downloading jni https://publish.djl.ai/pytorch/1.13.1/jnilib/0.28.0/linux-x86_64/cpu-precxx11/libdjl_torch.so to cache ...
[2024-09-04T13:49:53,397][INFO ][a.d.p.e.PtEngine         ] [osml] PyTorch graph executor optimizer is enabled, this may impact your inference latency and throughput. See: https://docs.djl.ai/docs/development/inference_performance_optimization.html#graph-executor-optimization

After this, the model shows as ready and works, but after a restart it stops working again. I checked the data folder and the ml_cache folder had been re-created.
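
To see whether auto-redeploy even attempted anything after the restart, the model profile on the node can be inspected (this is the standard ML Commons profile API, nothing specific to my setup):

GET /_plugins/_ml/profile/models/Rt3xuJEBma2YeRGkgGm2

If the model doesn't appear under the ML node's section of the profile, the deploy never reached the node, which would point at the auto-redeploy logic rather than the cache files themselves.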

In the directory I see:

ls -la ./data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/
total 521720
drwx------ 2 opensearch opensearch      4096 Sep  4 13:50 .
drwxr-xr-x 3 opensearch opensearch      4096 Sep  4 13:50 ..
-rw-r--r-- 1 opensearch opensearch   3503320 Sep  4 13:50 0.28.0-libdjl_torch.so
-rw-r--r-- 1 opensearch opensearch    903161 Sep  4 13:49 libc10.so
-rw-r--r-- 1 opensearch opensearch    168721 Sep  4 13:49 libgomp-a34b3233.so.1
-rw-r--r-- 1 opensearch opensearch    995840 Sep  4 13:50 libstdc++.so.6
-rw-r--r-- 1 opensearch opensearch      7192 Sep  4 13:50 libtorch.so
-rw-r--r-- 1 opensearch opensearch 526535313 Sep  4 13:50 libtorch_cpu.so