Offline deployment of pretrained models

OpenSearch 2.12.0:

Attempting to deploy a pretrained model on a server that has no internet access.
Model being deployed: https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-distilroberta-v1/1.0.1/torch_script/sentence-transformers_all-distilroberta-v1-1.0.1-torch_script.zip

I followed the steps below, as described in the Set up an ML language model documentation.

  • Prerequisite settings
PUT _cluster/settings
{
  "persistent": {
    "plugins": {
      "ml_commons": {
        "only_run_on_ml_node": "true",
        "model_access_control_enabled": "true",
        "native_memory_threshold": "99",
        "allow_registering_model_via_url": "true"
      }
    }
  }
}
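To confirm the settings took effect, they can be read back in the same Dev Tools style; `flat_settings` makes the nested keys easier to scan:

```
GET _cluster/settings?flat_settings=true
```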
  • Register a model group
POST /_plugins/_ml/model_groups/_register
{
  "name": "ml_model_group_sentence_transformers",
  "description": "A model group for sentence transformer",
  "access_mode": "public"
}
  • Check model group
GET _plugins/_ml/model_groups/jFICf48BvNyDZcmaRMm1

{
  "name": "ml_model_group_sentence_transformers",
  "latest_version": 12,
  "description": "A model group for sentence transformer",
  "owner": {
    "name": "admin",
    "backend_roles": [
      "admin"
    ],
    "roles": [
      "own_index",
      "all_access"
    ],
    "custom_attribute_names": [],
    "user_requested_tenant": "admin_tenant"
  },
  "access": "public",
  "created_time": 1715822806196,
  "last_updated_time": 1716260811689
}
  • Register a model
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-distilroberta-v1",
  "version": "1.0.1",
  "model_group_id": "IE5fX48BvNyDZcmaO4Wy",
  "model_format": "TORCH_SCRIPT"
}

  • This step failed because the plugin was trying to download the model from the internet:
{
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "_dpa206sRXeuoA6LVIVvgA"
  ],
  "create_time": 1715292140960,
  "last_update_time": 1715292273277,
  "error": "Connection timed out",
  "is_async": true
}

So I had to configure an internal mirror for the ML model artifacts, after which the following worked.

  • Registering the model again
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.1",
  "description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.",
  "model_task_type": "TEXT_EMBEDDING",
  "model_format": "TORCH_SCRIPT",
  "model_content_size_in_bytes": 91790008,
  "model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers",
    "all_config": """{"_name_or_path":"nreimers/MiniLM-L6-H384-uncased","architectures":["BertModel"],"attention_probs_dropout_prob":0.1,"gradient_checkpointing":false,"hidden_act":"gelu","hidden_dropout_prob":0.1,"hidden_size":384,"initializer_range":0.02,"intermediate_size":1536,"layer_norm_eps":1e-12,"max_position_embeddings":512,"model_type":"bert","num_attention_heads":12,"num_hidden_layers":6,"pad_token_id":0,"position_embedding_type":"absolute","transformers_version":"4.8.2","type_vocab_size":2,"use_cache":true,"vocab_size":30522}"""
  },
  "created_time": 1676328997102,
  "url": "https://some-internal-mirror.com/opensearch/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip",
  "model_group_id": "jFICf48BvNyDZcmaRMm1"
}
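When registering from a mirror, `model_content_size_in_bytes` and `model_content_hash_value` have to match the mirrored zip exactly (the hash is compared against the downloaded artifact). A minimal sketch for computing both on the mirror host, assuming a GNU/Linux userland; the zip filename is the one from the URL above, adjust to your artifact:

```shell
# Sketch: compute the two artifact fields needed when registering a mirrored model zip.
# Filename is taken from the mirror URL above; adjust to your artifact.
ZIP=sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip
if [ -f "$ZIP" ]; then
  sha256sum "$ZIP" | awk '{print $1}'   # -> model_content_hash_value
  stat -c %s "$ZIP"                     # -> model_content_size_in_bytes
fi
```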
  • Model registration completed
{
  "model_id": "kaxYmY8BPuGHzr9Two-b",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "bXBS993MSreq8GU4eW8dhw"
  ],
  "create_time": 1716264682119,
  "last_update_time": 1716264695414,
  "is_async": true
}
  • The model deployment step, however, does not work:
POST /_plugins/_ml/models/kaxYmY8BPuGHzr9Two-b/_deploy

GET /_plugins/_ml/tasks/BltZmY8BvNyDZcmadPbu
  • Initially the failure was due to the plugin's inability to download the PyTorch libraries. I installed PyTorch using pip and added environment variables pointing to its location:
export PYTORCH_LIBRARY_PATH=$HOME/.local/lib/python3.9/site-packages/torch/lib/
export PYTORCH_VERSION=1.13.1
export PYTORCH_FLAVOR=cpu

I also had to grant the Java Security Manager permissions for the above location.
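The grant was along these lines (a sketch of a Java policy entry; the path is an example and must match your PYTORCH_LIBRARY_PATH, and where the grant goes depends on how your distribution wires `-Djava.security.policy`):

```
grant {
  // Example path -- must match PYTORCH_LIBRARY_PATH on your host.
  permission java.io.FilePermission "${user.home}/.local/lib/python3.9/site-packages/torch/lib/-", "read";
  permission java.lang.RuntimePermission "loadLibrary.*";
};
```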

  • The plugin is now able to read the PyTorch shared objects, but it currently fails to load the DJL JNI library:

org.opensearch.ml.common.exception.MLException: Failed to deploy model kaxYmY8BPuGHzr9Two-b
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:294) ~[?:?]
at java.base/java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) ~[?:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1020) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: ai.djl.engine.EngineException: Cannot download jni files: https://publish.djl.ai/pytorch/1.13.1/jnilib/0.21.0/linux-x86_64/cpu/libdjl_torch.so

  • I tried adding these additional environment variables, hoping that the PyTorch engine could locate the libraries (I placed libdjl_torch.so in that path); however, it did not work:

export ENGINE_CACHE_DIR=$HOME/.djl.ai/
export DJL_OFFLINE=true
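One thing worth trying is pre-seeding DJL's cache so that no download is attempted at all. The directory layout below simply mirrors the path of the failing publish.djl.ai URL under the cache directory; that layout is an assumption and may vary across DJL versions, so verify it against the exact URL in your error message:

```shell
# Sketch: pre-seed DJL's cache with the JNI library so no download is attempted.
# The layout mirrors the failing publish.djl.ai URL path; this is an assumption
# and may differ between DJL versions -- check it against your error message.
SRC="./libdjl_torch.so"                          # the .so you obtained out-of-band
DJL_CACHE="${ENGINE_CACHE_DIR:-$HOME/.djl.ai}"
JNI_DIR="$DJL_CACHE/pytorch/1.13.1/jnilib/0.21.0/linux-x86_64/cpu"
mkdir -p "$JNI_DIR"
if [ -f "$SRC" ]; then
  cp "$SRC" "$JNI_DIR/libdjl_torch.so"
fi
```

After seeding the cache, restart OpenSearch and retry the _deploy call.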

Let me know what the procedure is for loading these libraries in offline mode. Am I missing something here?

Hello ark202, have you been able to proceed any further in your endeavour to deploy an ML model on an offline system?
I am running into similar issues.