How can we deploy an ML model (.zip) to nodes locally, without going through SSL or the firewall?

yeonghyeonKo · July 22, 2024, 10:34am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.15.0 (both OpenSearch and its Dashboards)
Kubernetes 1.25.6
Controlled by OpenSearch Operator (2.5.1)

Describe the issue:

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}

POST /_plugins/_ml/models/_register
{
    "name": "all-MiniLM-L6-v2",
    "version": "1.0.0",
    "description": "test model",
    "model_format": "TORCH_SCRIPT",
    "model_group_id": "03jj2ZABhH7d8NGgmFHv",
    "model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
       "all_config": "{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}"
    },
    "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}

The response indicates that something goes wrong (probably because of the firewall or SSL) when the OpenSearch cluster tries to fetch the ML model (.zip) from an external host:

{
  "task_type": "REGISTER_MODEL",
  "function_name": "QUESTION_ANSWERING",
  "state": "FAILED",
  "worker_node": [
    "qnOS9tufQXCgO8og8NUDbg"
  ],
  "create_time": 1721644031460,
  "last_update_time": 1721644034093,
  "error": "unable to find valid certification path to requested target",
  "is_async": true
}

I think this happens because of a firewall. Is there any way to load the .zip itself into the OpenSearch cluster (like a tokenizer plugin)?
(e.g. include the ML models in the Docker image, or mount volumes using NFS)

The error logs are below:

[2024-07-23T01:01:54,052][INFO ][o.o.m.m.MLModelManager   ] [test-opensearch-cluster-data-0] create new model meta doc _mkb3ZAB0bsFbRwv7vBw for register model task VXgb3ZABhH7d8NGg7ZSU
[2024-07-23T01:01:54,878][ERROR][o.o.m.m.MLModelManager   ] [test-opensearch-cluster-data-0] Failed to index chunk file
java.security.PrivilegedActionException: null
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:575) ~[?:?]
	at org.opensearch.ml.engine.ModelHelper.downloadAndSplit(ModelHelper.java:267) [opensearch-ml-algorithms-2.15.0.0.jar:?]
	at org.opensearch.ml.model.MLModelManager.registerModel(MLModelManager.java:724) [opensearch-ml-2.15.0.0.jar:2.15.0.0]
	at org.opensearch.ml.model.MLModelManager.lambda$registerModelFromUrl$31(MLModelManager.java:699) [opensearch-ml-2.15.0.0.jar:2.15.0.0]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.15.0.jar:2.15.0]
	at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.15.0.jar:2.15.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:941) [opensearch-2.15.0.jar:2.15.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.15.0.jar:2.15.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62) ~[?:?]
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502) ~[?:?]
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486) ~[?:?]
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:2055) ~[?:?]
	at java.base/sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:2050) ~[?:?]
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:2049) ~[?:?]
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1619) ~[?:?]
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1599) ~[?:?]
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:223) ~[?:?]
	at ai.djl.training.util.DownloadUtils.download(DownloadUtils.java:78) ~[?:?]
	at ai.djl.training.util.DownloadUtils.download(DownloadUtils.java:52) ~[?:?]
	at ai.djl.training.util.DownloadUtils.download(DownloadUtils.java:52) ~[?:?]
	at org.opensearch.ml.engine.ModelHelper.lambda$downloadAndSplit$3(ModelHelper.java:273) ~[?:?]
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
	... 10 more
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) ~[?:?]
...

I found some issues similar to mine, but they weren't helpful.

My OpenSearch cluster is managed by the OpenSearch Operator, and the manifest (CRD) looks like this:

opensearchCluster:
  enabled: true
  general:
    httpPort: "9200"
    image: harbor-srep01.xxx.com/library/opensearchproject/opensearch:v2.15.0
    serviceName: "test-opensearch-cluster"
    drainDataNodes: true
    # https://github.com/opensearch-project/opensearch-k8s-operator/blob/main/docs/userguide/main.md#security-context-for-pods-and-containers
    setVMMaxMapCount: true # In some cases, set general.setVMMaxMapCount to false as this feature also launches an init container with root
    podSecurityContext:
      runAsUser: 1000
      runAsGroup: 1000
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
  # https://github.com/opensearch-project/opensearch-k8s-operator/blob/main/docs/userguide/main.md#deal-with-max-virtual-memory-areas-vmmax_map_count-errors
  # https://github.com/opensearch-project/opensearch-k8s-operator/blob/main/docs/userguide/main.md#custom-init-helper
  initHelper:
    image: "harbor-srep01.xxx.com/nexus/docker-mig/library/busybox:1.31.1"
    imagePullPolicy: IfNotPresent
  dashboards:
    enable: true
    replicas: 1
    image: harbor-srep01.xxx.com/library/opensearchproject/opensearch-dashboards:v2.15.0
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "1Gi"
        cpu: "500m"
    tls:
      enable: false
    opensearchCredentialsSecret:
      name: admin-credentials-secret
    additionalConfig:
      # https://opensearch.org/docs/latest/install-and-configure/install-dashboards/tls/
      opensearch.ssl.verificationMode: none
  nodePools:
    - component: master
      replicas: 3
      pdb:
        enable: false
        # enable: true
        # minAvailable: 1
      diskSize: "10Gi"
      persistence:
        pvc:
          storageClass: "sc-nfs-app-retain"
          accessModes:
           - ReadWriteOnce
      roles:
        - "cluster_manager"
        - "master"
      # https://github.com/opensearch-project/opensearch-k8s-operator/issues/669#issuecomment-1829833573
      # Suggestion: 1000m CPU & 2048Mi memory
      resources:
        requests:
          memory: "4Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
          value: "hcpOss12~!"
    - component: data
      replicas: 2
      diskSize: "100Gi"
      persistence:
        pvc:
          storageClass: "sc-nfs-app-retain"
          accessModes:
           - ReadWriteOnce
      roles:
        - "data"
        - "ingest"
        - "ml"
      resources:
        requests:
          memory: "8Gi"
          cpu: "2"
        limits:
          memory: "8Gi"
          cpu: "4"
      env:
        - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
          value: "hcpOss12~!"
  security:
    tls:
      transport:
        generate: true
        perNode: true
      # https://opensearch-project.github.io/opensearch-k8s-operator/docs/userguide/main.html#node-httprest-api
      http:
        generate: true
    config:
      adminCredentialsSecret: # these are the admin credentials for the Operator to use
         name: admin-credentials-secret
      securityConfigSecret:  # this is the whole security configuration for OpenSearch
         name: securityconfig-secret

yeonghyeonKo · July 23, 2024, 2:47am

I downloaded the ML model file (.zip) from the OpenSearch artifacts site and hosted it on a private CDN, so that registration could get around the SSL verification or firewall problem.

FROM : http://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip

TO : http://[your_cdn_or_web_server]/dl/rpBR9Pqb/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip

POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_group_id": "03jj2ZABhH7d8NGgmFHv",
  "model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers",
    "all_config": "{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}"
  },
  "url": "http://xxx.com/api/public/dl/rpBR9Pqb/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}
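
Before registering from a mirrored URL, it may be worth confirming that the copied .zip still matches the model_content_hash_value in the request, which is a SHA-256 checksum of the file. The following is a minimal sketch, not part of my original steps; the function names are made up for illustration and the file path you pass in is whatever you saved the .zip as.

```python
import hashlib

# Hash copied from the register request above.
EXPECTED_SHA256 = "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f"


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a ~90 MB model zip is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

Calling matches_expected("sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip") on the downloaded copy should return True if the mirror served the file unmodified.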

yeonghyeonKo · July 23, 2024, 5:08am

But the real problem happens when deploying the model (already registered).
(My ML-only node id is cS8nMP3RQYy2TYWR_s0BVg)

nodeId name                             node.roles      ip            cpu heap.percent heap.max
6Jhm   test-opensearch-cluster-master-1 cluster_manager 10.251.44.199   2           17      2gb
cS8n   test-opensearch-cluster-ml-0     ml              10.251.30.255   0            8     12gb
BncZ   test-opensearch-cluster-master-0 cluster_manager 10.251.60.48    0           10      2gb
-sDd   test-opensearch-cluster-master-2 cluster_manager 10.251.30.105   1           47      2gb
qnOS   test-opensearch-cluster-data-0   data,ingest     10.251.44.226   0           19      4gb
Je4t   test-opensearch-cluster-data-1   data,ingest     10.251.60.96    0           50      4gb

The exact steps I followed were:

1. POST /_plugins/_ml/models/_register

2. GET /_plugins/_ml/tasks/tVrx3ZABzAOtP-lzn-7

# response
{
  "model_id": "IUjs3ZABcmsla-1ey5tq",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "cS8nMP3RQYy2TYWR_s0BVg"
  ],
  "create_time": 1721710530235,
  "last_update_time": 1721710544558,
  "is_async": true
}

3. GET /_plugins/_ml/models/IUjs3ZABcmsla-1ey5tq

# response
{
  "name": "all-MiniLM-L6-v2",
  "model_group_id": "03jj2ZABhH7d8NGgmFHv",
  "algorithm": "TEXT_EMBEDDING",
  "model_version": "10",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_state": "DEPLOY_FAILED",
  "model_content_size_in_bytes": 91790008,
  "model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
  "model_config": {
    ...
  },
  "created_time": 1721710201701,
  "last_updated_time": 1721710453773,
  "last_registered_time": 1721710214983,
  "last_deployed_time": 1721710453773,
  "total_chunks": 10,
  "planning_worker_node_count": 1,
  "current_worker_node_count": 0,
  "planning_worker_nodes": [
    "cS8nMP3RQYy2TYWR_s0BVg"
  ],
  "deploy_to_all_nodes": true,
  "is_hidden": false
}

4. POST /_plugins/_ml/models/IUjs3ZABcmsla-1ey5tq/_deploy

# response 
{
  "model_id": "IUjs3ZABcmsla-1ey5tq",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "cS8nMP3RQYy2TYWR_s0BVg"
  ],
  "create_time": 1721710444627,
  "last_update_time": 1721710453773,
  "error": """{"cS8nMP3RQYy2TYWR_s0BVg":"unable to find valid certification path to requested target"}""",
  "is_async": true
}

5. POST /_plugins/_ml/_predict/text_embedding/IUjs3ZABcmsla-1ey5tq

# request
{
  "text_docs":[ "today is sunny"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

# response
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Model not ready yet. Please deploy the model first."
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Model not ready yet. Please deploy the model first."
  },
  "status": 400
}
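
The two failed task responses in this thread report errors in different shapes: the failed REGISTER_MODEL task has a plain string in "error", while the failed DEPLOY_MODEL task in step 4 has a JSON object encoded as a string, keyed by node ID. A small helper like the one below could normalize both when checking tasks from a script. This is only a sketch based on the two responses shown here; per_node_errors is a hypothetical name, not an ML Commons API.

```python
import json


def per_node_errors(task):
    """Map node ID -> error message for an ML task response (as a dict)."""
    raw = task.get("error")
    if not raw:
        return {}
    try:
        # Deploy tasks encode per-node errors as a JSON object in a string.
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        parsed = None
    if isinstance(parsed, dict):
        return {str(node): str(message) for node, message in parsed.items()}
    # Plain-string error: attribute it to the task's worker nodes.
    workers = task.get("worker_node") or ["unknown"]
    return {node: str(raw) for node in workers}
```

For the step 4 response, this returns a single entry for node cS8nMP3RQYy2TYWR_s0BVg with the certification path message.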

There is a query parameter (deploy=true) for the POST _register API, but it doesn't help because it performs the same operation as the POST _deploy API.

Even in the official OpenSearch documentation, I can't find any option for deploying the model that avoids the "unable to find valid certification path to requested target" error.

yeonghyeonKo · July 24, 2024, 5:28am

@Gsmitt Sorry for the confusion. Although I registered the ML model using a private CDN URL, deployment is the real problem.

OpenSearch uses the DJL library to load models, and during deployment DJL downloads the native PyTorch libraries through its downloadPyTorch (code) and openUrl (code) methods.

Do you have any idea for it?
(ref: Model deployment failure with ml-commons plugin in internet disabled environment - #3 by amank)

While trying to deploy a model, the connection used to download PyTorch fails (I think this is because the deployment phase doesn't use the private CDN).

test-opensearch-cluster-ml-0 opensearch [2024-07-24T05:23:58,955][INFO ][o.o.m.m.MLModelManager   ] [test-opensearch-cluster-ml-0] Initializing the rate limiter with setting 4.0 per MINUTES (TPS limit 0.06666666666666667), evenly distributed on 1 nodes
test-opensearch-cluster-ml-0 opensearch [2024-07-24T05:23:58,955][INFO ][o.o.m.m.MLModelManager   ] [test-opensearch-cluster-ml-0] Initializing the rate limiter with setting 4.0 per MINUTES (TPS limit 0.06666666666666667), evenly distributed on 1 nodes
test-opensearch-cluster-ml-0 opensearch [2024-07-24T05:23:58,955][INFO ][o.o.m.m.MLModelManager   ] [test-opensearch-cluster-ml-0] Successfully redeployed model controller for model Hd-83pABm0qUjmUv3l96
test-opensearch-cluster-ml-0 opensearch [2024-07-24T05:24:03,470][WARN ][a.d.p.j.LibUtils         ] [test-opensearch-cluster-ml-0] Override PyTorch version: 1.13.1.
test-opensearch-cluster-ml-0 opensearch [2024-07-24T05:24:03,553][ERROR][o.o.m.e.a.DLModel        ] [test-opensearch-cluster-ml-0] Failed to deploy model Hd-83pABm0qUjmUv3l96
test-opensearch-cluster-ml-0 opensearch ai.djl.engine.EngineException: Failed to save pytorch index file
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[pytorch-engine-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[api-0.28.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:188) ~[opensearch-ml-algorithms-2.15.0.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) [opensearch-ml-algorithms-2.15.0.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) [?:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) [opensearch-ml-algorithms-2.15.0.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) [opensearch-ml-algorithms-2.15.0.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.15.0.0.jar:?]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1067) [opensearch-ml-2.15.0.0.jar:2.15.0.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.15.0.jar:2.15.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$73(MLModelManager.java:1680) [opensearch-ml-2.15.0.0.jar:2.15.0.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.15.0.jar:2.15.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.15.0.jar:2.15.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:941) [opensearch-2.15.0.jar:2.15.0]
test-opensearch-cluster-ml-0 opensearch         at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.15.0.jar:2.15.0]
test-opensearch-cluster-ml-0 opensearch         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
test-opensearch-cluster-ml-0 opensearch         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
test-opensearch-cluster-ml-0 opensearch         at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
test-opensearch-cluster-ml-0 opensearch Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Gsmitt · July 24, 2024, 9:10pm

Hey @yeonghyeonKo

Did you tag the right person?

zane_neo · August 2, 2024, 2:49am

@yeonghyeonKo Try configuring a system environment variable as described here: djl/engines/pytorch/pytorch-engine/src/main/java/ai/djl/pytorch/jni/LibUtils.java (master branch of deepjavalibrary/djl on GitHub). This can avoid the network call.

yeonghyeonKo · August 2, 2024, 5:52am

@zane_neo
Yes, DJL provides an option for offline environments (link).

However, I'm not sure whether PYTORCH_LIBRARY_PATH (along with PYTORCH_VERSION and PYTORCH_FLAVOR) can be specified when you are working through OpenSearch Dashboards.

In the loadModel method of DLModel.java, OpenSearch calls DJL's getEngine with the engine name as a string argument (PyTorch or OnnxRuntime) (code).

I deployed the OpenSearch cluster with the k8s operator, which allows injecting environment variables into ML nodes (opensearchCluster.nodePools.env).
For example:

spec:
  nodePools:
    - component: ml
      replicas: 1
      env:
        - name: PYTORCH_LIBRARY_PATH 
          value: "/usr/lib/python3.10/site-packages/torch/lib"
      roles:
        - "ml"
      diskSize: "10Gi"

Is this what you meant, @zane_neo?
If it is correct, then I think the /usr/lib/python3.10/site-packages/torch/lib path could be backed by a PVC volume.

zane_neo · August 2, 2024, 8:56am

PYTORCH_LIBRARY_PATH cannot be configured via Dashboards. The configuration you provided should work; please give it a try.

yeonghyeonKo · August 19, 2024, 1:32am

@zane_neo Before trying to inject the library files for the ML model through spec.nodePools[].env, I needed to check how many files, and which ones, have to be included. So I tested two environments:

one is a public internet environment, tested at home, and
the other is a restricted environment on the company's closed network.
(There is probably a proxy issue when the Java process downloads files.)

First, when I registered and deployed the ML model in the public internet environment, I saw no SSL errors or exceptions, and the model was successfully deployed to all nodes in the OpenSearch cluster. Inside the pod running the ML node, two new folders (pytorch and tokenizers) appeared under opensearch/data/ml_cache, containing the files below.

$ tree

├── pytorch
│   ├── 1.13.1-cpu-precxx11-linux-x86_64
│   │   ├── 0.28.0-libdjl_torch.so
│   │   ├── libc10.so
│   │   ├── libgomp-a34b3233.so.1
│   │   ├── libstdc++.so.6
│   │   ├── libtorch.so
│   │   └── libtorch_cpu.so
│   └── 1.13.1.txt
└── tokenizers
    └── 0.19.1-0.28.0-linux-x86_64
        └── libtokenizers.so

So I copied both folders, with all their files, from the container to my local machine and then transferred them to the closed-network environment. Finally, I used the commands below to copy them from the local machine into the containers.

$ docker cp pytorch/ opensearch-node1:/usr/share/opensearch/data/ml_cache
$ docker cp pytorch/ opensearch-node2:/usr/share/opensearch/data/ml_cache
$ docker cp tokenizers/ opensearch-node1:/usr/share/opensearch/data/ml_cache
$ docker cp tokenizers/ opensearch-node2:/usr/share/opensearch/data/ml_cache

With this method, we can easily bake the PyTorch and tokenizer files into a new Docker image, like this:

FROM opensearchproject/opensearch:2.16.0

# Switch to root user for installation
USER root

# Set the working directory
WORKDIR /usr/share/opensearch

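# Copy the cached native libraries to the same path used by the docker cp commands above.
# COPY copies a directory's contents, so the destination must name the subfolder.
COPY ./pytorch /usr/share/opensearch/data/ml_cache/pytorch
COPY ./tokenizers /usr/share/opensearch/data/ml_cache/tokenizers

# Make sure the opensearch user owns the data directory and can read the copied files
RUN chown -R opensearch:opensearch /usr/share/opensearch/data && \
    chmod -R u+rX /usr/share/opensearch/data/ml_cache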

# Switch back to the default non-root user
USER opensearch

In this scenario, without using spec.nodePools[].env, I was able to deploy ML models in the closed-network environment.
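
To double-check a node before deploying, something like the following could confirm that every file from the tree listing above is present under ml_cache. It is a rough sketch: the file list is copied from my listing and is specific to DJL 0.28.0, PyTorch 1.13.1 (CPU) and linux-x86_64, so other versions will need different names, and missing_cache_files is just an illustrative helper.

```python
from pathlib import Path

# Relative paths taken from the tree listing above (version-specific).
EXPECTED_FILES = [
    "pytorch/1.13.1.txt",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/0.28.0-libdjl_torch.so",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/libc10.so",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/libgomp-a34b3233.so.1",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/libtorch.so",
    "pytorch/1.13.1-cpu-precxx11-linux-x86_64/libtorch_cpu.so",
    "tokenizers/0.19.1-0.28.0-linux-x86_64/libtokenizers.so",
]


def missing_cache_files(ml_cache_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under ml_cache_dir."""
    root = Path(ml_cache_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Running missing_cache_files("/usr/share/opensearch/data/ml_cache") inside the container should return an empty list when the copy is complete.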
