How to upload an ML model in offline mode on OpenSearch 2.17?

MHR · July 15, 2025, 1:01pm

Hi everyone,

I’m using OpenSearch 2.17 and I would like to know:
What is the correct procedure to upload a machine learning model in offline mode (without internet access)?

I have already downloaded the model manually (in ONNX format).
I tested the _upload API using a file path, but I’m getting no response from OpenSearch, and the model doesn’t appear when I list models.

Thanks in advance for your help!

pablo · July 15, 2025, 2:28pm

@MHR What message did you receive? You need to enable model registration via local file in the OpenSearch cluster.

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_local_file": "true"
  }
}

MHR · July 15, 2025, 2:42pm

Hi Pablo, thanks for your answer.

I already have this configuration in the cluster:
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": "true",
    "plugins.ml_commons.only_run_on_ml_node": "true",
    "plugins.ml_commons.model_access_control_enabled": "true",
    "plugins.ml_commons.native_memory_threshold": "99"
  }
}
What's the next step, please?
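
For reference, a minimal sketch of comparing a settings dump like the one above with the setting pablo mentioned for local-file registration. The helper name `missing_settings` and the client-side dictionary comparison are illustrative assumptions, not part of ML Commons:

```python
# Settings reported as already applied (copied from the post above).
current = {
    "plugins.ml_commons.allow_registering_model_via_url": "true",
    "plugins.ml_commons.only_run_on_ml_node": "true",
    "plugins.ml_commons.model_access_control_enabled": "true",
    "plugins.ml_commons.native_memory_threshold": "99",
}

# Setting pablo pointed to for registering a model from a local file.
required = {
    "plugins.ml_commons.allow_registering_model_via_local_file": "true",
}


def missing_settings(current, required):
    """Return required keys that are absent or set to a different value.

    Hypothetical helper for illustration only.
    """
    return sorted(k for k, v in required.items() if current.get(k) != v)


missing = missing_settings(current, required)
print(missing)
```

With the values from this thread, the only entry reported is the local-file setting, which is not in the list of settings already applied.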

pablo · July 15, 2025, 3:45pm

@MHR You can use the Python module opensearch_py_ml to import the model.
Check this thread; it has a working Python script example.
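
A rough sketch of what the opensearch_py_ml route can look like. It assumes `ml_client` is an `opensearch_py_ml.ml_commons.MLCommonClient` and that its `register_model` method accepts the keyword arguments shown, which should be checked against the installed version. The wrapper name `register_local_model` is hypothetical:

```python
from pathlib import Path


def register_local_model(ml_client, model_zip, config_json):
    """Register a model from local files through an MLCommonClient-like object.

    Assumption: ml_client.register_model accepts model_path,
    model_config_path, isVerbose and deploy_model keyword arguments.
    """
    model_zip, config_json = Path(model_zip), Path(config_json)
    # Fail early with a clear message if either local file is missing.
    for path in (model_zip, config_json):
        if not path.is_file():
            raise FileNotFoundError(f"missing local file: {path}")
    # deploy_model=False keeps registration and deployment as separate steps,
    # so a deployment failure can be investigated on its own.
    return ml_client.register_model(
        model_path=str(model_zip),
        model_config_path=str(config_json),
        isVerbose=True,
        deploy_model=False,
    )
```

Passing the client in as a parameter keeps the sketch independent of how the connection to the cluster is built (hosts, TLS, credentials), which differs per environment.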

MHR · July 15, 2025, 4:05pm

I tried this solution: Offline deployment pretrained of models - Plugins / Machine Learning - OpenSearch
The models are registered in OpenSearch, but when I deploy one I get this error:
{
  "model_id": "PtfADpgBFuMKdhl6fGIn_9",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "LL4EfJJZTNWRUKK0Vv5ohg",
    "QUVT0ivpSVecE4Nrvgdptw",
    "YYu50U1AR0Glj3YkTRUlEQ"
  ],
  "create_time": 1752595176699,
  "last_update_time": 1752595177056,
  "error": """{"LL4EfJJZTNWRUKK0Vv5ohg":"Failed to deploy model","QUVT0ivpSVecE4Nrvgdptw":"Failed to deploy model","YYu50U1AR0Glj3YkTRUlEQ":"Failed to deploy model"}""",
  "is_async": true
}
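
The `error` field in a task response like this one is itself a JSON string keyed by node ID. A small sketch of unpacking it per node, where the helper name `errors_by_node` and the `"*"` fallback key are illustrative choices:

```python
import json

# Trimmed copy of the task response above, with plain ASCII quotes.
task = {
    "model_id": "PtfADpgBFuMKdhl6fGIn_9",
    "task_type": "DEPLOY_MODEL",
    "state": "FAILED",
    "error": (
        '{"LL4EfJJZTNWRUKK0Vv5ohg":"Failed to deploy model",'
        '"QUVT0ivpSVecE4Nrvgdptw":"Failed to deploy model",'
        '"YYu50U1AR0Glj3YkTRUlEQ":"Failed to deploy model"}'
    ),
}


def errors_by_node(task):
    """Return {node_id: message}; use a single "*" entry if the error is not a JSON object."""
    raw = task.get("error", "")
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return {"*": raw} if raw else {}
    return parsed if isinstance(parsed, dict) else {"*": str(parsed)}


for node, message in errors_by_node(task).items():
    print(f"{node}: {message}")
```

Here every node reports the same generic message, so the task response alone does not explain the failure.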

pablo · July 15, 2025, 4:09pm

@MHR Did you test that with 2.17 only?

MHR · July 15, 2025, 5:29pm

Yes

pablo · July 15, 2025, 6:17pm

@MHR Did you test this script with a pretrained model?
If so, could you share which one you used?

Could you also check the OpenSearch logs? There will be more information than in that deployment task.

MHR · July 16, 2025, 1:33pm

I haven't tested the script yet.

Here are the OpenSearch logs from when I deploy a model:

[ERROR]No controller is deployed because the model PtfADpgBFuMKdhl6fGIn is expected not having an enabled model controller. Please use the create model controller api to create one if this is unexpected.
[ERROR] Failed to deploy model PtfADpgBFuMKdhl6fGIn
ai.djl.engine.EngineException: Failed to save pytorch index file
[ERROR] Failed to retrieve model PtfADpgBFuMKdhl6fGIn
Caused by: Failed to save pytorch index file
Caused by: java.io.IOException: Offline model is enabled.
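
In a Java stack trace the innermost cause is usually the last `Caused by:` entry. A minimal sketch of pulling it out of an excerpt like the one above, where the helper name `root_cause` is hypothetical and a log holding several unrelated traces would need to be split first:

```python
# Excerpt of the log lines posted above.
log_text = """\
[ERROR] Failed to deploy model PtfADpgBFuMKdhl6fGIn
ai.djl.engine.EngineException: Failed to save pytorch index file
[ERROR] Failed to retrieve model PtfADpgBFuMKdhl6fGIn
Caused by: Failed to save pytorch index file
Caused by: java.io.IOException: Offline model is enabled.
"""


def root_cause(text):
    """Return the message of the last 'Caused by:' line, or None if there is none."""
    causes = [
        line.split("Caused by:", 1)[1].strip()
        for line in text.splitlines()
        if "Caused by:" in line
    ]
    return causes[-1] if causes else None


print(root_cause(log_text))
```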

pablo · July 16, 2025, 5:18pm

@MHR I missed the offline mode part. I’ve just tested it and I’m getting the same error. I even installed PyTorch with the OpenSearch image and set DJL_OFFLINE, but no joy.

This issue has already been reported on GitHub.

github.com/opensearch-project/ml-commons

[FEATURE] Improve user experience when running in environments with limited internet access

opened 01:47AM - 23 Jan 24 UTC

ArranDengate-Netapp

enhancement

For clusters in a corporate setting, internet access is often restricted with an egress firewall. However, the ML commons plugin needs internet access to download dependencies, even when using a local model. It would be good to improve the user experience in this situation.

Some ideas:

- Document the behaviour of the plugin, so the network needs of the plugin can be accommodated by the user (eg, by whitelisting known dependencies - or if dependencies will be unpredictable, we could advise avoiding this plugin in environments with restricted network access)
- Provide a way to avoid downloading dependencies during model deployment (eg, is it possible to package dependencies?)
- Improve logging so that, if downloading dependencies fails, it is clear which URL was unreachable - this would make it easier to update the whitelist.

I see this behaviour when using the `all-MiniLM-L12-v2` model locally on OpenSearch 2.11.1, using the TorchScript model file and config from the [list of pre-trained models](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#sentence-transformers), deploying from a local zip file with the steps from `opensearch-py-ml`'s [demo notebook](https://opensearch-project.github.io/opensearch-py-ml/examples/demo_ml_commons_integration.html).

I have made some suggestions based on my experience below, but I'm not sure if the ONNX model would have different dependencies than the TorchScript model, or if other models have different dependencies (eg, whether `all-mpnet-base-v2` is going to have different dependencies than `all-MiniLM-L12-v2`).

**Packaging**

When using a local Torch model on a server with restricted internet access, deploying the model fails if the server cannot access `publish.djl.ai`. In ml-commons code, this URL is mentioned by the `pytorch-engine` library. It might be possible to [package a fat jar with dependencies](https://github.com/deepjavalibrary/djl-demo/tree/master/development/fatjar) to avoid this issue?

This was [previously discussed in the OpenSearch forums](https://forum.opensearch.org/t/model-deployment-failure-with-ml-commons-plugin-in-internet-disabled-environment/15428).

**Documentation**

It would be useful to document:

- Which domains need to be whitelisted for the ML plugin to function (or if the list of dependencies is not easily predictable and varies depending on which model is used, we could document that)
- Under what circumstances the plugin needs network access (only at deploy time?)

Currently, the plugin appears to need network access to the following URLs when deploying, even when using a local model:

- publish.djl.ai/pytorch
- mlrepo.djl.ai (lack of access to this doesn't prevent this model from deploying, but generates several warnings in OpenSearch logs like `[WARN ][a.d.h.z.HfModelZoo ] [ip-172-31-58-14.ec2.internal] Failed to download Huggingface model zoo index: NLP.FILL_MASK`; not sure if this has consequences later)

**Logging**

Another way to improve this experience would be to log more information when there is a failure downloading dependencies. When deploying a local model, if an egress firewall is configured to drop packets to destinations that are not explicitly permitted, we get an error that doesn't tell us which destination we were trying to reach - from this, it is not obvious what address needs to be whitelisted.

Here are the OpenSearch logs when deploying a local model under these circumstances: ``` [2024-01-23T00:10:53,793][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 1 [2024-01-23T00:10:54,582][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 2 [2024-01-23T00:10:55,342][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 3 [2024-01-23T00:10:55,922][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 4 [2024-01-23T00:10:56,444][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 5 [2024-01-23T00:10:56,997][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 6 [2024-01-23T00:10:57,481][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 7 [2024-01-23T00:10:57,840][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 8 [2024-01-23T00:10:58,215][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 9 [2024-01-23T00:10:58,612][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 10 [2024-01-23T00:10:58,988][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 11 [2024-01-23T00:10:59,399][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index 
model successful for 786nM40BUDoVia3UznyW for chunk number 12 [2024-01-23T00:10:59,786][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 13 [2024-01-23T00:10:59,977][INFO ][o.o.m.a.u.MLModelChunkUploader] [ip-172-31-62-254.ec2.internal] Index model successful for 786nM40BUDoVia3UznyW for chunk number 14 [2024-01-23T00:11:00,014][INFO ][o.o.m.a.d.TransportDeployModelAction] [ip-172-31-62-254.ec2.internal] Will deploy model on these nodes: Q6DHrMfSTRyIEHJNDCnCsw [2024-01-23T00:11:04,963][WARN ][a.d.u.c.CudaUtils ] [ip-172-31-62-254.ec2.internal] Access denied during loading cudart library. [2024-01-23T00:11:29,623][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOY_FAILED} [2024-01-23T00:11:39,584][INFO ][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-62-254.ec2.internal] Cancel background move metadata process. [2024-01-23T00:11:39,585][INFO ][o.o.i.i.ManagedIndexCoordinator] [ip-172-31-62-254.ec2.internal] Performing move cluster state metadata. [2024-01-23T00:11:39,585][INFO ][o.o.i.i.MetadataService ] [ip-172-31-62-254.ec2.internal] Move metadata has finished. 
[2024-01-23T00:11:39,618][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOYING} [2024-01-23T00:11:59,623][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOY_FAILED} [2024-01-23T00:12:09,622][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOYING} [2024-01-23T00:12:29,621][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOY_FAILED} [2024-01-23T00:12:39,625][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOYING} [2024-01-23T00:13:09,624][INFO ][o.o.m.c.MLSyncUpCron ] [ip-172-31-62-254.ec2.internal] Refresh model state: {786nM40BUDoVia3UznyW=DEPLOY_FAILED} [2024-01-23T00:13:14,922][ERROR][o.o.m.e.a.DLModel ] [ip-172-31-62-254.ec2.internal] Failed to deploy model 786nM40BUDoVia3UznyW ai.djl.engine.EngineException: Failed to save pytorch index file at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[api-0.21.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) [opensearch-ml-algorithms-2.11.1.0.jar:?] 
at java.security.AccessController.doPrivileged(AccessController.java:569) [?:?] at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:242) [opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:138) [opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1003) [opensearch-ml-2.11.1.0.jar:2.11.1.0] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$58(MLModelManager.java:1123) [opensearch-ml-2.11.1.0.jar:2.11.1.0] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] at java.lang.Thread.run(Thread.java:833) [?:?] Caused by: java.net.ConnectException: Connection timed out at sun.nio.ch.Net.connect0(Native Method) ~[?:?] at sun.nio.ch.Net.connect(Net.java:579) ~[?:?] at sun.nio.ch.Net.connect(Net.java:568) ~[?:?] at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593) ~[?:?] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?] at java.net.Socket.connect(Socket.java:633) ~[?:?] 
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304) ~[?:?] at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:174) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:183) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:533) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:638) ~[?:?] at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266) ~[?:?] at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:380) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1665) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589) ~[?:?] at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224) ~[?:?] at java.net.URL.openStream(URL.java:1161) ~[?:?] at ai.djl.util.Utils.openUrl(Utils.java:461) ~[api-0.21.0.jar:?] at ai.djl.util.Utils.openUrl(Utils.java:445) ~[api-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:398) ~[pytorch-engine-0.21.0.jar:?] ... 22 more [2024-01-23T00:13:14,969][ERROR][o.o.m.m.MLModelManager ] [ip-172-31-62-254.ec2.internal] Failed to retrieve model 786nM40BUDoVia3UznyW org.opensearch.ml.common.exception.MLException: Failed to deploy model 786nM40BUDoVia3UznyW at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:289) ~[?:?] at java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?] 
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:242) ~[?:?] at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:138) ~[?:?] at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?] at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1003) ~[?:?] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$58(MLModelManager.java:1123) [opensearch-ml-2.11.1.0.jar:2.11.1.0] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] at java.lang.Thread.run(Thread.java:833) [?:?] Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?] at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?] at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[?:?] at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[?:?] at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?] at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[?:?] at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?] at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[?:?] 
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?] ... 14 more Caused by: java.net.ConnectException: Connection timed out at sun.nio.ch.Net.connect0(Native Method) ~[?:?] at sun.nio.ch.Net.connect(Net.java:579) ~[?:?] at sun.nio.ch.Net.connect(Net.java:568) ~[?:?] at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:593) ~[?:?] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?] at java.net.Socket.connect(Socket.java:633) ~[?:?] at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304) ~[?:?] at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:174) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:183) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:533) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:638) ~[?:?] at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266) ~[?:?] at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:380) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1665) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589) ~[?:?] at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224) ~[?:?] at java.net.URL.openStream(URL.java:1161) ~[?:?] at ai.djl.util.Utils.openUrl(Utils.java:461) ~[?:?] at ai.djl.util.Utils.openUrl(Utils.java:445) ~[?:?] at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:398) ~[?:?] 
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?] at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[?:?] at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[?:?] at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?] at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[?:?] at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?] at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[?:?] at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?] ... 14 more [2024-01-23T00:13:14,981][ERROR][o.o.m.a.f.TransportForwardAction] [ip-172-31-62-254.ec2.internal] deploy model failed on all nodes, model id: 786nM40BUDoVia3UznyW [2024-01-23T00:13:14,981][INFO ][o.o.m.a.f.TransportForwardAction] [ip-172-31-62-254.ec2.internal] deploy model done with state: DEPLOY_FAILED, model id: 786nM40BUDoVia3UznyW [2024-01-23T00:13:14,983][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [ip-172-31-62-254.ec2.internal] deploy model task done 8M6nM40BUDoVia3U7nw0 ``` Under this circumstance, GET `/_plugins/_ml/models/<model-id>` tells us the deploy failed, but does not provide a reason. (Not sure if the task API would provide more info - I couldn't see how to get opensearch-py-ml to give me the task ID.) 
``` { "name": "sentence-transformers/all-MiniLM-L12-v2", "model_group_id": "pWZUEo0BgFhXOXZgeEi_", "algorithm": "TEXT_EMBEDDING", "model_version": "11", "model_format": "TORCH_SCRIPT", "model_state": "DEPLOY_FAILED", "model_content_size_in_bytes": 134568911, "model_content_hash_value": "f8012a4e6b5da1f556221a12160d080157039f077ab85a5f6b467a47247aad49", "model_config": { "model_type": "bert", "embedding_dimension": 384, "framework_type": "SENTENCE_TRANSFORMERS", "all_config": "{\"_name_or_path\":\"microsoft/MiniLM-L12-H384-uncased\",\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":12,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}" }, "created_time": 1705968651923, "last_updated_time": 1705968794982, "last_deployed_time": 1705968794981, "total_chunks": 14, "planning_worker_node_count": 1, "current_worker_node_count": 0, "planning_worker_nodes": [ "Q6DHrMfSTRyIEHJNDCnCsw" ], "deploy_to_all_nodes": true } ``` Please note, the above is assuming that DNS is permitted. If the egress firewall is also preventing DNS, the error is more useful and does contain the domain that needs to be whitelisted: ``` [2024-01-18T05:41:30,534][ERROR][o.o.m.e.a.DLModel ] [ip-172-31-58-14.ec2.internal] Failed to deploy model W9EWG40Blv3ldtU8hMVo ai.djl.engine.EngineException: Failed to save pytorch index file at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[pytorch-engine-0.21.0.jar:?] 
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[pytorch-engine-0.21.0.jar:?] at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[api-0.21.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) [opensearch-ml-algorithms-2.11.1.0.jar:?] at java.security.AccessController.doPrivileged(AccessController.java:569) [?:?] at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:242) [opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:138) [opensearch-ml-algorithms-2.11.1.0.jar:?] at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.11.1.0.jar:?] 
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1003) [opensearch-ml-2.11.1.0.jar:2.11.1.0] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$58(MLModelManager.java:1123) [opensearch-ml-2.11.1.0.jar:2.11.1.0] at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.11.1.jar:2.11.1] at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.1.jar:2.11.1] at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] at java.lang.Thread.run(Thread.java:833) [?:?] Caused by: java.net.UnknownHostException: publish.djl.ai at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:572) ~[?:?] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?] at java.net.Socket.connect(Socket.java:633) ~[?:?] at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304) ~[?:?] at sun.security.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:174) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:183) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:533) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:638) ~[?:?] at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266) ~[?:?] at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:380) ~[?:?] 
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:193) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128) ~[?:?] at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:179) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1665) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589) ~[?:?] at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:224) ~[?:?] at java.net.URL.openStream(URL.java:1161) ~[?:?] at ai.djl.util.Utils.openUrl(Utils.java:461) ~[api-0.21.0.jar:?] at ai.djl.util.Utils.openUrl(Utils.java:445) ~[api-0.21.0.jar:?] at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:398) ~[pytorch-engine-0.21.0.jar:?] ... 22 more ```

Did you see that?
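
The issue quoted above names `publish.djl.ai` and `mlrepo.djl.ai` as hosts the plugin tries to reach at deploy time. A small sketch for checking from an ML node whether those hosts are reachable, assuming HTTPS on port 443. The function names and the injectable `connect` parameter are illustrative:

```python
import socket

# Hosts named in the GitHub issue quoted above.
DJL_HOSTS = ["publish.djl.ai", "mlrepo.djl.ai"]


def reachable(host, port=443, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
    conn.close()
    return True


def check_hosts(hosts, **kwargs):
    """Map each host to its reachability result."""
    return {host: reachable(host, **kwargs) for host in hosts}
```

Calling `check_hosts(DJL_HOSTS)` on each ML node shows which of the two hosts the firewall lets through. A `False` result does not distinguish a DNS failure from a dropped connection.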

MHR · July 17, 2025, 2:01pm

Thanks Pablo. Yes, I saw that, but it doesn't offer a solution.
Is it possible to deploy the model on a cluster with internet access, then export it and import it into another cluster without internet access?

system · September 15, 2025, 2:01pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
