Model deployment fails with the ml-commons plugin in an environment without internet access

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch - 2.8.0
Server OS - Oracle Linux Server 7.9

Describe the issue:

I'm trying to run an OpenSearch cluster with the ml-commons plugin in an environment where internet access is disabled. When I deploy a PyTorch model from a local file, the ml-commons plugin tries to download the PyTorch engine from the internet. Is there a way to build ml-commons so that all of its dependencies are already present in the tar/zip that is plugged into the OpenSearch image? Could we create a fat JAR so that the plugin never reaches out to the internet internally?

API used:

The issue occurs at deployment, after the model has been registered successfully with the API call below:

POST _plugins/_ml/models/_register
{
  "name": "sentence-transformers/all-MiniLM-L6-v2",
  "model_group_id": "tcV5fsf43751lzeNBC-wj",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT",
  "model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "file:///usr/data/model/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}
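
Since the model zip is copied into the offline environment by hand, it can be worth ruling out a corrupted copy before looking at deployment. The model_content_hash_value field is expected to be the SHA-256 checksum of the zip file, so a minimal sketch along these lines could compare the two. The helper names sha256_of_file and matches_registered_hash are made up for illustration and are not part of ml-commons:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the lowercase hex SHA-256 digest of the file at `path`.

    The file is read in chunks so a large model archive does not have
    to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registered_hash(path, expected_hex):
    """Compare the file's digest with the value sent in the register request.

    Hypothetical helper: `expected_hex` is whatever was passed as
    model_content_hash_value; case and surrounding whitespace are ignored.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_registered_hash with the local path behind the file:// URL and the hash from the request should return True if the archive is intact. This only checks the model file; it says nothing about the PyTorch native library that fails to load in the logs below.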

Relevant Logs or Screenshots:

ai.djl.engine.EngineException: Failed to load PyTorch native library
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:82)
	at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40)
	at ai.djl.engine.Engine.getEngine(Engine.java:181)
	at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:209)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:176)
	at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:135)
	at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:116)
	at org.opensearch.ml.model.MLModelManager.lambda$deployModel$29(MLModelManager.java:671)
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
	at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$35(MLModelManager.java:776)
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
	at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalStateException: Failed to save pytorch index file
	at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:399)
	at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:278)
	at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:87)
	at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:75)
	at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53)
	... 17 more
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.base/java.net.Socket.connect(Socket.java:608)
	at org.bouncycastle.jsse.provider.ProvSSLSocketDirect.connect(ProvSSLSocketDirect.java:170)
	at java.base/java.net.Socket.connect(Socket.java:557)
	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:510)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:605)
	at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265)
	at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372)
	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:207)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1071)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1069)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:795)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1068)
	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:193)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1512)
	at java.base/sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1510)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:795)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1509)
	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250)
	at java.base/java.net.URL.openStream(URL.java:1165)
	at ai.djl.util.Utils.openUrl(Utils.java:459)
	at ai.djl.util.Utils.openUrl(Utils.java:443)
	at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:394)
	... 21 more
 Failed to deploy model zS8RtIkBdyx9l2hZOMXg
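
Reading the trace from the bottom up, the innermost cause is the java.net.ConnectException raised while ai.djl.pytorch.jni.LibUtils.downloadPyTorch tries to open a URL, which is consistent with the node having no internet access. For longer logs, a rough sketch like the following could pull out the exception chain. The function name cause_chain and the regular expression are assumptions made for this example and only cover the "Exception"/"Error" naming seen in the trace above:

```python
import re

# Matches the first line of a Java stack trace and any "Caused by:" line,
# capturing the exception class name and its message.
CAUSE_RE = re.compile(
    r"^\s*(?:Caused by: )?([\w.$]+(?:Exception|Error)): (.*)$"
)


def cause_chain(log_text):
    """Return [(exception_class, message), ...] from outermost to innermost."""
    chain = []
    for line in log_text.splitlines():
        # Skip stack frames such as "at ai.djl.pytorch.engine.PtEngine...".
        if line.lstrip().startswith("at "):
            continue
        match = CAUSE_RE.match(line)
        if match:
            chain.append((match.group(1), match.group(2)))
    return chain
```

The last element of the returned list is the root cause; for the log above that would be the connection timeout rather than the EngineException reported at the top.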

Below are some useful articles I found:

@amank Thanks for trying this feature. ml-commons doesn't package all related engine dependencies in the release package. Feel free to open a GitHub issue about this in the opensearch-project/ml-commons repository.

Can you also explain why you have to run the cluster this way? That would help us understand how users will use this feature.
BTW, do you see any issues on a cluster that has internet access?


Thank you @ylwu for confirming that not all related engine dependencies are included in the release package.

I'm trying to set up the OpenSearch cluster with various plugins in a region without internet access. I haven't used it in a region where internet access is enabled, so I cannot comment on that yet. I'll share the details if I do so in the future.
