Errors when deploying ML models to an OpenSearch cluster

Tao1 · June 20, 2024, 7:11pm

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

opensearch-2.10.0

Describe the issue:

When trying to deploy an ML model, all-MiniLM-L6-v2_torchscript_sentence-transformer.zip, to an OpenSearch cluster, the deployment failed with the following task status:

"model_id": "Jg8BN5ABuctAWnb9HkNM",
"task_type": "DEPLOY_MODEL",
"function_name": "TEXT_EMBEDDING",
"state": "FAILED",
"worker_node": [
"TINBM5k4SAmLgccLRqWh-g",
"ezXgFqZfQMuLWK7NuT38ww",
"QcU9XVpeSN27JRqo7qHlTw"
],
"create_time": 1718909826109,
"last_update_time": 1718909828811,
"error": """{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable","ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable","QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}""",
"is_async": true

Configuration:
The all-MiniLM-L6-v2_torchscript_sentence-transformer.zip file was downloaded to local disk and registered successfully using opensearch_py_ml, but the model shows a “not responding” status and the error is “Network is unreachable”.

Is this because the model somehow has to reach a public internet URL? The OpenSearch cluster is hosted on an internal network and has no internet access.

Relevant Logs or Screenshots:
model_path = "all-MiniLM-L6-v2_torchscript_sentence-transformer.zip"
model_config_path = "sentence-transformer.json"
model_id_file_system = ml_client.register_model(model_path, model_config_path, isVerbose=True)

Task status -
{
"model_id": "Jg8BN5ABuctAWnb9HkNM",
"task_type": "DEPLOY_MODEL",
"function_name": "TEXT_EMBEDDING",
"state": "FAILED",
"worker_node": [
"TINBM5k4SAmLgccLRqWh-g",
"ezXgFqZfQMuLWK7NuT38ww",
"QcU9XVpeSN27JRqo7qHlTw"
],
"create_time": 1718909826109,
"last_update_time": 1718909828811,
"error": """{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable","ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable","QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}""",
"is_async": true
}
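
For anyone comparing their own failed task against this one, here is a minimal sketch of how the "error" field can be read per node. It assumes the task status has already been fetched and turned into a Python dict; node_errors is a hypothetical helper written for this illustration, not part of opensearch_py_ml or ML Commons.

```python
import json

# Task status copied from the post. The "error" value is itself a
# JSON-encoded string that maps each worker node ID to its failure message.
task = {
    "model_id": "Jg8BN5ABuctAWnb9HkNM",
    "task_type": "DEPLOY_MODEL",
    "function_name": "TEXT_EMBEDDING",
    "state": "FAILED",
    "worker_node": [
        "TINBM5k4SAmLgccLRqWh-g",
        "ezXgFqZfQMuLWK7NuT38ww",
        "QcU9XVpeSN27JRqo7qHlTw",
    ],
    "error": (
        '{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable",'
        '"ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable",'
        '"QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}'
    ),
    "is_async": True,
}


def node_errors(task):
    """Hypothetical helper: return {node_id: message} for a failed task."""
    if task.get("state") != "FAILED":
        return {}
    raw = task.get("error", "")
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not every failure is reported as a per-node JSON map.
        return {"(unparsed)": str(raw)}
    if isinstance(parsed, dict):
        return {node: str(message) for node, message in parsed.items()}
    return {"(unparsed)": str(raw)}


errors = node_errors(task)
# Worker nodes that did not report an error, and the set of distinct messages.
missing = [node for node in task["worker_node"] if node not in errors]
distinct = set(errors.values())

for node, message in errors.items():
    print(f"{node}: {message}")
print("distinct messages:", distinct)
print("worker nodes without an error entry:", missing)
```

Here every worker node reports the same message, which points at something common to all nodes (such as the cluster's network access) rather than a problem on one machine.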

[2024-06-20T14:57:08,809][ERROR][o.o.m.m.MLModelManager ] [] Failed to retrieve model Jg8BN5ABuctAWnb9HkNM
org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:135) ~[?:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$35(MLModelManager.java:804) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$41(MLModelManager.java:924) [opensearch-ml-2.10.0.0.jar:2.10.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[?:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[?:?]
at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:220) ~[?:?]
… 14 more
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]

Thanks in advance for the help!

yeonghyeonKo · July 24, 2024, 7:15am

I’m dealing with a problem very similar to yours.

The model was registered successfully from a locally downloaded file, but it couldn’t be deployed to the ML nodes (“not responding” status).

Did you find a solution? I think your issue is caused by a firewall blocking the network connection between your environment and the public internet.
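
One way to check that hypothesis against the log in the original post is to list its "Caused by" chain together with the first frame under each exception. The sketch below does this for a shortened excerpt of that stack trace; cause_chain is a hypothetical helper written for this illustration, and the regular expressions only assume the usual Java stack trace layout.

```python
import re

# Shortened excerpt of the stack trace from the original post.
log_excerpt = """\
org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?]
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
"""

# "<class>Exception: message" or "<class>Error: message", optionally
# prefixed with "Caused by: ".
EXC = re.compile(r"^(?:Caused by: )?([\w.$]+(?:Exception|Error)): (.*)$")
# "at <qualified.method>(File.java:123)"
FRAME = re.compile(r"^at ([\w.$<>]+)\(")


def cause_chain(trace):
    """Hypothetical helper: list each exception with its topmost frame."""
    chain = []
    for line in trace.splitlines():
        line = line.strip()
        exc = EXC.match(line)
        if exc:
            chain.append(
                {"exception": exc.group(1), "message": exc.group(2), "thrown_at": None}
            )
            continue
        frame = FRAME.match(line)
        # Keep only the first frame after each exception line.
        if frame and chain and chain[-1]["thrown_at"] is None:
            chain[-1]["thrown_at"] = frame.group(1)
    return chain


chain = cause_chain(log_excerpt)
for link in chain:
    print(f'{link["exception"]}: {link["message"]} (at {link["thrown_at"]})')
```

Read this way, the root cause is the java.net.SocketException, and the exception just above it is thrown from ai.djl.pytorch.jni.LibUtils.downloadPyTorch. The frame name suggests the outbound connection is attempted while DJL looks for the PyTorch native library, not while the registered model zip is being read, which would be consistent with the cluster having no internet access.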
