Errors when deploying ML models to an OpenSearch cluster

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

opensearch-2.10.0

Describe the issue:

When trying to deploy an ML model (all-MiniLM-L6-v2_torchscript_sentence-transformer.zip) to an OpenSearch cluster, the deployment failed with the following task status:

{
  "model_id": "Jg8BN5ABuctAWnb9HkNM",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "TINBM5k4SAmLgccLRqWh-g",
    "ezXgFqZfQMuLWK7NuT38ww",
    "QcU9XVpeSN27JRqo7qHlTw"
  ],
  "create_time": 1718909826109,
  "last_update_time": 1718909828811,
  "error": """{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable","ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable","QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}""",
  "is_async": true
}
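
The "error" field in the task status is itself a JSON string keyed by worker node ID, so it can be unpacked to see whether every node failed for the same reason. The following is only a minimal sketch: `per_node_errors` is a hypothetical helper name, and the `task` dict is a hand-copied subset of the response above rather than a live API result.

```python
import json


def per_node_errors(task: dict) -> dict:
    """Return {node_id: message} parsed from a task's "error" field."""
    raw = task.get("error") or "{}"
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: treat the whole string as a single error message.
        return {"*": raw}
    return parsed if isinstance(parsed, dict) else {"*": raw}


# Subset of the task status shown above, copied by hand.
task = {
    "state": "FAILED",
    "error": (
        '{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable",'
        '"ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable",'
        '"QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}'
    ),
}

errors = per_node_errors(task)
distinct_messages = set(errors.values())
print(len(errors), "nodes failed with:", distinct_messages)
```

If all nodes report the same message, as they do here, the cause is more likely something shared by the whole cluster (such as its network setup) than a fault on one node.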

Configuration:
The all-MiniLM-L6-v2_torchscript_sentence-transformer.zip file was downloaded locally and registered successfully using opensearch_py_ml, but the model then showed a “not responding” status and deployment failed with the error “Network is unreachable”.

Is this because the model somehow has to reach a public internet URL during deployment? The OpenSearch cluster is hosted on an internal network and has no internet access.
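
One way to test the "no internet access" hypothesis is to try an outbound TCP connection from one of the ML nodes. This is only a sketch: `can_connect` is a hypothetical helper, and the host name `publish.djl.ai` in the usage note is an assumption about where DJL fetches its PyTorch native libraries, not something confirmed by the logs in this post.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try to open a TCP connection; return (success, error message or "ok")."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "ok"
    except OSError as exc:
        # Covers DNS failures, timeouts, refused connections and
        # "Network is unreachable" style errors.
        return False, str(exc)
```

Run on an ML node, a call such as `can_connect("publish.djl.ai")` (assumed host) returning `False` with a message like "Network is unreachable" would be consistent with the deployment error.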

Relevant Logs or Screenshots:
# ml_client is the opensearch_py_ml client used for registration (setup not shown)
model_path = "all-MiniLM-L6-v2_torchscript_sentence-transformer.zip"
model_config_path = "sentence-transformer.json"
model_id_file_system = ml_client.register_model(model_path, model_config_path, isVerbose=True)

[2024-06-20T14:57:08,809][ERROR][o.o.m.m.MLModelManager ] [] Failed to retrieve model Jg8BN5ABuctAWnb9HkNM
org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:135) ~[?:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$35(MLModelManager.java:804) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$41(MLModelManager.java:924) [opensearch-ml-2.10.0.0.jar:2.10.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[?:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[?:?]
at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:220) ~[?:?]
... 14 more
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
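
Reading the trace from the bottom up, the innermost cause is `java.net.SocketException: Network is unreachable`, raised under `ai.djl.pytorch.jni.LibUtils.downloadPyTorch`, which suggests the node was trying to download the PyTorch native library at deploy time. As a rough sketch, the "Caused by" chain can be pulled out of a long log like this; `cause_chain` is a hypothetical helper and `log` is a shortened, hand-copied excerpt of the trace above.

```python
import re

# Matches lines such as "Caused by: java.net.SocketException: Network is unreachable"
CAUSE = re.compile(r"^\s*Caused by:\s*([\w.$]+):\s*(.*)$")


def cause_chain(log_text: str) -> list:
    """Return [(exception_class, message), ...] in the order they appear."""
    chain = []
    for line in log_text.splitlines():
        match = CAUSE.match(line)
        if match:
            chain.append((match.group(1), match.group(2)))
    return chain


# Shortened excerpt of the stack trace above.
log = """org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:187) ~[?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
"""

chain = cause_chain(log)
for exc_class, message in chain:
    print(exc_class, "->", message)
```

The last entry in the chain is the root cause, which here is the network error rather than anything about the model file itself.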

Thanks in advance for the help!


I’m dealing with a problem very similar to yours.

The model was registered successfully from a locally downloaded file, but it couldn’t be deployed to the ML nodes (“not responding” status).

Did you find a solution? I think your issue is caused by a firewall blocking the network connection between your environment and the public internet.