Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
opensearch-2.10.0
Describe the issue:
When trying to deploy an ML model (all-MiniLM-L6-v2_torchscript_sentence-transformer.zip) to an OpenSearch cluster, the deployment failed with the following task status:
{
"model_id": "Jg8BN5ABuctAWnb9HkNM",
"task_type": "DEPLOY_MODEL",
"function_name": "TEXT_EMBEDDING",
"state": "FAILED",
"worker_node": [
"TINBM5k4SAmLgccLRqWh-g",
"ezXgFqZfQMuLWK7NuT38ww",
"QcU9XVpeSN27JRqo7qHlTw"
],
"create_time": 1718909826109,
"last_update_time": 1718909828811,
"error": """{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable","ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable","QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}""",
"is_async": true
}
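For reference, the "error" field in the task status is itself a JSON string that maps each worker node ID to its failure message. Below is a minimal sketch of how it can be unpacked to confirm that every worker node failed with the same message. The helper name errors_by_node is hypothetical, and the task dict is just the response above pasted in by hand (with function_name and the timestamps left out for brevity); nothing here calls the cluster.

```python
import json

# Task status copied from the response above, as a plain Python dict.
# The "error" value is a JSON string keyed by worker node ID.
task = {
    "model_id": "Jg8BN5ABuctAWnb9HkNM",
    "task_type": "DEPLOY_MODEL",
    "state": "FAILED",
    "worker_node": [
        "TINBM5k4SAmLgccLRqWh-g",
        "ezXgFqZfQMuLWK7NuT38ww",
        "QcU9XVpeSN27JRqo7qHlTw",
    ],
    "error": (
        '{"TINBM5k4SAmLgccLRqWh-g":"Network is unreachable",'
        '"ezXgFqZfQMuLWK7NuT38ww":"Network is unreachable",'
        '"QcU9XVpeSN27JRqo7qHlTw":"Network is unreachable"}'
    ),
}


def errors_by_node(task):
    """Hypothetical helper: return {node_id: message} from a task's 'error' field."""
    raw = task.get("error")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON (for example a plain message): keep the raw text.
        return {"<unparsed>": raw}
    if isinstance(parsed, dict):
        return {str(node): str(msg) for node, msg in parsed.items()}
    return {"<unparsed>": raw}


node_errors = errors_by_node(task)
# True when every listed worker node reported an error.
failed_everywhere = set(node_errors) == set(task["worker_node"])
# Distinct messages across nodes; a single entry means one common cause is likely.
distinct_messages = set(node_errors.values())

for node, msg in node_errors.items():
    print(f"{node}: {msg}")
print("all worker nodes failed:", failed_everywhere)
```

Here all three worker nodes report the same message, so the problem looks cluster-wide rather than specific to one node.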
Configuration:
The all-MiniLM-L6-v2_torchscript_sentence-transformer.zip file was downloaded to a local path and registered successfully using opensearch_py_ml. However, the model then shows a "not responding" status, and the error reported is "Network is unreachable".
Is this because the model has to reach a public internet URL during deployment? The OpenSearch cluster is hosted on an internal network and has no internet access.
Relevant Logs or Screenshots:
model_path = "all-MiniLM-L6-v2_torchscript_sentence-transformer.zip"
model_config_path = "sentence-transformer.json"
model_id_file_system = ml_client.register_model(model_path, model_config_path, isVerbose=True)
[2024-06-20T14:57:08,809][ERROR][o.o.m.m.MLModelManager ] [] Failed to retrieve model Jg8BN5ABuctAWnb9HkNM
org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?]
at java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:135) ~[?:?]
at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$deployModel$35(MLModelManager.java:804) ~[?:?]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$41(MLModelManager.java:924) [opensearch-ml-2.10.0.0.jar:2.10.0.0]
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.10.0.jar:2.10.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.10.0.jar:2.10.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.10.0.jar:2.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:286) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:89) ~[?:?]
at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:77) ~[?:?]
at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:40) ~[?:?]
at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:220) ~[?:?]
… 14 more
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
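To make the trace easier to read, here is a small sketch that pulls out the exception chain and the first stack frame under each exception. The helper name cause_chain and both regular expressions are my own, and the excerpt is a shortened copy of the log above. Read bottom-up, the result suggests the socket error is raised while ai.djl.pytorch.jni.LibUtils.downloadPyTorch is running, not while the model zip is being read, but that is only my interpretation of the trace.

```python
import re

# Shortened copy of the stack trace above (only the lines needed for the chain).
log_excerpt = """\
org.opensearch.ml.common.exception.MLException: Failed to deploy model Jg8BN5ABuctAWnb9HkNM
at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:275) ~[?:?]
Caused by: ai.djl.engine.EngineException: Failed to save pytorch index file
at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:403) ~[?:?]
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect0(Native Method) ~[?:?]
"""

# An exception header line, optionally prefixed with "Caused by: ".
CAUSE = re.compile(r"^(?:Caused by: )?(?P<exc>[\w.$]+(?:Exception|Error)): (?P<msg>.*)$")
# A stack frame line: "at package.Class.method(File.java:123)".
FRAME = re.compile(r"^\s*at (?P<frame>\S+)\(")


def cause_chain(text):
    """Hypothetical helper: list each exception with its message and first frame."""
    chain = []
    for line in text.splitlines():
        header = CAUSE.match(line.strip())
        if header:
            chain.append(
                {"exception": header["exc"], "message": header["msg"], "first_frame": None}
            )
            continue
        frame = FRAME.match(line)
        # Record only the first frame that follows each exception header.
        if frame and chain and chain[-1]["first_frame"] is None:
            chain[-1]["first_frame"] = frame["frame"]
    return chain


chain = cause_chain(log_excerpt)
for depth, entry in enumerate(chain):
    print("  " * depth + f"{entry['exception']}: {entry['message']} (at {entry['first_frame']})")

root_cause = chain[-1]  # the innermost "Caused by" entry
```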
Thanks in advance for the help!