Deploying custom sentence embedding model - onnxruntime not found

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch (docker-compose cluster running on macOS arm64)

{
    "distribution": "opensearch",
    "number": "2.12.0",
    "build_type": "tar",
    "build_hash": "2c355ce1a427e4a528778d4054436b5c4b756221",
    "build_date": "2024-02-20T02:20:12.084014282Z",
    "build_snapshot": false,
    "lucene_version": "9.9.2",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  }

Local Python 3.11.8 environment:

opensearch-py==2.4.2
opensearch-py-ml==1.1.0
onnx==1.15.0
onnxruntime==1.17.1
torch==2.2.1
sentence-transformers==2.5.1
transformers==4.38.2

Describe the issue:

When I attempt to deploy a custom ONNX model, OpenSearch throws an exception saying it cannot locate the onnxruntime native library:

Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.UnsatisfiedLinkError: no onnxruntime in java.library.path: :/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
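To make the error easier to read: the JVM looks in each directory of java.library.path for a platform-specific file name, and on Linux System.mapLibraryName turns `onnxruntime` into `libonnxruntime.so`. Below is a minimal sketch that lists the files the JVM would have looked for. The helper name `candidate_paths` is made up for illustration, and the Linux naming rule is assumed because the nodes run in Linux containers.

```python
import re
from pathlib import PurePosixPath


def candidate_paths(error_message: str) -> list[str]:
    """List the files a Linux JVM would look for, given an UnsatisfiedLinkError message."""
    match = re.search(r"no (\S+) in java\.library\.path: (\S+)", error_message)
    if match is None:
        raise ValueError("not a 'no <lib> in java.library.path' message")
    lib_name, search_path = match.groups()
    # On Linux, System.mapLibraryName("x") yields "libx.so".
    file_name = f"lib{lib_name}.so"
    # Empty entries (note the leading ':' in the logged path) are skipped here.
    return [str(PurePosixPath(d) / file_name) for d in search_path.split(":") if d]


msg = (
    "java.lang.UnsatisfiedLinkError: no onnxruntime in java.library.path: "
    ":/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib"
    ":/usr/lib64:/lib64:/lib:/usr/lib"
)
for path in candidate_paths(msg):
    print(path)
```

Passing the later `no onnxruntime4j_jni ...` message to the same function gives the corresponding `libonnxruntime4j_jni.so` candidates.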

Full description

I want to load the Hugging Face model microsoft/BiomedNLP-KRISSBERT-PubMed-UMLS-EL into OpenSearch for text embeddings.

I am running the following Python code to convert the model to ONNX and register it in OpenSearch. The model registers successfully, but the deployment step fails.

from opensearchpy import OpenSearch
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

# create_client() is a local helper (not shown here) that returns a configured OpenSearch client
os_client = create_client()
ml_client = MLCommonClient(os_client)

# manually created model group
model_group_id = "..."

# test registering a model
model_hf_id = "microsoft/BiomedNLP-KRISSBERT-PubMed-UMLS-EL"
folder_path = "./models"

embedding_model = SentenceTransformerModel(model_id=model_hf_id, folder_path=folder_path, overwrite=True)
model_path_onnx = embedding_model.save_as_onnx(model_id=model_hf_id)
model_config_path_onnx = embedding_model.make_model_config_json(model_format="ONNX")
model_id = ml_client.register_model(model_path_onnx, model_config_path_onnx, isVerbose=True, model_group_id=model_group_id)

Logs from the local Python code:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Creating folder models/onnx
Using framework PyTorch: 2.2.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
zip file is saved to  ./models/BiomedNLP-KRISSBERT-PubMed-UMLS-EL.zip 

No sentence-transformers model found with name microsoft/BiomedNLP-KRISSBERT-PubMed-UMLS-EL. Creating a new one with MEAN pooling.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ml-commons_model_config.json file is saved at :  ./models/ml-commons_model_config.json
Total number of chunks 44
Sha1 value of the model file:  95ebbcb89d0c9883749a266f123fbbd34b8a67ce6dd3bfeeea02681aa01b2be7
Model meta data was created successfully. Model Id:  6lXVHo4BOkNyivbcAu-Y
uploading chunk 1 of 44
Model id: {'status': 'Uploaded'}
uploading chunk 2 of 44
Model id: {'status': 'Uploaded'}
...
uploading chunk 44 of 44
Model id: {'status': 'Uploaded'}
Model registered successfully

...
  File "lib/python3.11/site-packages/opensearch_py_ml/ml_commons/ml_commons_client.py", line 157, in register_model
    self.deploy_model(model_id, wait_until_deployed=wait_until_deployed)
  File "lib/python3.11/site-packages/opensearch_py_ml/ml_commons/ml_commons_client.py", line 357, in deploy_model
    raise Exception("Model deployment failed")
Exception: Model deployment failed
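For anyone reproducing this, the Python exception carries no detail, so it can help to ask ML Commons for the model state directly. The sketch below uses the Get Model endpoint (`GET /_plugins/_ml/models/<model_id>`) through the opensearch-py transport. `get_model_state` is a hypothetical helper name, and `os_client` is assumed to be the client object from the snippet above.

```python
def get_model_state(os_client, model_id: str) -> str:
    """Return the model_state that ML Commons reports for a registered model."""
    response = os_client.transport.perform_request(
        "GET", f"/_plugins/_ml/models/{model_id}"
    )
    # Fall back to a placeholder if the field is missing from the response.
    return response.get("model_state", "UNKNOWN")


# Usage against a live cluster (commented out so the snippet runs without one):
# print(get_model_state(os_client, "6lXVHo4BOkNyivbcAu-Y"))
```

The node-side stack trace is still needed to see why a deployment failed; this only confirms the state the cluster recorded.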

The OpenSearch docker-compose logs show:

opensearch-node1       | [2024-03-08T16:11:30,813][INFO ][o.o.m.a.u.MLModelChunkUploader] [opensearch-node1] Index model successful for 6lXVHo4BOkNyivbcAu-Y for chunk number 44
opensearch-node1       | [2024-03-08T16:11:30,834][INFO ][o.o.m.a.d.TransportDeployModelAction] [opensearch-node1] Will deploy model on these nodes: 7CRPceCOT16B8-dO2sNqlQ
opensearch-ml1         | [2024-03-08T16:11:30,898][ERROR][o.o.m.m.MLModelManager   ] [opensearch-ml1] No controller is deployed because the model 6lXVHo4BOkNyivbcAu-Y is expected not having an enabled model controller. Please use the create model controller api to create one if this is unexpected.
opensearch-ml1         | [2024-03-08T16:11:34,830][ERROR][o.o.m.e.a.DLModel        ] [opensearch-ml1] Failed to deploy model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         | java.lang.NoClassDefFoundError: Could not initialize class ai.onnxruntime.OrtEnvironment$ThreadingOptions
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.<init>(OrtEngine.java:44) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.newInstance(OrtEngine.java:64) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngineProvider.getEngine(OrtEngineProvider.java:40) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[api-0.21.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) [?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1020) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-ml1         |        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-ml1         | Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.UnsatisfiedLinkError: no onnxruntime in java.library.path: :/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib [in thread "opensearch[opensearch-ml1][opensearch_ml_deploy][T#5]"]
opensearch-ml1         |        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.System.loadLibrary(System.java:2063) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:338) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:139) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        at ai.onnxruntime.OrtEnvironment$ThreadingOptions.<clinit>(OrtEnvironment.java:353) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        ... 20 more
opensearch-ml1         | [2024-03-08T16:11:34,869][ERROR][o.o.m.m.MLModelManager   ] [opensearch-ml1] Failed to retrieve model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         | org.opensearch.ml.common.exception.MLException: Failed to deploy model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:294) ~[?:?]
opensearch-ml1         |        at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1020) ~[?:?]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-ml1         |        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-ml1         | Caused by: java.lang.NoClassDefFoundError: Could not initialize class ai.onnxruntime.OrtEnvironment$ThreadingOptions
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.<init>(OrtEngine.java:44) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.newInstance(OrtEngine.java:64) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngineProvider.getEngine(OrtEngineProvider.java:40) ~[?:?]
opensearch-ml1         |        at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) ~[?:?]
opensearch-ml1         |        ... 14 more
opensearch-ml1         | Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.UnsatisfiedLinkError: no onnxruntime in java.library.path: :/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib [in thread "opensearch[opensearch-ml1][opensearch_ml_deploy][T#5]"]
opensearch-ml1         |        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.System.loadLibrary(System.java:2063) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:338) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:139) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OrtEnvironment$ThreadingOptions.<clinit>(OrtEnvironment.java:353) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.<init>(OrtEngine.java:44) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.newInstance(OrtEngine.java:64) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngineProvider.getEngine(OrtEngineProvider.java:40) ~[?:?]
opensearch-ml1         |        at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) ~[?:?]
opensearch-ml1         |        ... 14 more
opensearch-node1       | [2024-03-08T16:11:34,883][ERROR][o.o.m.a.f.TransportForwardAction] [opensearch-node1] deploy model failed on all nodes, model id: 6lXVHo4BOkNyivbcAu-Y
opensearch-node1       | [2024-03-08T16:11:34,883][INFO ][o.o.m.a.f.TransportForwardAction] [opensearch-node1] deploy model done with state: DEPLOY_FAILED, model id: 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         | [2024-03-08T16:11:34,884][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml1] deploy model task done 61XVHo4BOkNyivbceu-U

It appears that the necessary onnxruntime native libraries are not installed (or not properly configured) in the Docker images.

Attempted fix

I didn’t see any libonnxruntime files in the lib directories, so I copied an ONNX Runtime library (libonnxruntime.so.1.17.1) into /usr/share/opensearch/plugins/opensearch-knn/lib:

[opensearch@33bd679ce232 opensearch-knn]$ ls /usr/share/opensearch/plugins/opensearch-knn/lib
libgomp.so.1       libonnxruntime.so.1.17.1    libopensearchknn_faiss.so
libonnxruntime.so  libopensearchknn_common.so  libopensearchknn_nmslib.so

When I did this and tried to redeploy the model, I got a different error:

opensearch-ml1         | [2024-03-08T16:38:36,909][ERROR][o.o.m.e.a.DLModel        ] [opensearch-ml1] Failed to deploy model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         | java.lang.UnsatisfiedLinkError: no onnxruntime4j_jni in java.library.path: :/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
opensearch-ml1         |        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.System.loadLibrary(System.java:2063) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:338) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:140) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        at ai.onnxruntime.OrtEnvironment$ThreadingOptions.<clinit>(OrtEnvironment.java:353) ~[onnxruntime_gpu-1.14.0.jar:1.14.0]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.<init>(OrtEngine.java:44) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.newInstance(OrtEngine.java:64) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngineProvider.getEngine(OrtEngineProvider.java:40) ~[onnxruntime-engine-0.21.0.jar:?]
opensearch-ml1         |        at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[api-0.21.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) [?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) [opensearch-ml-algorithms-2.12.0.0.jar:?]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1020) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-ml1         |        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-ml1         | [2024-03-08T16:38:36,933][ERROR][o.o.m.m.MLModelManager   ] [opensearch-ml1] Failed to retrieve model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         | org.opensearch.ml.common.exception.MLException: Failed to deploy model 6lXVHo4BOkNyivbcAu-Y
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:294) ~[?:?]
opensearch-ml1         |        at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:247) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:139) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1020) ~[?:?]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) [opensearch-ml-2.12.0.0.jar:2.12.0.0]
opensearch-ml1         |        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-ml1         |        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-ml1         |        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-ml1         | Caused by: java.lang.UnsatisfiedLinkError: no onnxruntime4j_jni in java.library.path: :/usr/share/opensearch/plugins/opensearch-knn/lib:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
opensearch-ml1         |        at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916) ~[?:?]
opensearch-ml1         |        at java.base/java.lang.System.loadLibrary(System.java:2063) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.load(OnnxRuntime.java:338) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OnnxRuntime.init(OnnxRuntime.java:140) ~[?:?]
opensearch-ml1         |        at ai.onnxruntime.OrtEnvironment$ThreadingOptions.<clinit>(OrtEnvironment.java:353) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.<init>(OrtEngine.java:44) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngine.newInstance(OrtEngine.java:64) ~[?:?]
opensearch-ml1         |        at ai.djl.onnxruntime.engine.OrtEngineProvider.getEngine(OrtEngineProvider.java:40) ~[?:?]
opensearch-ml1         |        at ai.djl.engine.Engine.getEngine(Engine.java:187) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:185) ~[?:?]
opensearch-ml1         |        at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:280) ~[?:?]
opensearch-ml1         |        ... 14 more
opensearch-node1       | [2024-03-08T16:38:36,945][ERROR][o.o.m.a.f.TransportForwardAction] [opensearch-node1] deploy model failed on all nodes, model id: 6lXVHo4BOkNyivbcAu-Y
opensearch-node1       | [2024-03-08T16:38:36,945][INFO ][o.o.m.a.f.TransportForwardAction] [opensearch-node1] deploy model done with state: DEPLOY_FAILED, model id: 6lXVHo4BOkNyivbcAu-Y

Adding the libonnxruntime.so file to the lib directory seems to have fixed part of the issue, but now a different dependency is missing that I am not sure how to include: onnxruntime4j_jni. I can’t find it in the onnxruntime release. I think it has to do with the Java build/dependency (JNI = Java Native Interface), but I’m not a Java expert, so I’m stuck here.
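One way to narrow this down is to look inside the jar named in the stack trace (onnxruntime_gpu-1.14.0.jar) and see which platforms it ships native libraries for. The sketch below assumes the ONNX Runtime Java jar stores its natives under `ai/onnxruntime/native/<os-arch>/`. That prefix is my understanding of the jar layout, not something confirmed by the logs above, and `list_native_libs` is a made-up helper name.

```python
import zipfile
from collections import defaultdict

# Assumption: the jar keeps native libraries under this prefix,
# with one subdirectory per platform (for example linux-x64).
NATIVE_PREFIX = "ai/onnxruntime/native/"


def list_native_libs(jar_path: str) -> dict[str, list[str]]:
    """Map each platform directory in the jar to the native library files it contains."""
    found: dict[str, list[str]] = defaultdict(list)
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # Skip everything outside the native prefix, and directory entries.
            if not name.startswith(NATIVE_PREFIX) or name.endswith("/"):
                continue
            platform, _, file_name = name[len(NATIVE_PREFIX):].partition("/")
            if file_name:
                found[platform].append(file_name)
    return dict(found)


# Usage (the path is a placeholder for wherever the jar sits on the ML node):
# print(list_native_libs("/path/to/onnxruntime_gpu-1.14.0.jar"))
```

If the result has no entry matching the container's architecture (an arm64 host would typically run linux-aarch64 containers), that would be consistent with the JVM falling back to java.library.path and failing there.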

I don’t think this is a root-cause fix at all, but it shows that including the runtime .so file is on the right track: the onnxruntime is not configured properly.

Configuration:
The docker-compose cluster is set up with 2 search nodes and 1 ml node. I can provide my full docker-compose.yaml file if needed.

{
  "persistent": {
    "plugins": {
      "ml_commons": {
        "task_dispatch_policy": "round_robin",
        "monitoring_request_count": "100",
        "max_model_on_node": "20",
        "sync_up_job_interval_in_seconds": "3",
        "max_ml_task_per_node": "10",
        "only_run_on_ml_node": "true",
        "ml_task_timeout_in_seconds": "600",
        "model_access_control_enabled": "true",
        "native_memory_threshold": "100",
        "allow_registering_model_via_local_file": "true",
        "allow_registering_model_via_url": "true"
      },
      "index_state_management": {
        "template_migration": {
          "control": "-1"
        }
      }
    }
  },
  "transient": {}
}
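Most documentation refers to these settings by their dotted names (for example `plugins.ml_commons.only_run_on_ml_node`), so here is a small sketch that flattens the nested form above for easier comparison. `flatten_settings` is an illustrative helper, and only two of the settings are repeated in the example.

```python
def flatten_settings(nested: dict, prefix: str = "") -> dict[str, str]:
    """Convert nested cluster settings into dotted-key form."""
    flat: dict[str, str] = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested groups, carrying the dotted prefix along.
            flat.update(flatten_settings(value, dotted))
        else:
            flat[dotted] = value
    return flat


persistent = {
    "plugins": {
        "ml_commons": {
            "only_run_on_ml_node": "true",
            "native_memory_threshold": "100",
        }
    }
}
for name, value in flatten_settings(persistent).items():
    print(f"{name} = {value}")
```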

Relevant Logs or Screenshots:

Logs were posted in the description.