The machine learning plugin (ML Commons, OpenSearch 2.6.0) intermittently fails to load a model. Even when loading eventually succeeds, it is not immediate: the model state alternates between LOADING and LOAD_FAILED several times before it reaches LOADED. The log below is from an attempt that eventually succeeded; the errors from failed attempts follow it.
[opensearch] Will load model on these nodes: eHXE2qZYTW6y6djX5COymQ
[opensearch] Access denied during loading cudart library.
[opensearch] Downloading https://publish.djl.ai/pytorch/1.12.1/cpu-precxx11/linux-x86_64/native/lib/libgomp-a34b3233.so.1.gz ...
[opensearch] Downloading https://publish.djl.ai/pytorch/1.12.1/cpu-precxx11/linux-x86_64/native/lib/libc10.so.gz ...
[opensearch] Downloading https://publish.djl.ai/pytorch/1.12.1/cpu-precxx11/linux-x86_64/native/lib/libtorch_cpu.so.gz ...
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Running full sweep
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}
[opensearch] Downloading https://publish.djl.ai/pytorch/1.12.1/cpu-precxx11/linux-x86_64/native/lib/libtorch.so.gz ...
[opensearch] Downloading https://publish.djl.ai/pytorch/1.12.1/cpu-precxx11/linux-x86_64/native/lib/libstdc%2B%2B.so.6.gz ...
opensearch | OpenJDK 64-Bit Server VM warning: You have loaded library /usr/share/opensearch/data/djl/pytorch/1.12.1-cpu-precxx11-linux-x86_64/libtorch_cpu.so which might have disabled stack guard. The VM will try to fix the stack guard now.
opensearch | It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
[opensearch] Downloading jni https://publish.djl.ai/pytorch/1.12.1/jnilib/0.19.0/linux-x86_64/cpu-precxx11/libdjl_torch.so to cache ...
[opensearch] Number of inter-op threads is 1
[opensearch] Number of intra-op threads is 1
[opensearch] Extracting native/lib/linux-x86_64/libtokenizers.so to cache ...
[opensearch] Model q3Ms64YB6n-KyvL9Phcb is successfully loaded on 1 devices
[opensearch] load model done with state: LOADED, model id: q3Ms64YB6n-KyvL9Phcb
[opensearch] load model task done rHNc64YB6n-KyvL9YBev
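To make the retry pattern in a log like the one above easier to see, here is a minimal Python sketch that counts the "Refresh model state" lines per model and state. The function name `summarize_states` and the sample lines are only illustrative; the regular expression assumes the exact line format shown in this log and nothing more.

```python
import re
from collections import Counter

# Matches lines such as:
#   [opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}
STATE_LINE = re.compile(
    r"Refresh model state: \{(?P<model_id>[^=}]+)=(?P<state>[A-Z_]+)\}"
)


def summarize_states(log_lines):
    """Return {model_id: Counter(state -> count)} for the refresh lines found."""
    summary = {}
    for line in log_lines:
        match = STATE_LINE.search(line)
        if match:
            counts = summary.setdefault(match["model_id"], Counter())
            counts[match["state"]] += 1
    return summary


# A few lines copied from the log above, used only as sample input.
sample = [
    "[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}",
    "[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOADING}",
    "[opensearch] Running full sweep",
    "[opensearch] Refresh model state: {q3Ms64YB6n-KyvL9Phcb=LOAD_FAILED}",
]
print(summarize_states(sample))
```

Run against the full log above, this would report six LOADING and six LOAD_FAILED refreshes for model q3Ms64YB6n-KyvL9Phcb before the final LOADED state.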
Most of the time, however, it does not work at all. I get several different errors. One of them is the following status, returned for an upload task:
{'task_type': 'UPLOAD_MODEL',
'function_name': 'TEXT_EMBEDDING',
'state': 'FAILED',
'worker_node': ['5ZWIwnnLRCeduZOow3fBrQ'],
'create_time': 1679043105053,
'last_update_time': 1679043105133,
'error': 'Native Memory Circuit Breaker is open, please check your resources!',
'is_async': True}
This happens even though I believe I have allocated enough memory: the JVM heap is 4 GB, and only about 16% of it is in use.
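Since upload and load run as asynchronous tasks, a status like the one above is only visible after polling the task. Below is a minimal sketch of such a polling loop. `wait_for_task` and `fetch_task` are hypothetical names, and treating only COMPLETED and FAILED as terminal states is an assumption that should be checked against the ML Commons version in use.

```python
import time

# Assumed terminal task states; verify against your ML Commons version.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_task(fetch_task, task_id, interval_s=2.0, max_polls=30,
                  sleep=time.sleep):
    """Poll fetch_task(task_id) until the task reaches a terminal state.

    fetch_task is a hypothetical callable that returns the task document
    as a dict (for example the body of GET /_plugins/_ml/tasks/<task_id>).
    Raises TimeoutError if no terminal state is seen within max_polls.
    """
    task = {}
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task.get("state") in TERMINAL_STATES:
            return task
        sleep(interval_s)
    raise TimeoutError(
        f"task {task_id} still in state {task.get('state')!r} "
        f"after {max_polls} polls"
    )
```

With opensearch-py, `fetch_task` could be something like `lambda task_id: client.transport.perform_request("GET", f"/_plugins/_ml/tasks/{task_id}")`, assuming `client` is an already configured `OpenSearch` instance. Checking the returned `state` and `error` fields before calling the load API avoids trying to load a model whose upload never finished.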
More context:
GET _cat/nodes?v=true&h=name,node*,heap*
This returns:
name       id   node.role node.roles            heap.current heap.percent heap.max
opensearch 5ZWI dim       data,ingest,master,ml      661.6mb           16      4gb
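One thing worth checking: the heap columns above only describe the JVM heap. As far as I understand (this is an assumption to verify against the ML Commons source for your version), the native memory circuit breaker compares the operating system's overall memory usage with the `plugins.ml_commons.native_memory_threshold` setting, which I believe defaults to 90 percent. The sketch below takes the response body of `GET _nodes/stats/os` as a dict and lists the nodes whose `os.mem.used_percent` exceeds a threshold. The function name and the sample numbers are made up for illustration.

```python
# Assumed default of plugins.ml_commons.native_memory_threshold (percent).
DEFAULT_THRESHOLD = 90


def nodes_over_threshold(nodes_stats, threshold=DEFAULT_THRESHOLD):
    """Return (node_name, used_percent) for nodes above the threshold.

    nodes_stats is expected to have the shape of a GET _nodes/stats/os
    response: {"nodes": {<id>: {"name": ..., "os": {"mem": {...}}}}}.
    Nodes without an os.mem.used_percent value are skipped.
    """
    flagged = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        used = node.get("os", {}).get("mem", {}).get("used_percent")
        if used is not None and used > threshold:
            flagged.append((node.get("name", node_id), used))
    return flagged


# Made-up numbers, not taken from the cluster described in this post.
sample_stats = {
    "nodes": {
        "node-a": {"name": "opensearch", "os": {"mem": {"used_percent": 93}}}
    }
}
print(nodes_over_threshold(sample_stats))
```

A quicker manual check is to add the RAM columns to the request above, for example `GET _cat/nodes?v=true&h=name,heap.percent,ram.percent`. If `ram.percent` is close to or above the threshold while `heap.percent` is low, the container or host memory limit, rather than the heap size, would be the thing to look at.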
On another attempt, the upload fails with a different error:
Downloading: 100% |========================================| all-mpnet-base-v2.zip ] [opensearch]
opensearch | [2023-03-17T12:21:27,008][ERROR][o.o.m.m.MLModelManager ] [opensearch] Failed to index chunk file
opensearch | java.security.PrivilegedActionException: null
opensearch | at java.security.AccessController.doPrivileged(AccessController.java:573) ~[?:?]
opensearch | at org.opensearch.ml.engine.ModelHelper.downloadAndSplit(ModelHelper.java:147) [opensearch-ml-algorithms-2.6.0.0.jar:?]
opensearch | at org.opensearch.ml.model.MLModelManager.uploadModel(MLModelManager.java:268) [opensearch-ml-2.6.0.0.jar:2.6.0.0]
opensearch | at org.opensearch.ml.model.MLModelManager.lambda$uploadModelFromUrl$3(MLModelManager.java:241) [opensearch-ml-2.6.0.0.jar:2.6.0.0]
opensearch | at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [opensearch-2.6.0.jar:2.6.0]
opensearch | at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.6.0.jar:2.6.0]
opensearch | at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.6.0.jar:2.6.0]
opensearch | at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.6.0.jar:2.6.0]
opensearch | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
opensearch | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
opensearch | at java.lang.Thread.run(Thread.java:833) [?:?]
opensearch | Caused by: java.nio.file.NoSuchFileException: /usr/share/opensearch/data/djl/models_cache/upload/emt474YB_2fbBQq1ySIS/1.0.0/huggingface/sentence-transformers/all-mpnet-base-v2.zip
opensearch | at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
opensearch | at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
opensearch | at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
opensearch | at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
opensearch | at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148) ~[?:?]
opensearch | at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) ~[?:?]
opensearch | at java.nio.file.Files.readAttributes(Files.java:1851) ~[?:?]
opensearch | at java.util.zip.ZipFile$Source.get(ZipFile.java:1264) ~[?:?]
opensearch | at java.util.zip.ZipFile$CleanableResource.<init>(ZipFile.java:709) ~[?:?]
opensearch | at java.util.zip.ZipFile.<init>(ZipFile.java:243) ~[?:?]
opensearch | at java.util.zip.ZipFile.<init>(ZipFile.java:172) ~[?:?]
opensearch | at java.util.zip.ZipFile.<init>(ZipFile.java:143) ~[?:?]
opensearch | at org.opensearch.ml.engine.ModelHelper.verifyModelZipFile(ModelHelper.java:174) ~[?:?]
opensearch | at org.opensearch.ml.engine.ModelHelper.lambda$downloadAndSplit$2(ModelHelper.java:154) ~[?:?]
opensearch | at java.security.AccessController.doPrivileged(AccessController.java:569) ~[?:?]
opensearch | ... 10 more
How might these issues be addressed?