OpenSearch 2.9 ML Framework Model Upload Not Working

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OS 2.9/Windows 11 & Rocky Linux 8/Chromium Edge

Describe the issue:

Can we have a fully working example on 2.9?

I tried the steps in ML framework - OpenSearch documentation and executed the request below:

POST /_plugins/_ml/models/_upload
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

It has never worked for me; I get the error below in the log. In addition, when I load a model it sometimes disappears, and when I try to load it again it says that the same ID already exists!
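For reference, to see which models the cluster thinks are registered I can list them with the model search API (a minimal sketch; the match_all query and size are just illustrative):

POST /_plugins/_ml/models/_search
{
  "query": {
    "match_all": {}
  },
  "size": 100
}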

I appreciate your input if you got it working. By the way, it was working in previous versions (2.4 in particular).

Failed to index chunk file
java.security.PrivilegedActionException: null
at java.security.AccessController.doPrivileged(AccessController.java:573) ~[?:?]
at org.opensearch.ml.engine.ModelHelper.downloadAndSplit(ModelHelper.java:197) [opensearch-ml-algorithms-2.9.0.0.jar:?]
at org.opensearch.ml.model.MLModelManager.registerModel(MLModelManager.java:526) [opensearch-ml-2.9.0.0.jar:2.9.0.0]
at org.opensearch.ml.model.MLModelManager.lambda$registerModelFromUrl$19(MLModelManager.java:498) [opensearch-ml-2.9.0.0.jar:2.9.0.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.nio.file.NoSuchFileException: /opensearch/opensearch-2.9.0/data/ml_cache/models_cache/register/s1Y1jYkBxklcJkLMTjwU/1/all-MiniLM-L6-v2.zip

I am also using the below configs in addition to the defaults.
plugins.ml_commons.only_run_on_ml_node: false
plugins.ml_commons.allow_registering_model_via_url: true
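For what it's worth, I believe the same settings can also be applied dynamically through the cluster settings API instead of opensearch.yml, e.g.:

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.only_run_on_ml_node": false,
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}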

Can you please try:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

We support a few pre-trained models: Pretrained models - OpenSearch documentation

Note that we renamed the API endpoint to _register (_upload is still supported for a while).

The updated API endpoints are listed in the documentation.

Thanks.

@dhrubo I am trying this in OS 2.8. Any help on this please?
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-tramsformers/all-MiniLM-L12-v2",
  "model_group_id": "Mmtzj4kBUMHslIHqUQaR",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
It gives me a task ID as the response, and when I use that task ID in a GET it says the connection timed out. Any reason why?
GET /_plugins/_ml/tasks/sBHHjokBS-NchEYQO9Nz
{
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": [
    "sNgMP15nRkOtp8ZFFwAg1g"
  ],
  "create_time": 1690332460326,
  "last_update_time": 1690332588709,
  "error": "Connection timed out (Connection timed out)",
  "is_async": true
}

When I search the logs I see this on the data nodes:
[2023-07-26T00:08:47,306][ERROR][o.o.m.m.MLModelManager ] [opensearch-data-1] Failed to update model group
java.security.PrivilegedActionException: null
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at org.opensearch.ml.engine.ModelHelper.downloadPrebuiltModelMetaList(ModelHelper.java:166) ~[?:?]
at org.opensearch.ml.model.MLModelManager.registerPrebuiltModel(MLModelManager.java:510) ~[?:?]
at org.opensearch.ml.model.MLModelManager.uploadModel(MLModelManager.java:357) ~[?:?]
at org.opensearch.ml.model.MLModelManager.lambda$registerMLModel$11(MLModelManager.java:324) ~[?:?]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.8.0.jar:2.8.0]
at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) ~[opensearch-2.8.0.jar:2.8.0]
at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) ~[opensearch-2.8.0.jar:2.8.0]
at com.oracle.pic.opensearch.actions.MonitoringActionFilter.lambda$apply$0(MonitoringActionFilter.java:68) ~[?:?]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.8.0.jar:2.8.0]

Can you check whether the model group ID is valid? I also suggest trying OS 2.9, the latest released version.
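One way to check (a sketch, assuming the model group search API is available in your version) is to search for that model group ID:

POST /_plugins/_ml/model_groups/_search
{
  "query": {
    "ids": {
      "values": ["Mmtzj4kBUMHslIHqUQaR"]
    }
  }
}

If no document comes back, the model group ID is not valid on that cluster.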

Thanks @dhrubo, I can now register a model from a URL. However, I keep getting errors when registering my custom model from a local directory.

POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2-128",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 128,
    "framework_type": "sentence_transformers"
  },
  "url": "file://home/hasan/all-MiniLM-L6-v2-128.zip"
}

Failed to index chunk file
java.security.AccessControlException: access denied (“java.io.FilePermission” “/home/myuser/all-MiniLM-L6-v2-128.zip” “read”)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:485) ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:1068) ~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:416) ~[?:?]
at java.lang.SecurityManager.checkRead(SecurityManager.java:756) ~[?:?]
at java.io.File.isDirectory(File.java:860) ~[?:?]
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:78) ~[?:?]
at sun.net.www.protocol.file.FileURLConnection.initializeHeaders(FileURLConnection.java:106) ~[?:?]
at sun.net.www.protocol.file.FileURLConnection.getContentLengthLong(FileURLConnection.java:164) ~[?:?]
at ai.djl.training.util.DownloadUtils.download(DownloadUtils.java:73) ~[api-0.21.0.jar:?]
at ai.djl.training.util.DownloadUtils.download(DownloadUtils.java:52) ~[api-0.21.0.jar:?]
at org.opensearch.ml.engine.ModelHelper.lambda$downloadAndSplit$3(ModelHelper.java:203) ~[opensearch-ml-algorithms-2.9.0.0.jar:?]
at java.security.AccessController.doPrivileged(AccessController.java:56

From the log it looks like Java security settings are denying read access to the file /home/myuser/all-MiniLM-L6-v2-128.zip.

Java has a built-in security manager that, when enabled, can control actions like file I/O. In your case, it seems the security manager is denying read access to your file.

Can you please ensure that the file or directory has the correct permissions at the operating-system level? For example, on a Unix-like operating system you could use the chmod command to give read permission to all users:

chmod a+r /home/myuser/all-MiniLM-L6-v2-128.zip

You can also check out this notebook: Demo Notebook for MLCommons Integration — Opensearch-py-ml 1.0.0 documentation

The OpenSearch security manager doesn't allow reading this file: "url": "file://home/hasan/all-MiniLM-L6-v2-128.zip".

You can try uploading your model somewhere like GitHub or S3, then use that public URL to register the model.
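For example, something like this (the URL below is only a placeholder for wherever you host the zip; the model_config mirrors what you posted above, and plugins.ml_commons.allow_registering_model_via_url must be true):

POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2-128",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 128,
    "framework_type": "sentence_transformers"
  },
  "url": "https://example.com/models/all-MiniLM-L6-v2-128.zip"
}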

If you need to register a model file from your local machine, you can follow the demo notebook shared by @dhrubo: Demo Notebook for MLCommons Integration — Opensearch-py-ml 1.0.0 documentation

I noticed that OpenSearch downloads some djl packages after registering and loading a huggingface model. Is there a way to run completely offline? That is, have all required library files and models made available offline and placed in the correct folder?

@ylwu I followed the Python guide but I got the error below. It looks like opensearch-py-ml is not up to date on PyPI. I even removed and reinstalled it and got the same error.

ml_client.register_model(model_path, model_config_path_torch, isVerbose=True)

AttributeError: 'MLCommonClient' object has no attribute 'register_model'

I noticed that OpenSearch downloads some djl packages after registering and loading a huggingface model. Is there a way to run completely offline? That is, have all required library files and models made available offline and placed in the correct folder?

For now, no. To support different platforms, we have to dynamically download the library files. If this is a must-have, you can create a GitHub issue on the ml-commons repo: Issues · opensearch-project/ml-commons · GitHub

AttributeError: 'MLCommonClient' object has no attribute 'register_model'

@dhrubo Can you help take a look?

To unblock you, can you try a pretrained model? Pretrained models - OpenSearch documentation

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

Thanks, I will try to log an issue. But for now, can you point me to the location where these files get downloaded? My environment is Linux (CentOS/RedHat/Rocky Linux 8).

Hi, we just released a new version (1.1.0) of opensearch-py-ml. Can you please update it and give it a try?

Thanks
Dhrubo

You should see them in the <OS_HOME>/data/ml_cache folder.

But as far as I understand, ml_cache is a temporary location that can be cleared automatically. My point is, I want to make sure that subsequent calls to deploy the ML model will always be offline. Is that possible?

Only the first call downloads the library; it won't be deleted from the cache folder (unless you delete it manually), so subsequent calls will work offline.

The API is quite confusing! Now it is not working anymore. I cleaned the data directory, started again with the config below, and then executed the POSTs below.

######## End OpenSearch Security Demo Configuration ########
plugins.ml_commons.only_run_on_ml_node: false
plugins.ml_commons.allow_registering_model_via_local_file: true
plugins.ml_commons.allow_registering_model_via_url: true
node.roles: [cluster_manager, data, ingest, ml ]

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

It generates model ID ghP9kYkB_Tms6nnSMlbo.

POST /_plugins/_ml/models/ghP9kYkB_Tms6nnSMlbo/_deploy

Then, after deploying it, I get a failed deployment. Any idea how to fix it? Currently the only way it works for me is through the Python client! However, I have to do the conversions myself if I am going after Hugging Face models.

Downloading: 100% |========================================| all-MiniLM-L12-v2.zip
[2023-07-31T10:29:32,655][ERROR][o.o.m.m.MLModelManager ] [LAPTOP-C0J5I3L2] Failed to index chunk file
java.security.PrivilegedActionException: null
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at org.opensearch.ml.engine.ModelHelper.downloadAndSplit(ModelHelper.java:197) [opensearch-ml-algorithms-2.9.0.0.jar:?]
at org.opensearch.ml.model.MLModelManager.registerModel(MLModelManager.java:526) [opensearch-ml-2.9.0.0.jar:2.9.0.0]
at org.opensearch.ml.model.MLModelManager.lambda$registerModelFromUrl$19(MLModelManager.java:498) [opensearch-ml-2.9.0.0.jar:2.9.0.0]
at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.9.0.jar:2.9.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.9.0.jar:2.9.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.nio.file.NoSuchFileException: D:\my-files\opensearch-releases\opensearch-2.9.0\data\ml_cache\models_cache\register\NcvUqokBdU6kBA_MJEKP\1\huggingface\sentence-transformers\all-MiniLM-L12-v2.zip

I tried this on 2.9, and it works:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

This API call will return a task ID. Sample response:

{
  "task_id" : "lcPmr4kB4eSCtCCDmCD8", 
  "status" : "CREATED"
}

Then use the get task API to retrieve the task information and find the model ID in the response:

GET _plugins/_ml/tasks/lcPmr4kB4eSCtCCDmCD8

Sample response

{
  "model_id": "fwDmr4kBotjijrTimerP",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "yYG_ae4URcaSo3k59_IsFQ"
  ],
  "create_time": 1690873272569,
  "last_update_time": 1690873280963,
  "is_async": true
}

Then use the deploy API to deploy model fwDmr4kBotjijrTimerP:

POST _plugins/_ml/models/fwDmr4kBotjijrTimerP/_deploy

This API will return a task ID. Similarly, use the get task API to retrieve the task info and wait for the task status to change to COMPLETED. Then call the predict API:

POST _plugins/_ml/models/fwDmr4kBotjijrTimerP/_predict
{
  "text_docs": ["hello world"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}
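If the model is deployed correctly, the predict response should look roughly like this (a sample of the shape only, data values omitted; all-MiniLM-L12-v2 produces 384-dimension embeddings):

{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [384],
          "data": [ ... ]
        }
      ]
    }
  ]
}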

One possible reason is that your network latency is too high and the ml-commons cron job (which runs every 10 seconds by default) cleaned up your local cache file before it was fully downloaded. I suggest tuning the cron job interval to a larger value; for example, the API call below increases it to 600 seconds.

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.sync_up_job_interval_in_seconds": 600
  }
}
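To confirm the new value took effect, you can read back the persistent cluster settings:

GET /_cluster/settings?flat_settings=true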
