I’m trying to deploy the pretrained model amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 on AWS OpenSearch to use it for neural sparse search, but it doesn’t seem to work.
The full request:
POST /_plugins/_ml/models/_register
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1",
  "version": "1.0.0",
  "model_group_id": "<group-id>",
  "description": "This is a neural sparse encoding model: It transfers text into sparse vector, and then extract nonzero index and value to entry and weights. It serves only in ingestion and customer should use tokenizer model in query.",
  "model_format": "TORCH_SCRIPT",
  "function_name": "SPARSE_ENCODING",
  "model_content_hash_value": "9a41adb6c13cf49a7e3eff91aef62ed5035487a6eca99c996156d25be2800a9a",
  "url": "https://artifacts.opensearch.org/models/ml-models/amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1/1.0.0/torch_script/opensearch-neural-sparse-encoding-doc-v1-1.0.0-torch_script.zip"
}
Which causes OpenSearch to return the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "To upload custom model user needs to enable allow_registering_model_via_url settings. Otherwise please use opensearch pre-trained models."
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "To upload custom model user needs to enable allow_registering_model_via_url settings. Otherwise please use opensearch pre-trained models."
  },
  "status": 400
}
When I try to set allow_registering_model_via_url to true, I get the following error instead:
Full request:
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": true
  }
}
Error:
{
  "Message": "Your request: '/_cluster/settings' payload is not allowed."
}
I suspect AWS doesn’t allow updating that setting.
Hi @grunt-solaces.0h, thanks for your interest in neural sparse. In AOS, we cannot register a model via URL, and in 2.11 you have to deploy the model on SageMaker. See CloudFormation Integration in AOS here. Although we haven’t updated the docs yet, integration with the sparse model is supported there.
I deployed the model via SageMaker and now it works.
However, the throughput of the ingestion pipeline is quite low.
I’m using the following setup:
AOS cluster: single node of r6g.xlarge.search, w/ 200GB of General Purpose (SSD) - gp3
SageMaker setup: amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 deployed on 1 instance of ml.r5.4xlarge
The documents are pushed in bulk, and I’ve set up an ingestion pipeline which uses the model from SageMaker
the index’s refresh_interval is set to 60s
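For context, a minimal sketch of such an ingest pipeline, using the sparse_encoding processor from the neural-search plugin (the pipeline name, field names, and model ID below are placeholders, not the exact values from this setup):

PUT /_ingest/pipeline/neural-sparse-pipeline
{
  "description": "Generate sparse vectors at ingest time via the deployed sparse encoding model",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model-id>",
        "field_map": {
          "body_text": "body_sparse_embedding"
        }
      }
    }
  ]
}

Bulk requests against an index with this default pipeline then trigger one inference call per document to the model behind <model-id>.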
From what I can tell, the docs are processed sequentially. This seems to be confirmed by the SageMaker metrics: CPU stays at ~1K% and memory at ~25% regardless of how many documents I include in the bulk request.
@xinyual Is CloudFormation Integration the only way to access opensearch-neural-sparse-encoding-v1 in AWS OpenSearch? Is there any way in SageMaker to register the model directly by its URL or is the use of CloudFormation a requirement here?
About CloudFormation: should I expect to see neural-sparse-encoding as an available option when following the steps under Amazon SageMaker Template Integration? Any additional detail on the setup process for sparse encodings in OpenSearch Service would be appreciated. I ran into the exact same problem and questions described in @grunt-solaces.0h’s first post and was not able to find answers in the docs.
@xinyual @dhrubo we’ve made some progress on the indexing performance by using a machine with more CPUs for AOS.
The reasoning behind this: it seems the opensearch_ml_predict thread pool handles the calls to SageMaker, and its size appears to be based on the number of CPUs of the AOS instance.
Is there a way to increase the number of threads for this thread pool via settings? Or is there another way to increase the parallelism of the ingestion pipeline? It would be ideal for us since at the moment the ingestion pipeline seems to be the bottleneck and the SageMaker endpoint underutilized.
Hi @rs-search, you can deploy the sparse model via SageMaker and remote inference. See here to use a remote model. But you can’t register a model via URL. I think the Neural sparse button is already available in the AOS integration.
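For the remote-inference route, the rough shape of a SageMaker connector follows the ML Commons connector blueprint. The sketch below uses made-up region, role, and endpoint names, and the exact request_body template depends on your endpoint’s input contract:

POST /_plugins/_ml/connectors/_create
{
  "name": "sagemaker-sparse-encoding",
  "description": "Connector to a SageMaker endpoint hosting the sparse encoder",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-east-1",
    "service_name": "sagemaker"
  },
  "credential": {
    "roleArn": "arn:aws:iam::<account-id>:role/<sagemaker-invoke-role>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "headers": { "content-type": "application/json" },
      "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint-name>/invocations",
      "request_body": "${parameters.input}"
    }
  ]
}

The connector ID returned here is then used to register and deploy a remote model, which can back the ingest pipeline in place of a locally hosted one.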
Hi @grunt-solaces.0h Unfortunately, increasing the thread number doesn’t influence inference speed. Please check the SageMaker endpoint’s CPU utilization to see whether it is fully used. For better ingestion throughput, we recommend using a GPU instance such as g4dn, g5, or p3.
We were wondering about the thread number for the following reasons:
We switched to a larger machine with more CPUs, and as a result the ingestion pipeline seems to send more documents for inference in parallel
At first we used a CPU instance for inference and it was underutilised: it was processing only X documents at a time, where X seemed to be the number of CPUs in the OS instance
We then switched to a GPU instance for inference, which is again underutilised. It does the inference faster per document, but the bottleneck still seems to be the ingestion pipeline, which doesn’t send as many documents to the SageMaker instance as it is capable of handling
Hi @grunt-solaces.0h, the predict thread pool was originally designed for the local CPU model deployment case. In the remote case, there’s a bottleneck in the HTTP client: e.g. if you only have one instance sending requests to the remote model, the remote model instance can be underutilized. We have identified another performance issue in the HTTP client and are working on fixing it by letting users configure the MAX_CONNECTION count of the HTTP client. That said, increasing the predict thread pool size can also improve performance in your case, although you might then hit the HTTP client bottleneck. It’s worth a try: you can change the node.processors value in the opensearch.yml file; the predict thread pool size is calculated as num(node.processors) * 2.
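To see whether that pool is actually the bottleneck, the standard _cat thread pool API can show its size, active threads, and queue (assuming the pool name opensearch_ml_predict mentioned earlier in this thread):

GET /_cat/thread_pool/opensearch_ml_predict?v&h=node_name,name,size,active,queue,rejected

A consistently full active count with a growing queue or rejections would point at the thread pool; low activity with an idle SageMaker endpoint would point at the HTTP client instead.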
Will MAX_CONNECTION be configurable via PUT _cluster/settings?
We can’t set the node.processors value as we use OpenSearch deployed in Amazon and we don’t have access to that config so we’ll have to wait for the httpclient fix.
Hi @darvel, sorry for the late reply. Which version of OS do you use? If you use Amazon OpenSearch in AWS, you can find the integration in the UI and it will help you deploy the sparse model on SageMaker. For the open-source version, please make sure your sparse endpoint takes a list of strings as input and outputs a list of <String, float> dictionaries.
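To illustrate that contract, a sketch of the expected request and response shapes for a self-managed endpoint (the endpoint URL, tokens, and weights below are invented examples, not real model output):

POST https://runtime.sagemaker.<region>.amazonaws.com/endpoints/<endpoint-name>/invocations
["OpenSearch is a search engine", "neural sparse search"]

The endpoint should return one sparse vector per input string, each a token-to-weight map, e.g.:

[
  { "opensearch": 1.32, "search": 0.97, "engine": 0.85 },
  { "neural": 1.11, "sparse": 1.04, "search": 0.92 }
]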