How to register sparse encoding model in AWS OpenSearch

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

AWS OpenSearch 2.11

Describe the issue:

I’m trying to deploy the pretrained model amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 on AWS OpenSearch to use it for Neural Sparse Search, but it doesn’t seem to work.

The full request:

POST /_plugins/_ml/models/_register
{
    "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1",
    "version": "1.0.0",
    "model_group_id": "<group-id>",
    "description": "This is a neural sparse encoding model: It transfers text into sparse vector, and then extract nonzero index and value to entry and weights. It serves only in ingestion and customer should use tokenizer model in query.",
    "model_format": "TORCH_SCRIPT",
    "function_name": "SPARSE_ENCODING",
    "model_content_hash_value": "9a41adb6c13cf49a7e3eff91aef62ed5035487a6eca99c996156d25be2800a9a",
    "url":  "https://artifacts.opensearch.org/models/ml-models/amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1/1.0.0/torch_script/opensearch-neural-sparse-encoding-doc-v1-1.0.0-torch_script.zip"
}

Which causes OpenSearch to return the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "To upload custom model user needs to enable allow_registering_model_via_url settings. Otherwise please use opensearch pre-trained models."
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "To upload custom model user needs to enable allow_registering_model_via_url settings. Otherwise please use opensearch pre-trained models."
  },
  "status": 400
}

When I try to set allow_registering_model_via_url to true, I get the following error:

Full request:

PUT /_cluster/settings
{
    "persistent": {
        "plugins.ml_commons.allow_registering_model_via_url": true
    }
}

Error:

{
  "Message": "Your request: '/_cluster/settings' payload is not allowed."
}

I suspect AWS doesn’t allow updating that setting.

Any ideas on how the model can be deployed?

Thank you!

@grunt-solaces.0h, yes, the managed service doesn’t allow this setting yet.

@xinyual could you please take a look at this issue?

Thanks
Dhrubo

Hi @grunt-solaces.0h, thanks for your interest in neural sparse. In AOS, we cannot register a model via URL, and in 2.11 you have to deploy the model on SageMaker. See CloudFormation Integration in AOS here. Although we haven’t updated the doc, integration with the sparse model is supported there.

Thanks @dhrubo and @xinyual for taking the time to respond!

I deployed the model via SageMaker and now it works.

However, the throughput of the ingestion pipeline is quite low.

I’m using the following setup:

  • AOS cluster: single node of r6g.xlarge.search, w/ 200GB of General Purpose (SSD) - gp3
  • SageMaker setup: amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 deployed on 1 instance of ml.r5.4xlarge
  • The documents are pushed in bulk and I’ve set up an ingestion pipeline which uses the model from SageMaker
  • the index’s refresh_interval is set to 60s
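For readers reproducing this setup, here is a minimal sketch of what the ingestion pipeline body might look like, built in Python. It uses the sparse_encoding ingest processor with its model_id and field_map options. The field names passage_text and passage_embedding, the description string, and the <model-id> placeholder are assumptions for illustration and are not taken from the original post.

```python
import json


def build_sparse_ingest_pipeline(model_id, text_field, embedding_field):
    """Return a body for PUT /_ingest/pipeline/<pipeline-name>.

    text_field and embedding_field are hypothetical names; use the
    fields from your own index mapping.
    """
    return {
        "description": "Sparse-encode documents at ingestion time",
        "processors": [
            {
                "sparse_encoding": {
                    # ID of the model registered in ML Commons
                    # (here, the remote model backed by SageMaker).
                    "model_id": model_id,
                    # Maps the source text field to the field that
                    # will receive the token/weight pairs.
                    "field_map": {text_field: embedding_field},
                }
            }
        ],
    }


if __name__ == "__main__":
    body = build_sparse_ingest_pipeline(
        "<model-id>", "passage_text", "passage_embedding"
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the pipeline request in Dev Tools.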

From what I can tell, the docs are processed sequentially. This seems to be confirmed by the SageMaker metrics: CPU stays at ~1K% and memory at ~25% regardless of how many documents I include in the bulk request.

What could be done to speed it up?

Thank you!

@xinyual Is CloudFormation Integration the only way to access opensearch-neural-sparse-encoding-v1 in AWS OpenSearch? Is there any way in SageMaker to register the model directly by its URL or is the use of CloudFormation a requirement here?

About CloudFormation, should I expect to see neural-sparse-encoding as an available option when following the steps under Amazon SageMaker Template Integration? Any additional detail on the setup process for sparse encodings in OpenSearch service would be appreciated. I ran into the exact same problem and questions described in @grunt-solaces.0h 's first post and was not able to find answers in docs.

@xinyual @dhrubo we’ve made some progress on the indexing performance by using a machine with more CPUs for AOS.

The reasoning: it seems that the thread pool opensearch_ml_predict is used for handling the calls to SageMaker, and that it is sized based on the number of CPUs the AOS instance has.

Is there a way to increase the number of threads for this thread pool via settings? Or is there another way to increase the parallelism of the ingestion pipeline? That would be ideal for us, since at the moment the ingestion pipeline seems to be the bottleneck and the SageMaker endpoint is underutilized.

Thank you!

Hi @rs-search, we can deploy the sparse model via SageMaker and remote inference. See here for how to use a remote model. But we cannot register a model via URL. I think the Neural Sparse button is already available in the AOS integration.
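As a rough illustration of the remote-inference route mentioned above, the sketch below builds the two request bodies involved: one for POST /_plugins/_ml/connectors/_create and one for POST /_plugins/_ml/models/_register with function_name set to remote. This is a sketch under assumptions, not the exact configuration the integration template generates: the role ARN, region, endpoint name, connector name, and model name are placeholders, and the request_body template is deliberately left as a caller-supplied argument because the right template depends on the connector blueprint you follow.

```python
def build_sagemaker_connector(role_arn, region, endpoint_name, request_body_template):
    """Body for POST /_plugins/_ml/connectors/_create (all values are placeholders)."""
    url = (
        "https://runtime.sagemaker.%s.amazonaws.com/endpoints/%s/invocations"
        % (region, endpoint_name)
    )
    return {
        "name": "sagemaker-sparse-encoding",  # hypothetical name
        "description": "Connector to a SageMaker sparse encoding endpoint",
        "version": 1,
        "protocol": "aws_sigv4",
        # On the managed service the connector assumes an IAM role.
        "credential": {"roleArn": role_arn},
        "parameters": {"region": region, "service_name": "sagemaker"},
        "actions": [
            {
                "action_type": "predict",
                "method": "POST",
                "headers": {"content-type": "application/json"},
                "url": url,
                # Assumption: take this template from the official blueprint.
                "request_body": request_body_template,
            }
        ],
    }


def build_remote_model_registration(connector_id, model_group_id):
    """Body for POST /_plugins/_ml/models/_register for a remote model."""
    return {
        "name": "sparse-encoding-remote",  # hypothetical name
        "function_name": "remote",
        "model_group_id": model_group_id,
        "description": "Sparse encoding model served by SageMaker",
        "connector_id": connector_id,
    }
```

Unlike the URL-based registration at the top of the thread, the remote registration carries no url or model_content_hash_value. It only points at the connector, so the allow_registering_model_via_url setting is not involved.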

Hi @grunt-solaces.0h, unfortunately, increasing the thread count doesn’t affect inference speed. Please check the SageMaker endpoint’s CPU utilization to see whether it is fully used. For better ingestion throughput, we recommend a GPU instance such as g4dn, g5, or p3.

Hi @xinyual,

Thanks for the answer!

We were wondering about the thread number for the following reasons:

  1. We switched to a larger machine, with more CPUs and as a result it seems the ingestion pipeline is sending more documents for inference, in parallel
  2. At first we used a CPU instance for inference and it was being underutilised as it was processing only X documents, where X seemed to be the number of CPUs in the OS instance
  3. We switched to a GPU instance for inference which is again underutilised. Indeed, it does the inference faster per document, but it seems like the bottleneck is still the ingestion pipeline which doesn’t send as many documents to the SageMaker instance as it is capable of handling

Hi @grunt-solaces.0h, the predict thread pool was originally designed for the case where a model is deployed locally on CPU. In the remote case, there is a bottleneck in the httpclient: for example, if you only have one instance sending requests to the remote model, the remote model instance can be underutilized. We have identified another performance issue in the httpclient and are working on a fix that lets users configure the MAX_CONNECTION count. That said, I think increasing the predict thread pool size can also improve performance in your case, although afterwards you might hit the httpclient bottleneck. It’s worth a try: you can change the node.processors value in the opensearch.yml file. The predict thread pool size is calculated as num(node.processors) * 2.
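To make the formula above concrete, here is a small worked example in Python. It takes the quoted formula at face value and assumes node.processors equals the number of vCPUs on the node when it is not set explicitly. The vCPU counts used below (4 and 16) are illustrative inputs, not measurements from this cluster.

```python
def predict_thread_pool_size(node_processors):
    """Predict thread pool size per the formula quoted above:
    num(node.processors) * 2."""
    if node_processors < 1:
        raise ValueError("node_processors must be at least 1")
    return node_processors * 2


if __name__ == "__main__":
    # Illustrative vCPU counts; check your own instance type.
    for vcpus in (4, 16):
        print(vcpus, "processors ->", predict_thread_pool_size(vcpus), "predict threads")
```

Under these assumptions, a single 4-vCPU data node would issue at most 8 concurrent predict calls, which is consistent with the observation earlier in the thread that moving to an instance with more CPUs increased the number of documents sent for inference in parallel.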

Thanks @zane_neo , this makes sense!

Will MAX_CONNECTION be configurable via PUT _cluster/settings?

We can’t set the node.processors value as we use OpenSearch deployed in Amazon and we don’t have access to that config so we’ll have to wait for the httpclient fix.

In fact, our current http client implementation is not optimal. I’ve created this issue to track it: [FEATURE] Replace blocking httpclient with async httpclient in remote inference · Issue #1839 · opensearch-project/ml-commons · GitHub. BTW, if you can scale out to more instances (they don’t need to be high-performance ones) and deploy your model to them, that can be a workaround to improve parallelism.

Great, thank you @zane_neo ! We really appreciate it!

How did you deploy the model via SageMaker? Could you please guide me?

Hi @darvel, sorry for the late reply. Which version of OpenSearch do you use? If you use Amazon OpenSearch Service in AWS, you can find the integration in the UI, and it will help you deploy the sparse model on SageMaker. For the open-source version, please make sure your sparse endpoint takes a list of strings as input and outputs a list of <String, float> dictionaries.
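The input/output contract described above can be checked with a short Python helper before wiring a custom endpoint into OpenSearch. This is only a sketch: the function name is hypothetical, the sample tokens and weights are made up, and the requirement that the endpoint returns exactly one dictionary per input string is an assumption layered on top of the description.

```python
def is_valid_sparse_output(inputs, outputs):
    """Check that outputs is a list of {token: weight} dicts.

    Assumption: one dictionary per input string, in the same order.
    """
    if not isinstance(outputs, list) or len(outputs) != len(inputs):
        return False
    for item in outputs:
        if not isinstance(item, dict):
            return False
        for token, weight in item.items():
            if not isinstance(token, str):
                return False
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(weight, bool) or not isinstance(weight, (int, float)):
                return False
    return True


if __name__ == "__main__":
    sample_inputs = ["hello world", "neural sparse search"]
    # Made-up token weights, for shape illustration only.
    sample_outputs = [
        {"hello": 1.2, "world": 0.8},
        {"neural": 1.5, "sparse": 1.1, "search": 0.9},
    ]
    print(is_valid_sparse_output(sample_inputs, sample_outputs))
```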