How to scale neural sparse ingestion pipeline

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

AOS 2.11

Describe the issue:

This question comes as a continuation of this thread: How to register sparse encoding model in AWS OpenSearch - #9 by grunt-solaces.0h

Setup summary:

  • we use neural sparse search with one of the pretrained models deployed in SageMaker on a G4 instance
  • the ingestion is pretty slow, but usable
  • the same model is used both for ingestion and search (i.e. same model_id in OS queries; a sketch of this setup follows the list)
  • the size of the opensearch_ml_predict thread pool is 24 (3 nodes x 8 vCPUs)
  • the SageMaker instance is heavily underutilised
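
For reference, here is a minimal sketch of this kind of setup with the Python client. It is an illustration only: the index name "docs", the pipeline name "sparse-pipeline", the source field "text", and the sparse field "text_sparse" (mapped as rank_features) are assumptions, not our actual configuration.

from opensearchpy import OpenSearch

# assumed local client and deployed model ID, for illustration only
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
model_id = "<sparse-encoding-model-id>"

# ingest pipeline with the sparse_encoding processor; this is what creates one ML inference task per document
client.transport.perform_request(
    "PUT",
    "/_ingest/pipeline/sparse-pipeline",
    body={
        "processors": [
            {"sparse_encoding": {"model_id": model_id, "field_map": {"text": "text_sparse"}}}
        ]
    },
)

# search side: a neural_sparse query using the same model_id
client.search(
    index="docs",
    body={
        "query": {
            "neural_sparse": {
                "text_sparse": {"query_text": "example user query", "model_id": model_id}
            }
        }
    },
)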

The problem occurs when we have spikes in ingestion, i.e. a large number of documents to ingest at once. They are pushed in batches of 1000. The documents get queued and sent to the SageMaker machine by OS using the opensearch_ml_predict thread pool. If a search from a user reaches the cluster at the same time, it is queued as well, and it has to wait until all the already queued documents have been processed, which can take tens of seconds, causing a poor search experience for the user.
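
To watch the queuing as it happens, the ML predict thread pool can be inspected from the Python client (a sketch; the pool name opensearch_ml_predict is the one mentioned above):

# per-node active/queued/rejected counts for the ML predict thread pool
rows = client.transport.perform_request(
    "GET",
    "/_cat/thread_pool/opensearch_ml_predict",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
)
for row in rows:
    print(row["node_name"], row["active"], row["queue"], row["rejected"])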

We’ve explored two approaches: increasing the ingestion throughput or prioritising the search requests coming from the user.

For the first approach, one option would be to add more machines to the cluster. But this seems less than ideal, as we would do it just to get more CPUs and the increase in throughput is negligible. We would need to increase the cluster considerably to get a real improvement.

For the second approach, we’ve thought of using _prefer_nodes when performing the search. However, it’s not clear whether the search would then ignore data stored on nodes that are not included in the parameter.
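
For reference, the preference would be passed roughly like this with the Python client (a sketch with placeholder node IDs; whether shards whose only copies live on other nodes are still searched is exactly the open question):

# hint the coordinator to route shard requests to the given nodes where possible
client.search(
    index="docs",
    preference="_prefer_nodes:nodeId1,nodeId2",
    body={"query": {"match_all": {}}},
)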

Another option that we’ve experimented with is reducing the batch of documents that are sent for indexing to <= 24. That way the thread pool queue is not filled and search requests coming from the user can be processed fast enough. However, this seems very limiting.
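
For anyone reproducing this mitigation, a minimal sketch with the opensearch-py bulk helper, assuming an index named "docs" whose default pipeline runs the sparse_encoding processor:

from opensearchpy import helpers

# keep each bulk request at or below the 24 available opensearch_ml_predict threads
actions = ({"_index": "docs", "_source": {"text": f"doc text {i}"}} for i in range(1000))
helpers.bulk(client, actions, chunk_size=24)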

Any suggestions are more than welcome.

Note: the replacement of the blocking httpclient with the async version seems to be in progress (highly appreciated) but the release schedule is not clear. GH issue here: [FEATURE] Replace blocking httpclient with async httpclient in remote inference · Issue #1839 · opensearch-project/ml-commons · GitHub

Thank you!

Hi @grunt-solaces.0h, the async httpclient can fix this problem, and it is scheduled to be released in version 2.15.

With AOS 2.11, there is another workaround for this problem. In OpenSearch 2.11, each ML remote inference call blocks an opensearch_ml_predict thread, and with the bulk API one ML remote inference task is created for every single document. So, as you observed, a bulk request of size 1000 queues 1000 tasks.

However, with the model predict API, only one task is created for the whole batch of input documents. Please set the batch size carefully to avoid a CUDA out-of-memory exception; for a g4dn instance a batch size between 50 and 100 should be fine. Therefore, for your use case, you can split the documents into batches of 50-100, call the predict API to encode each batch in sequence, and then directly ingest the documents with the sparse vectors already attached (make sure to remove the sparse_encoding processor from the index, otherwise it still creates model inference tasks). With this workaround, ingestion only consumes one opensearch_ml_predict thread and won't block the search.

Here is a code example that uses the Python client to encode a batch of docs:

# `client` is an opensearch-py client and `model_id` is the ID of the deployed sparse encoding model
# extract the raw texts from the docs
docs = ["doc text 1", "doc text 2", "doc text 3"]

# a single predict call encodes the whole batch
response = client.transport.perform_request(
    "POST",
    f"/_plugins/_ml/models/{model_id}/_predict",
    body={"parameters": {"input": docs}},
)

# one sparse vector (token -> weight map) per input document
sparse_vectors = response["inference_results"][0]["output"][0]["dataAsMap"]["response"]
# then you should set the sparse fields in docs using these sparse vectors
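
And a sketch of the full loop around this, continuing the snippet above. The index name "docs", the sparse field name "text_sparse", and the batch size of 64 are assumptions; it also assumes the response holds one sparse vector per input document, and that the target index has no sparse_encoding processor attached:

from opensearchpy import helpers

index_name = "docs"   # assumed index with a rank_features field "text_sparse"
batch_size = 64       # within the 50-100 range recommended above

def encode_batch(texts):
    # one predict call per batch -> one opensearch_ml_predict thread, whatever the batch size
    response = client.transport.perform_request(
        "POST",
        f"/_plugins/_ml/models/{model_id}/_predict",
        body={"parameters": {"input": texts}},
    )
    return response["inference_results"][0]["output"][0]["dataAsMap"]["response"]

def ingest(all_docs):
    # all_docs: list of dicts like {"text": "..."}
    for start in range(0, len(all_docs), batch_size):
        batch = all_docs[start:start + batch_size]
        vectors = encode_batch([doc["text"] for doc in batch])
        actions = [
            {"_index": index_name, "_source": {**doc, "text_sparse": vec}}
            for doc, vec in zip(batch, vectors)
        ]
        # plain bulk ingest: no sparse_encoding processor runs, so no extra ML tasks are queued
        helpers.bulk(client, actions)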

Latest update: the async http client feature has been merged for the 2.14 release: Change httpclient to async by zane-neo · Pull Request #1958 · opensearch-project/ml-commons · GitHub


Thanks for the alternative solution and for the async client update!