Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
AOS 2.11
Describe the issue:
This question comes as a continuation of this thread: How to register sparse encoding model in AWS OpenSearch - #9 by grunt-solaces.0h
Setup summary:
- we use neural sparse search with one of the pretrained models deployed in SageMaker on a G4 instance
- the ingestion is pretty slow, but usable
- the same model is used both for ingestion and search (i.e. the same `model_id` in OS queries); a rough sketch of this setup is shown after the list
- the size of the `opensearch_ml_predict` thread pools is 24 (3 nodes x 8 vCPUs)
- the SageMaker instance is heavily underutilised
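For reference, a simplified sketch of how ingestion and search are wired to the same model, written with the Python client purely for illustration (connection details, index, field, and pipeline names are placeholders; `MODEL_ID` stands for the id of the sparse encoding model registered against the SageMaker connector):

```python
from opensearchpy import OpenSearch

# Placeholder connection details.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

MODEL_ID = "<id of the sparse encoding model behind the SageMaker connector>"

# Ingest side: the sparse_encoding processor encodes documents via the remote model.
client.ingest.put_pipeline(
    id="sparse-encoding-pipeline",
    body={
        "processors": [
            {
                "sparse_encoding": {
                    "model_id": MODEL_ID,
                    "field_map": {"text": "text_sparse"},
                }
            }
        ]
    },
)

# Search side: neural_sparse query against the same model_id.
response = client.search(
    index="my-index",
    body={
        "query": {
            "neural_sparse": {
                "text_sparse": {
                    "query_text": "example user query",
                    "model_id": MODEL_ID,
                }
            }
        }
    },
)
```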
The problem occurs when we have spikes in ingestion, i.e. a large number of documents to ingest at once. They are pushed in batches of 1000. What happens is that the documents get queued and sent to the SageMaker machine by OS using the `opensearch_ml_predict` thread pool. If at the same time a search coming from the user reaches the cluster, it is also queued. But it obviously has to wait until all the already queued documents have been processed, which can take tens of seconds, resulting in a poor search experience for the user.
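A rough way to watch this happening (again with the Python client, placeholder connection details) is to poll the cat thread pool API for the `opensearch_ml_predict` pool while a bulk is in flight and observe the queue column growing:

```python
from opensearchpy import OpenSearch

# Placeholder connection details.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Raw cat API call for the ml predict thread pool; during an ingestion spike the
# "queue" column grows, and any search-triggered predict calls wait behind it.
stats = client.transport.perform_request(
    "GET",
    "/_cat/thread_pool/opensearch_ml_predict",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(stats)
```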
We’ve explored two approaches: increasing the ingestion throughput, or prioritising the search requests coming from the user.
For the first approach, one option would be to add more machines to the cluster. This seems less than ideal, though: we would be adding nodes just to get more CPUs, and the resulting increase in throughput is negligible. We would need to grow the cluster considerably to get a real improvement.
For the second approach, we’ve thought of using `_prefer_nodes` when performing the search. However, it’s not clear whether this would ignore data stored on the nodes that are not included in the parameter.
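Concretely, we had something like the following in mind (node IDs and all other names are placeholders); the open question is how shards whose copies live only on the other nodes would be handled:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection
MODEL_ID = "<id of the sparse encoding model>"  # same model as used for ingestion

# Prefer a subset of nodes for the user-facing search.
response = client.search(
    index="my-index",
    body={
        "query": {
            "neural_sparse": {
                "text_sparse": {
                    "query_text": "example user query",
                    "model_id": MODEL_ID,
                }
            }
        }
    },
    preference="_prefer_nodes:node-id-1,node-id-2",
)
```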
Another option that we’ve experimented with is reducing the batch of documents sent for indexing to <= 24. This way, the thread pool queue does not fill up and search requests coming from the user can be processed fast enough. However, this seems very limiting.
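For completeness, that experiment looked roughly like this (Python client, placeholder index and documents; it assumes the index has the sparse-encoding pipeline set as its default pipeline):

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder connection

# Placeholder documents; in reality these arrive in spikes of ~1000 at a time.
docs = [{"text": f"passage {i}"} for i in range(1000)]

def actions():
    for doc in docs:
        yield {"_index": "my-index", "_source": doc}

# chunk_size <= 24 (the total opensearch_ml_predict pool size across the 3 nodes)
# keeps each bulk small enough that the predict queue stays short and user searches
# are not stuck behind a long backlog.
helpers.bulk(client, actions(), chunk_size=24)
```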
Any suggestions are more than welcome.
Note: the replacement of the blocking httpclient with the async version seems to be in progress (highly appreciated), but the release schedule is not clear. GH issue here: [FEATURE] Replace blocking httpclient with async httpclient in remote inference · Issue #1839 · opensearch-project/ml-commons · GitHub
Thank you!