[RFC] neural sparse models improvement plan

Hi all, I’m a contributor to the neural sparse feature. We have several possible directions for improving it, and we’d like to collect feedback directly from our users.

Here are the directions we are considering:

  1. Smaller model size and faster model inference. This would increase neural sparse ingestion throughput and speed up bi-encoder search.
  2. Better search relevance.
  3. Lower end-to-end search latency.
  4. Ease-of-use improvements:
    a. Code demos for using neural sparse ingest/search end to end.
    b. Code demos for deploying neural sparse models on GPU endpoints (SageMaker or self-hosted text encoding).
    c. Code demos for fine-tuning neural sparse models on custom datasets.
    d. A one-click API to deploy the models and set up ingest/search pipelines.
  5. Multilingual sparse-encoding models.
  6. Multimodal sparse-encoding models.

Your feedback will help us prioritize this work. Could you please leave your comments on these improvements?

Other suggestions:

  • Include a code demo using text chunking and sparse vectors in nested fields
  • Make sure ML tools (RAG tool, etc.) support sparse vectors in nested fields

Thanks so much!

Hi @zhichao-aws,

Our main pain points are around ingestion throughput (1) and search latency (3). This is mainly due to the shared threadpool in OS.

Another one was 4b: it was not obvious from the docs that the model cannot be deployed inside OS and that we need to deploy it in SageMaker.

Relevant forum threads with additional details:

Thanks for working on this!

Hi @grunt-solaces.0h, for ingestion throughput, we’re training models with 0.5x and 0.25x the parameters, using enhanced pretraining procedures. We’re seeing preliminary results now, and I expect these new models to be available in a few months.
Once the ml-commons throttling issues are fixed, high throughput should also be achievable with large batch sizes.

For search latency, we strongly recommend upgrading to AOS 2.13, which went GA several days ago. Retrieval latency on the inverted index has improved by a large margin, and we’ll also publish a blog post describing the speedup with some quantitative analysis. With AOS 2.13, the model can also be deployed inside OS with an API call like this:

POST /_plugins/_ml/models/_register
{
    "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1",
    "version": "1.0.1",
    "model_format": "TORCH_SCRIPT"
}
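
For anyone scripting this instead of using the dev console, here is a minimal Python sketch of the same register call. The helper names (`build_register_body`, `register_model`) and the `base_url` argument are hypothetical, and the sketch assumes a cluster reachable without authentication or custom TLS settings; adjust for your own setup. It does not assume anything about the shape of the response beyond it being JSON.

```python
import json
import urllib.request

# Endpoint path taken from the request shown in this thread.
REGISTER_PATH = "/_plugins/_ml/models/_register"


def build_register_body() -> dict:
    """Return the JSON body from the register request shown in this thread."""
    return {
        "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    }


def register_model(base_url: str) -> dict:
    """POST the register request to a cluster at base_url (assumed: no auth)."""
    data = json.dumps(build_register_body()).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + REGISTER_PATH,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Parse whatever JSON the cluster returns; its fields are not assumed here.
        return json.load(response)
```

Calling `register_model("http://localhost:9200")` would send the request; the URL is only a placeholder.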

In 2.15 we plan to onboard neural sparse two-phase search, which speeds up bi-encoder raw sparse vector retrieval by 4x to 8x in our experiments.
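
To give a rough idea of why a two-phase approach can be cheaper, here is a toy in-memory sketch in Python. It is not the OpenSearch implementation, which this thread does not describe. It assumes one plausible scheme: retrieve candidates using only the high-weight query tokens, then rescore that small candidate window with the remaining low-weight tokens. All names here (`split_query`, `two_phase_search`, `prune_ratio`, `window`) are hypothetical.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]


def dot(query: SparseVector, doc: SparseVector) -> float:
    """Sparse dot product over the tokens present in the query."""
    return sum(weight * doc.get(token, 0.0) for token, weight in query.items())


def split_query(query: SparseVector, prune_ratio: float) -> Tuple[SparseVector, SparseVector]:
    """Split query tokens into high- and low-weight parts.

    Assumption: a token is "high" if its weight is at least
    prune_ratio * (largest weight in the query).
    """
    if not query:
        return {}, {}
    threshold = prune_ratio * max(query.values())
    high = {t: w for t, w in query.items() if w >= threshold}
    low = {t: w for t, w in query.items() if w < threshold}
    return high, low


def two_phase_search(
    query: SparseVector,
    docs: Dict[str, SparseVector],
    k: int,
    window: int,
    prune_ratio: float = 0.4,
) -> List[Tuple[str, float]]:
    """Return the top-k (doc_id, score) pairs using a two-phase scheme."""
    high, low = split_query(query, prune_ratio)
    # Phase 1: score every document with the high-weight tokens only
    # and keep the best `window` candidates.
    candidates = sorted(
        ((doc_id, dot(high, vec)) for doc_id, vec in docs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )[:window]
    # Phase 2: add the low-weight token contribution for candidates only.
    rescored = [(doc_id, score + dot(low, docs[doc_id])) for doc_id, score in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]
```

In this sketch the saving comes from phase 1 touching fewer query tokens across the whole corpus, while the full query is only evaluated on `window` documents. The trade-off is that a document outside the phase 1 window can never be recovered in phase 2.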