Is there a way to create Sparse Neural index that uses raw vectors?

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.12
CentOS 7

Describe the issue:
Is it possible to create an index for sparse neural searching, but without using the pipeline to generate the vectors at index time?

The inference step when indexing content makes indexing very slow.

I’m trying to figure out if there’s a way I can cache the vectors so that when I need to re-build an index I can just use the cached vectors. That way I can quickly re-build or re-index our data.

Here’s the workflow I’m curious if I can achieve:

  1. Run text through a sparse model to generate the sparse vectors (such as with the _simulate endpoint; see the sketch after this list).
  2. Take the vectors generated and store them outside of OpenSearch along with our original data.
  3. When indexing an article, pass in the pre-generated vectors (along with all the other index information), skipping the pipeline inference step.
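For step 1, one option is the ingest pipeline _simulate API mentioned above. This is just a sketch; the pipeline name my-sparse-pipeline and the passage_text field are placeholders for illustration:

```
POST /_ingest/pipeline/my-sparse-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "hello demo"
      }
    }
  ]
}
```

The simulated documents in the response should contain the generated token/weight map, which could then be stored alongside the original source data.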

This workflow would provide a few benefits:

  • We often have to update a document in an index, and not all of the information being converted to vectors has changed. If we handle the vectorization outside of the index step, we can skip creating vectors for data that has not changed.
  • We often re-build indices. With non-ML indexes, we can generally re-build all of our indexes in minutes (we’re not dealing with millions of docs, fewer than 200,000). However, it generally takes us 2-4 seconds per document once we enable the ML pipeline, which makes re-building an index impractical. If we could just cache the vectors with our main document source data, we believe the re-index time would be much closer to what it is now.
  • It gives us a strategy for migrating models. If we split the vectorization process from indexing, then when we want to switch models we can run a job to create new vectors based on the new model, and once all the documents are completed, we can re-index using the new vector information. While we could do this by versioning our indices, we like being able to do this outside of the index process.

I’ve looked through the documentation and I’ve been searching, but I don’t see any examples of how this could be done.

I did find that someone wanted to do the same kind of thing for queries; that seems to be targeted for OpenSearch 2.14.

Is there a way to accomplish this?

To answer my own question, it appears you can use the Predict or _simulate endpoint APIs to generate the embeddings and then just pass the resulting vector values in when indexing your document instead of using the pipeline.
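For example, the ML Commons Predict API can be called directly against a deployed sparse encoding model. This is only a sketch, with <model_id> standing in for a deployed model ID; the exact request and response format may vary by version:

```
POST /_plugins/_ml/models/<model_id>/_predict
{
  "text_docs": ["hello demo"]
}
```

The token/weight map returned in the response can be stored externally and later written straight into a rank_features field, as shown in the reply below.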

Is there any reason this might be a bad idea?

Hi @dswitzer2, if you want to ingest data without a pipeline, you can do it like this.
Create the index without a pipeline:

```
PUT /demo_index/
{
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_sparse": {
        "type": "rank_features"
      }
    }
  }
}
```

Index data:

```
POST /demo_index/_doc/1
{
  "passage_text": "hello demo",
  "passage_sparse": {"demo": 0.5, "hello": 1.0}
}
```

Then you can search this index using a neural_sparse query clause.
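For example, a query against the index above might look like this (a sketch, assuming a sparse encoding model has been registered and deployed with ID <model_id>):

```
GET /demo_index/_search
{
  "query": {
    "neural_sparse": {
      "passage_sparse": {
        "query_text": "hello",
        "model_id": "<model_id>"
      }
    }
  }
}
```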

Yes, I can confirm this works.

For your idea here, this is actually what we do in the ingestion pipeline. If you want to store a copy outside OpenSearch, this is a good solution.

Thanks for confirming!

When we re-index content, a lot of the time the data being embedded has not changed, so being able to cache our embeddings will give us a boost. This will be especially useful when migrating between environments, since we can skip the inference process completely when we re-index content into a fresh environment.
