Predict API slower than index pipeline

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.12
CentOS 7

Describe the issue:
I’m looking at using the Predict API to build my sparse embedding vectors so that I can cache them outside of OpenSearch.

However, I’ve noticed that using the Predict API to generate the vectors appears to be much slower than when I just index the same content using a pipeline.

For example, if I use the Predict API to generate the vector embeddings for a number of documents, using something like the following, it’s about 1.5x slower than if I just use a pipeline to generate the embeddings:

POST /_plugins/_ml/_predict/sparse_encoding/MY_MODEL_ID_HERE
{"text_docs":["doc 1 text", "doc 2 text", "etc, etc, etc"]}

Relevant Logs or Screenshots:
Here’s what my internal logs show when using the Predict API to generate the embeddings:

Starting the index (500 documents)...
Processed Chunk #1 (100 of 500)... [@ 3m 3s]
Processed Chunk #2 (200 of 500)... [@ 6m 52s]
Processed Chunk #3 (300 of 500)... [@ 10m 0s]
Processed Chunk #4 (400 of 500)... [@ 12m 52s]
Processed Chunk #5 (500 of 500)... [@ 15m 49s]
Refreshing index...
Finished indexing in 15m 50s!

While if I have the pipeline handle the embeddings, this is what I see:

Starting the index (500 documents)...
Processed Chunk #1 (100 of 500)... [@ 1m 14s]
Processed Chunk #2 (200 of 500)... [@ 2m 32s]
Processed Chunk #3 (300 of 500)... [@ 3m 36s]
Processed Chunk #4 (400 of 500)... [@ 4m 42s]
Processed Chunk #5 (500 of 500)... [@ 5m 57s]
Refreshing index...
Finished indexing in 5m 57s!

I’ve tried using the Predict API both on single documents and on multiple documents in bulk. It appears to be more efficient to send the documents to the Predict API in bulk, but that’s still much slower than using a pipeline.

Is this extra overhead because of the resources needed to return the embeddings?

Is there a difference in the way that the pipeline method actually generates the embeddings that makes things faster?

Are you using a remote connector for the sparse model deployment, or do you deploy it locally?

I think it’s a local deployment, since the function name is sparse_encoding rather than a remote connector.

Hi @dswitzer2, can you share more details? E.g., the number of OpenSearch nodes, how you invoke the Predict API, and how you invoke the Bulk API?

Based on the current information, if you invoke the Predict and Bulk APIs with a request body containing 100 documents, and you have multiple cluster nodes, this result is expected. The reason is that the OpenSearch cluster handles a Predict API invocation with round-robin scheduling: even if there are 100 documents in one API call, only one node will handle that request. For the Bulk API, however, we call the Predict API internally for each document, so there are 100 internal Predict API invocations and all ML nodes participate in the inference. So the performance gap is caused by 1 node vs. multiple nodes. To make the Predict API faster, you can break the request into multiple smaller requests and call them in parallel.
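For example, something like this rough Python sketch of the idea (the endpoint, model ID, batch size, and thread count are just placeholders to illustrate splitting one big Predict request into smaller concurrent ones):

import requests
from concurrent.futures import ThreadPoolExecutor

# Illustrative values -- replace with your own endpoint and model ID.
PREDICT_URL = "http://localhost:9200/_plugins/_ml/_predict/sparse_encoding/MY_MODEL_ID_HERE"

def predict(texts):
    # One Predict call per small batch of texts.
    resp = requests.post(PREDICT_URL, json={"text_docs": texts})
    resp.raise_for_status()
    return resp.json()

def predict_parallel(all_texts, batch_size=10, workers=4):
    # Split the documents into small batches and run the Predict calls concurrently,
    # so more than one inference request is in flight at a time.
    batches = [all_texts[i:i + batch_size] for i in range(0, len(all_texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(predict, batches))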

If the above is not the root cause, can you share the ML model stats for the Bulk and Predict APIs? Thanks! ref: Profile - OpenSearch Documentation
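For reference, you can pull those stats from the ML Profile API with something like this (assuming a local cluster without security enabled; the model ID is a placeholder):

import requests

# ML Commons Profile API: per-node stats for a deployed model.
MODEL_ID = "MY_MODEL_ID_HERE"
resp = requests.get(f"http://localhost:9200/_plugins/_ml/profile/models/{MODEL_ID}")
print(resp.json())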

I was wondering if the bulk insert with the pipeline was using asynchronous operations for the pipeline embeddings, whereas the Predict API ends up being single-threaded. I was hoping that feeding multiple docs to the Predict API would perform the same kind of asynchronous operations that the bulk indexing with a pipeline does.

This is actually a single node at the moment, which is why I was surprised by the results. I’m just prototyping some things and playing around, so everything is being done in a local VM.

As for how I’m invoking the Predict API, I listed that above:

POST /_plugins/_ml/_predict/sparse_encoding/MY_MODEL_ID_HERE
{"text_docs":["doc 1 text", "doc 2 text", "etc, etc, etc"]}

(But the “doc X text” would be the actual long text for which I’m trying to get the embeddings.)

In doing some more testing, what I’ve found is that if I use the pipeline’s Simulate endpoint, I get the same kind of performance as the bulk insert.

So if I do this instead, I’m seeing much, much better performance (at least 3 times as fast):

POST /_ingest/pipeline/MY_INGEST_PIPELINE_HERE/_simulate
{
	"docs": [
		  {"_id": "1", "_source":{"article": "doc 1 text"}}
		, {"_id": "2", "_source":{"article": "doc 2 text"}}
		, {"_id": "X", "_source":{"article": "etc, etc, etc"}}
	]
}

And here’s what my ingest pipeline looks like:

PUT /_ingest/pipeline/neural_sparse_ingest_pipeline
{
	"description": "An example neural sparse encoding pipeline",
	"processors": [
		{
			"sparse_encoding": {
				"model_id": "MY_MODEL_ID_HERE",
				"field_map": {
						"article": "article_embedding"
				}
			}
		}
	]
}
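
For completeness, the bulk indexing that produced the faster pipeline timings is done roughly like this (a sketch of my client code; the index name, field names, and endpoint are illustrative, and the pipeline is applied via the _bulk pipeline query parameter):

import json
import requests

# Illustrative sketch: bulk index documents through the sparse encoding pipeline.
BULK_URL = "http://localhost:9200/my_index/_bulk?pipeline=neural_sparse_ingest_pipeline"

def bulk_index(docs):
    # Build the newline-delimited bulk body: an action line followed by the document source.
    lines = []
    for doc_id, article in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps({"article": article}))
    body = "\n".join(lines) + "\n"
    resp = requests.post(BULK_URL, data=body, headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return resp.json()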

And here’s the Model I’m using:

POST /_plugins/_ml/models/_register
{
	"name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1",
	"version": "1.0.1",
	"model_group_id": "MY_MODEL_GROUP_ID_HERE",
	"description": "This is a neural sparse encoding model: It transfers text into sparse vector, and then extract nonzero index and value to entry and weights. It serves only in ingestion and customer should use tokenizer model in query.",
	"model_format": "TORCH_SCRIPT",
	"function_name": "SPARSE_ENCODING",
	"model_content_hash_value": "9a41adb6c13cf49a7e3eff91aef62ed5035487a6eca99c996156d25be2800a9a",
	"url": "https://artifacts.opensearch.org/models/ml-models/amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1/1.0.1/torch_script/neural-sparse_opensearch-neural-sparse-encoding-doc-v1-1.0.1-torch_script.zip"
}

I’m assuming the “simulate” endpoint uses the same kind of asynchronous logic that an actual index does.

However, is there a reason that the Predict API should be slower than simulating the pipeline in a single-node environment?

Here’s the log from using Simulate:

Starting the index (500 documents)...
Processed Chunk #1 (100 of 500)... [@ 1m 16s]
Processed Chunk #2 (200 of 500)... [@ 2m 23s]
Processed Chunk #3 (300 of 500)... [@ 3m 38s]
Processed Chunk #4 (400 of 500)... [@ 5m 16s]
Processed Chunk #5 (500 of 500)... [@ 7m 0s]
Refreshing index...
Finished indexing in 7m 0s!

Compared to the Predict API, it’s much more in line with the bulk index using the pipeline.

Also, I can confirm that if, instead of doing a bulk Predict call, I change my code to make my Predict calls asynchronous, I get speeds much more in line with what I expect:

Starting the index (500 documents)...
Processed Chunk #1 (100 of 500)... [@ 1m 14s]
Processed Chunk #2 (200 of 500)... [@ 2m 29s]
Processed Chunk #3 (300 of 500)... [@ 3m 49s]
Processed Chunk #4 (400 of 500)... [@ 5m 24s]
Processed Chunk #5 (500 of 500)... [@ 7m 22s]
Refreshing index...
Finished indexing in 7m 22s!

It would be great if the Predict API had a parallel/async option so that bulk operations could be done in parallel instead of having them be synchronous.

Hi @dswitzer2, I think the reason is that for the Bulk API with a pipeline, we handle each document as a separate request, and there are several threads dealing with those requests. For the Predict API, one call is only one request, even with several texts inside. But if you make your calls asynchronous, there are likewise several threads handling your Predict calls at the same time. So the bottleneck for both the pipeline and the Predict API is the model latency, which is why the two methods finish in about the same time.