Our system has been using hybrid search (BM25 + dense semantic) for a while now. We recently became aware of sparse vectors in OpenSearch, but it isn't clear to us how they might fit into a hybrid search and reranking setup.

In terms of relevancy, how exactly do they differ from dense embeddings? The only benefit I can see in the documentation is reduced memory, and some of the vector-related details are unclear to me since I'm not an ML engineer. Perhaps someone can explain it more simply?

I've read in a few places that sparse vectors excel at keyword-based search, but we already use BM25 for keyword search. How do sparse vectors compare to BM25 / lexical search?

Finally, what is the recommended way to do hybrid search: lexical + dense, sparse + dense, or lexical + sparse + dense?
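For reference, our current setup looks roughly like the sketch below: a single OpenSearch `hybrid` query with a lexical leg and a dense (`neural`) leg. The field names `text` and `embedding` and the model id are placeholders, not our real mapping.

```python
# Sketch of an OpenSearch hybrid query body combining a lexical (match)
# clause with a dense (neural) clause. Field names ("text", "embedding")
# and the model_id are hypothetical placeholders.
def build_hybrid_query(query_text: str, model_id: str, k: int = 50) -> dict:
    return {
        "query": {
            "hybrid": {
                "queries": [
                    # BM25 leg
                    {"match": {"text": {"query": query_text}}},
                    # dense leg: query text is embedded server-side by model_id
                    {"neural": {"embedding": {
                        "query_text": query_text,
                        "model_id": model_id,
                        "k": k,
                    }}},
                ]
            }
        }
    }

body = build_hybrid_query("citrus fruit delivery", "my-model-id")
```

Score normalization between the two legs is handled by a search pipeline on our side.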
You may expect almost the same results. A sparse model is derived from a dense model, so it carries the same semantics in a different form (~1k floats vs. a few weighted terms from a ~30k-term dictionary). Sparse search suffers from the same issues as dense search: [1] a short query might not capture enough semantics (frankly speaking); [2] if you have a specific acronym or brand name, the model might have no idea about it (lexical search matches what you have in the index, not what the model saw in training); [3] it is barely able to handle typos. That said, I see one advantage of sparse over dense: with dense search there is no such thing as "not found"; it always finds the nearest something, whatever that is. With sparse search you can get zero results, and that signal might be important.
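The "not found" point can be shown with toy numbers (all values made up): a dense cosine similarity always yields some nearest neighbour, while a sparse dot product over non-overlapping terms is exactly zero.

```python
import math

def cosine(a, b):
    # dense similarity: always produces *some* score, even for unrelated texts
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sparse_dot(q, d):
    # sparse similarity: scores only over terms the query and document share
    return sum(w * d[t] for t, w in q.items() if t in d)

doc_dense = [0.1, 0.9, 0.2]
query_dense = [0.8, 0.05, 0.1]             # unrelated query still gets a score
doc_sparse = {"orange": 1.3, "citrus": 0.7}
query_sparse = {"gpu": 1.1, "cuda": 0.9}   # no overlapping terms at all

print(cosine(query_dense, doc_dense))      # nonzero: dense always "finds" something
print(sparse_dot(query_sparse, doc_sparse))  # 0: a genuine not-found
```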
Lexical search matches symbols to symbols: orange == orange, but it is unable to match oranges to citrus, so we have to patch it with explicit synonyms. Vector search handles this by learning from all the text on the internet and encoding sentences into thousands of (dense) floats. We can think of sparse search as weighted query (and index!) expansion, i.e. auto-synonyms with wise boosting. The three major issues of sparse search are enumerated above.
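A minimal sketch of that "auto-synonyms with wise boosting" idea, using a hand-written expansion table in place of a learned sparse encoder (the terms and weights are invented for illustration):

```python
# Pretend this table came out of a learned sparse encoder (e.g. a SPLADE-style
# model); here it is hand-written for illustration only.
EXPANSIONS = {
    "oranges": {"orange": 1.2, "oranges": 1.0, "citrus": 0.6, "fruit": 0.4},
}

def encode_sparse(text: str) -> dict:
    # expand each surface token into weighted dictionary terms
    vec = {}
    for token in text.lower().split():
        for term, w in EXPANSIONS.get(token, {token: 1.0}).items():
            vec[term] = vec.get(term, 0.0) + w
    return vec

def score(query_vec: dict, doc_vec: dict) -> float:
    # dot product over the terms query and document share
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

doc = {"citrus": 1.0, "delivery": 0.8}  # index-side sparse vector of a document
q = encode_sparse("oranges")
print(score(q, doc))  # 0.6: "oranges" matches via the citrus expansion
```

A plain lexical match of "oranges" against this document would score zero; the expansion is what buys the synonym behaviour, with the weight acting as a soft boost.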
If you already have hybrid search, I don't expect a significant improvement from adding the same RoBERTa model one more time in sparse form (SPLADEv2). You would probably gain the ability to get zero results, which could trigger some merchandising rules or the like. But as always, trust no one; benchmark on your own dataset.
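For such a benchmark, a simple way to compare the combinations (lexical + dense, sparse + dense, all three) without tuning per-leg score weights is reciprocal rank fusion over the ranked lists each leg returns; a minimal sketch with made-up document ids:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # reciprocal rank fusion: each list contributes 1/(k + rank) per document
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# ranked ids from three hypothetical retrieval legs
lexical = ["d1", "d3", "d2"]
sparse  = ["d3", "d1", "d4"]
dense   = ["d2", "d3", "d1"]

print(rrf([lexical, sparse, dense]))  # d3 wins: near the top of all three lists
```

Swapping legs in and out of the `rankings` argument lets you score each combination against your relevance judgments on the same footing.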
From my experience: if high recall is your priority, dense embeddings are the way to go. However, if you need to avoid the added latency of query-to-dense-embedding conversion and want to use the natural language query as-is, sparse embeddings are the better choice.