Sparse embeddings vs. dense embeddings vs. lexical search

Hi all,

Our system has been using hybrid search (BM25 + dense semantic) for a while now. We recently became aware of sparse vectors in OpenSearch, but we're not clear on how they might fit into a hybrid search and reranking setup.

  • in terms of relevancy, how exactly do sparse embeddings differ from dense ones? From the documentation I can only see reduced memory usage. Some of the vector-related details are unclear to me since I'm not an ML engineer — perhaps someone can explain it more simply?
  • I've read in some sources that sparse vectors excel at keyword-based search, but we already use BM25 for keyword search. How do sparse vectors compare to BM25 / lexical search?
  • finally, what is the recommended way to do hybrid search? lexical + dense, sparse + dense, or lexical + sparse + dense?

Thanks all!

  • you may expect almost the same results. The sparse model is derived from a dense model, so it carries the same semantics in a different form (~1k floats vs. a few weighted terms from a ~30k-term dictionary). Sparse search suffers from the same issues as dense: [1] a short query might not capture "enough semantics" (frankly speaking); [2] if you have a specific acronym or brand name, the model might have no idea about it (lexical search matches what you have in the index, not what the model saw in training); [3] it can barely handle typos. I'd say I see one advantage of sparse over dense: with dense search there's no such thing as "not found" — it always finds the nearest something, whatever that is. With sparse search you can get zero results, and that might be important.
  • lexical search matches symbols to symbols: orange == orange. It can't match oranges to citrus, so we have to patch it with explicit synonyms. Vector search handles this by learning from huge amounts of text from the internet and encoding sentences as thousands of (dense) floats. You can think of sparse search as weighted query (and index!) expansion, i.e. auto-synonyms with sensible boosting. The three major issues of sparse search are already enumerated above.
  • if you already have hybrid search, I don't expect a significant improvement from adding the same RoBERTa model one more time in sparse form (SPLADEv2). You would probably be able to use the zero-results case to trigger some merchandising rules or so. But as always, trust no one — benchmark on your dataset.
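To make the dense-vs-sparse contrast above concrete, here is a toy sketch. The vectors and weights are made up for illustration (a real sparse encoder like SPLADE produces term weights over a ~30k vocabulary); the point is that a sparse score is just a dot product over shared terms, so a query with no overlapping terms genuinely scores zero, whereas dense similarity always returns *something*:

```python
# Hypothetical toy vectors -- not output from any real model.
# A dense embedding is a fixed-length float vector; a sparse
# embedding keeps only a few weighted vocabulary terms.

dense_doc = [0.12, -0.48, 0.33, 0.07]                 # normally ~1k floats
sparse_doc = {"orange": 1.8, "citrus": 0.9, "fruit": 0.4}

def sparse_dot(query: dict, doc: dict) -> float:
    """Score = sum of weight products over terms shared by query and doc."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)

# A query sharing a term with the doc scores above zero...
print(sparse_dot({"citrus": 2.0}, sparse_doc))   # -> 1.8
# ...but a query with no overlapping terms scores exactly 0 (a real "not found"):
print(sparse_dot({"laptop": 2.0}, sparse_doc))   # -> 0.0
```

This is the "weighted query/index expansion" view: the encoder has already expanded the document "orange" into related weighted terms ("citrus", "fruit"), so a citrus query matches without hand-written synonyms.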

From the SPLADEv2 whitepaper:

(2) the results are competitive with state-of-the-art dense retrieval methods

Also, IMHO: sparse retrieval might be faster at query time.

From my experience: if high recall is your priority, dense embeddings are the way to go. However, if you need to avoid the added latency of query-to-dense-embedding conversion and want to use the natural-language query as-is, sparse embeddings are the better choice.

Please elaborate: how do you get a sparse query encoder that is faster than a dense one? And how can you use the natural query as-is with sparse search?

We wrote a detailed blog post about this: Sparse vs Dense Vectors: How Lexical and Semantic Search Actually Work - BigData Boutique. Hybrid is usually vector (semantic) + keyword, so in that sense the "sparse" side will be BM25 (keyword search) combined with dense. But it really depends on your scale, your intended effort and cost, and how you define and measure "good results".
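For combining the keyword and vector sides of a hybrid setup, one common engine-agnostic approach is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming illustrative doc IDs and the conventional k=60 constant (this is not tied to any particular search engine's API):

```python
# Reciprocal Rank Fusion: merge several ranked result lists into one.
# Each doc gets score sum(1 / (k + rank)) over the lists it appears in,
# so docs ranked well by BOTH retrievers float to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # hypothetical keyword ranking
dense_hits = ["d1", "d5", "d3"]   # hypothetical dense ranking
print(rrf([bm25_hits, dense_hits]))   # -> ['d1', 'd3', 'd5', 'd7']
```

Because RRF only looks at ranks, it sidesteps the problem of BM25 scores and cosine similarities living on incompatible scales; the same function would accept a third list if you wanted lexical + sparse + dense.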