Sparse model prompt

For the sparse search models (e.g. amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill), is it possible to improve the embeddings by including a bit of context with the sequence?

Say I know all my documents are about computer programming. I’m experimenting with things like:

`mytext = 'In the following sequence focus on computer programming topics [SEP] ' + myactualtext`
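For reference, here's roughly what I'm doing. This is a minimal sketch assuming the Hugging Face checkpoint opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill and the usual SPLADE-style log(1 + ReLU) max pooling for these doc encoders; the document text is just a placeholder:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

context = "In the following sequence focus on computer programming topics"
myactualtext = "How do I profile a slow Python function?"  # placeholder document
mytext = context + " [SEP] " + myactualtext

features = tokenizer(mytext, return_tensors="pt")
with torch.no_grad():
    logits = model(**features).logits  # [1, seq_len, vocab_size]

# SPLADE-style pooling: log(1 + ReLU) token weights, masked, max-pooled over the sequence
weights = torch.log1p(torch.relu(logits)) * features["attention_mask"].unsqueeze(-1)
sparse_vec = weights.max(dim=1).values.squeeze(0)  # [vocab_size]

# Inspect the top non-zero terms in the resulting rank vector
nonzero = sparse_vec.nonzero().squeeze(-1).tolist()
terms = {tokenizer.convert_ids_to_tokens(i): sparse_vec[i].item() for i in nonzero}
print(sorted(terms.items(), key=lambda kv: -kv[1])[:20])
```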

This impacts the rank vector primarily in 2 ways:

  1. It finds some new non-zero terms
  2. It includes the terms from my prepended context and associated expanded terms

For #2, can I delete these from the rank vector somehow, or prevent them from getting ranked in the first place?
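To make #2 concrete, here's the kind of post-processing I had in mind (continuing the sketch above). It only strips the literal prefix tokens, not the expansion terms the context triggered:

```python
# Zero out the vocab entries corresponding to the prepended context's own tokens
context_ids = set(tokenizer(context, add_special_tokens=False)["input_ids"])
for token_id in context_ids:
    sparse_vec[token_id] = 0.0

# Remaining non-zero terms after removing the literal prefix tokens
remaining = {
    tokenizer.convert_ids_to_tokens(i): w
    for i, w in enumerate(sparse_vec.tolist())
    if w > 0
}
```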

Or am I just boiling the ocean, and should I accept the out-of-the-box performance without any such tinkering, since it's already very good?

Thank you!

Hi @nattaylor, I don't think this will improve performance. During training of these sparse models, we didn't add any context-related prompts, so we'd usually suggest keeping the test-time setup identical to the training-time setup. The performance with prompts hasn't been evaluated.


BTW, we recently released the v3-gte model on Hugging Face: https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte. We'd strongly suggest giving it a try :slight_smile:

Thank you for the insight! I hadn't considered train-time/test-time alignment, but that makes sense.

I will give the new model a try too.

Still, I want to emphasize how delighted I am with the excellent out-of-the-box performance!
