Neural sparse search returns all documents

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): OpenSearch 2.19

Describe the issue: I have deployed opensearch-neural-sparse-encoding-v2-distill for neural sparse search. I have indexed around 3,000 documents from an e-commerce catalog. I am using the name field, and I push a combined string (brand + category + product_name + features) into it. When I search for specific text such as "green tea", I get all 3,000 documents back, although only 70-80 products are relevant. What could be the issue?

Configuration:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model_id>",
        "prune_type": "max_ratio",
        "prune_ratio": 0.2,
        "field_map": {
          "sparse_fulltext": "name_sparse"
        }
      }
    }
  ]
}

POST items/_search
{
  "_source": {
    "excludes": ["name_v"]
  },
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "should": [
            {
              "neural_sparse": {
                "name_sparse": {
                  "query_text": "Green Tea",
                  "model_id": "<model_id>"
                }
              }
            },
            {
              "match": {
                "name": {
                  "query": "Green tea",
                  "operator": "and"
                }
              }
            }
          ],
          "minimum_should_match": 1
        }
      },
      "script": {
        "source": "_score"
      }
    }
  }
}

Relevant Logs or Screenshots:

{
  "took": 60,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2407,
      "relation": "eq"
    },
    "max_score": 31.46042,
    "hits": []
  }
}

@Mihir You mentioned that you are pushing a combined brand + category + product_name + features string into the name field. Do you mean into the "sparse_fulltext" field?

However, in general what you are seeing is expected. Hits are returned ordered by score, with the highest score at the top. Neural sparse (opensearch-neural-sparse-encoding-v2-distill) is a "learned sparse retriever": it expands the query and the documents into weighted tokens over a large vocabulary. Many documents will share at least one non-zero token with a given query, so they get non-zero scores and show up in hits.total.
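
To make that concrete, here is a toy sketch in plain Python (not the actual model or OpenSearch scoring code). It assumes the score is a dot product over shared tokens; the token weights and document names are made up for illustration.

```python
def sparse_score(query_vec, doc_vec):
    """Dot product over the tokens that the query and the document share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Hypothetical expansion of "green tea": the model adds related tokens
# ("drink", "leaf") with small weights next to the literal query terms.
query = {"green": 1.8, "tea": 2.1, "drink": 0.4, "leaf": 0.3}

# Hypothetical expanded documents.
docs = {
    "green tea bags": {"green": 1.5, "tea": 2.0, "drink": 0.5},
    "coffee beans": {"coffee": 2.2, "drink": 0.6},
    "phone case": {"phone": 2.0, "case": 1.7},
}

scores = {name: sparse_score(query, vec) for name, vec in docs.items()}

# Any document with a score above zero counts as a hit.
matched = [name for name, score in scores.items() if score > 0]

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this sketch "coffee beans" still counts as a hit because it shares the low-weight token "drink" with the expanded query, but its score is far below that of "green tea bags". That gap is what a score threshold can exploit.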

You can add min_score to limit the number of returned documents if needed. See the example below:

POST items/_search
{
  "size": 10,
  "min_score": 10,
  "_source": ["sku", "category", "name"],
  "query": {
    "neural_sparse": {
      "name_sparse": {
        "query_text": "green tea",
        "model_id": "<model_id>"
      }
    }
  }
}

Hi @Anthony, thanks for the prompt reply and for refactoring my question as well. Yes, I am appending those values in sparse_fulltext. Is that the wrong approach?
I am also applying a function score to boost my bestsellers to the top, and multiplying it by the in-stock status so that a relevant product that is out of stock does not appear at the top. With min_score added, the function score will not work as expected. Is there any way to handle this?

@Mihir Combining all of those strings into one field is not a good approach: many products share very generic terms, and those terms will match during search, which is exactly what is happening in your case. It would be much better to encode mainly product_name, plus perhaps a short category/brand string.
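
As a rough sketch of that idea, the text sent to the sparse encoder could be assembled like this before indexing. The helper name build_sparse_text and the product keys (product_name, brand, category, features) are assumptions based on the field names mentioned in this thread, not an OpenSearch API.

```python
def build_sparse_text(product):
    """Join product_name with a short brand/category string.

    The long, generic "features" text is deliberately left out of the
    value that goes to the sparse encoder.
    """
    parts = [
        product.get("product_name", ""),
        product.get("brand", ""),
        product.get("category", ""),
    ]
    # Skip missing or whitespace-only values so no stray spaces appear.
    return " ".join(p.strip() for p in parts if p and p.strip())


# Hypothetical catalog entry.
product = {
    "product_name": "Organic Green Tea 100 Bags",
    "brand": "Acme",
    "category": "Tea",
    "features": "antioxidant rich, healthy drink, gift pack, free shipping",
}

sparse_fulltext = build_sparse_text(product)
print(sparse_fulltext)
```

The features text can still be stored in its own field for display or lexical matching. It is only kept out of the string that the ingest pipeline encodes.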

Regarding min_score: it is applied after query scoring is done (including function_score or script_score), so you can still use it.
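
A toy model of that ordering, in plain Python rather than OpenSearch: the stock multiplier is applied first, and the threshold is then checked against the resulting score. The relevance scores, the 0.5 factor, and the helper name apply_min_score are made up for illustration.

```python
def apply_min_score(hits, min_score):
    """Keep hits whose final score reaches min_score, highest score first."""
    kept = [h for h in hits if h["final"] >= min_score]
    return sorted(kept, key=lambda h: h["final"], reverse=True)


# Hypothetical relevance scores coming out of the neural sparse query.
raw = [
    {"id": "green-tea-a", "relevance": 30.0, "in_stock": True},
    {"id": "green-tea-b", "relevance": 28.0, "in_stock": False},
    {"id": "tea-pot", "relevance": 4.0, "in_stock": True},
]

# Assumed penalty that stands in for the function score multiplier.
OUT_OF_STOCK_FACTOR = 0.5

for hit in raw:
    factor = 1.0 if hit["in_stock"] else OUT_OF_STOCK_FACTOR
    hit["final"] = hit["relevance"] * factor

result = [h["id"] for h in apply_min_score(raw, 10)]
print(result)
```

With these numbers the out-of-stock product stays in the results but ranks lower, while the weakly related product is dropped. A much smaller factor could push a relevant out-of-stock product under the threshold, so the threshold and the multiplier need to be tuned together.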

To handle the "in stock" requirement, it would be better to update the query to something like this:

"query": {
  "bool": {
    "filter": [
      { "term": { "in_stock": true } }
    ],
    "must": [
      {
        "neural_sparse": {
          "name_sparse": {
            "query_text": "Green Tea",
            "model_id": "<model_id>"
          }
        }
      }
    ]
  }
}

@Anthony, I don't want to remove out-of-stock products at all. They should appear at the end, even when they match exactly, so a filter will not help. A function score will reduce the score of out-of-stock products, but it can also boost a non-relevant product if it is a bestseller, since that product still appears in the neural sparse results. And if I set min_score: 10, my out-of-stock products will never be displayed.

@Mihir Have you thought about using a hybrid approach? If you want strict matching on a particular field, you can require the lexical match and let neural sparse contribute only to the ranking, as in the example below:

POST items/_search
{
  "size": 20,
  "_source": ["sku", "category", "name"],
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "Green Tea",
              "operator": "and"
            }
          }
        }
      ],
      "should": [
        {
          "neural_sparse": {
            "name_sparse": {
              "query_text": "Green Tea",
              "model_id": "<model_id>"
            }
          }
        }
      ]
    }
  }
}
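
If out-of-stock products should be demoted rather than removed, one possible sketch is to wrap the query above in a function_score that multiplies the score of out-of-stock documents by a weight below 1. The Python helper below only builds the request body. The helper name build_demotion_query and the 0.2 weight are assumptions, and the boolean in_stock field is taken from the filter example earlier in this thread. I have not tested this against your index.

```python
import json


def build_demotion_query(query_text, model_id, out_of_stock_weight=0.2):
    """Build a search body that demotes, but keeps, out-of-stock products."""
    return {
        "size": 20,
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Strict lexical match decides what is returned.
                        "must": [
                            {"match": {"name": {"query": query_text, "operator": "and"}}}
                        ],
                        # Neural sparse only adds to the score of those hits.
                        "should": [
                            {
                                "neural_sparse": {
                                    "name_sparse": {
                                        "query_text": query_text,
                                        "model_id": model_id,
                                    }
                                }
                            }
                        ],
                    }
                },
                # The weight applies only to documents matching this filter.
                "functions": [
                    {
                        "filter": {"term": {"in_stock": False}},
                        "weight": out_of_stock_weight,
                    }
                ],
                "score_mode": "multiply",
                "boost_mode": "multiply",
            }
        },
    }


body = build_demotion_query("Green Tea", "<model_id>")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST items/_search. A multiplier lowers out-of-stock products but does not guarantee that they come last, so the weight needs checking against real scores. A bestseller boost could be added as another entry in functions.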