DSL query_string returns no record for some searches

Hi all,
I’m using OpenSearch 1.2.4
In my OpenSearch cluster, I have one index with one document.

$ curl localhost:9200/_search?pretty
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6PPYvYgBtk-qpEG4Mocr",
        "_score" : 1.0,
        "_source" : {
          "message" : "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
        }
      }
    ]
  }
}

I’m trying to search for the keyword PDF_Document_ap1 using query_string.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "PDF_Document_ap1"
    }
  }
}'

And I’m getting the expected output.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6PPYvYgBtk-qpEG4Mocr",
        "_score" : 0.2876821,
        "_source" : {
          "message" : "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
        }
      }
    ]
  }
}

Now, I’m trying another keyword PDF_Document_sp1 from the same field.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "PDF_Document_sp1"
    }
  }
}'

This time, it’s not returning the expected output.

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

I tried another keyword, 500, from the same field.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "500"
    }
  }
}'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Can anyone help me find the cause here?

NOTE: I tried with OpenSearch 2.8.0 too, and the issue still persists.

Thanks.

Hey @manoj

Do you see anything in the log files that may help resolve this issue?

With the _analyze API, you can see the result of word segmentation:

POST _analyze
{
  "text": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
}

Part of the result:

    {
      "token": "pdf_document_ap1",
      "start_offset": 28,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "pdf_document_sp1,2023",
      "start_offset": 45,
      "end_offset": 66,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "106,91,500",
      "start_offset": 87,
      "end_offset": 97,
      "type": "<NUM>",
      "position": 10
    },

This result gives the answer: pdf_document_ap1 is indexed as a single token, but PDF_Document_sp1 and 500 are not, so you cannot get any results when you search for the keywords PDF_Document_sp1 and 500. The word segmentation comes from the default standard analyzer. You could define a custom analyzer for the message field, or use a wildcard in your query, though that will have poorer performance:

GET test/_search
{
  "profile": true, 
  "query": {
    "query_string": {
      "query": "*PDF_Document_sp1*"
    }
  }
}

Since this is just a search query, there is no info in the logs.

Is there any specific reason or logic behind this?
Like, why are “pdf_document_sp1” and “2023” combined into a single token instead of two individual tokens?

I think that’s because the 1 and the 2023 on either side of the comma are digits, so they end up in one token, and the same goes for 106,91,500. A comma is commonly used when printing big numbers (as in 1,000), so the standard analyzer does not split on a comma that sits between digits; see more here. The standard analyzer works well for words separated by spaces. Another idea is to use an ingest pipeline to replace the commas with spaces when writing documents to the index; that would solve the problem, so you can try it if possible.
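
For example, a minimal sketch of such a pipeline using the gsub processor (the pipeline name comma_to_space is just a placeholder):

PUT _ingest/pipeline/comma_to_space
{
  "description": "Replace commas with spaces so the standard analyzer splits on them",
  "processors": [
    {
      "gsub": {
        "field": "message",
        "pattern": ",",
        "replacement": " "
      }
    }
  ]
}

Then index documents through the pipeline:

POST test/_doc?pipeline=comma_to_space
{
  "message": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
}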

You can also change the analyzer (from standard to something else). Probably a custom analyzer with the pattern tokenizer and the lowercase token filter will do the job you’re looking for.
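
For instance, a minimal sketch of such an index definition (the index name test_comma and the analyzer/tokenizer names are just placeholders):

PUT test_comma
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma_analyzer": {
          "type": "custom",
          "tokenizer": "comma_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "comma_analyzer"
      }
    }
  }
}

You can check the resulting tokens with _analyze:

POST test_comma/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500"
}

With this setup, PDF_Document_sp1 and 500 each become their own lowercased token, so the query_string searches above should return the document.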