DSL query_string returns no record for some searches

Hi all,
I’m using OpenSearch 1.2.4
In my OpenSearch cluster, I have one index with one document.

$ curl localhost:9200/_search?pretty
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6PPYvYgBtk-qpEG4Mocr",
        "_score" : 1.0,
        "_source" : {
          "message" : "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
        }
      }
    ]
  }
}

I’m trying to search for the keyword PDF_Document_ap1 using query_string.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "PDF_Document_ap1"
    }
  }
}'

And I’m getting the expected output.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "6PPYvYgBtk-qpEG4Mocr",
        "_score" : 0.2876821,
        "_source" : {
          "message" : "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
        }
      }
    ]
  }
}

Now, I’m trying another keyword PDF_Document_sp1 from the same field.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "PDF_Document_sp1"
    }
  }
}'

This time, it’s not returning the expected output.

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

I tried another keyword, 500, from the same field.

curl -X GET "localhost:9200/test/_search?pretty" -H 'Content-Type: application/json' -d '
{
 "query": {
    "query_string": {
      "query": "500"
    }
  }
}'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Can anyone help me find the cause here?

NOTE: I tried with OpenSearch 2.8.0 too, and the issue still persists.

Thanks.

Hey @manoj

Do you see anything in the log files that may help resolve this issue?

With the _analyze API, you can see the result of word segmentation:

POST _analyze
{
  "text": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
}

Part of the result:

    {
      "token": "pdf_document_ap1",
      "start_offset": 28,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "pdf_document_sp1,2023",
      "start_offset": 45,
      "end_offset": 66,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "106,91,500",
      "start_offset": 87,
      "end_offset": 97,
      "type": "<NUM>",
      "position": 10
    },

This result gives the answer: pdf_document_ap1 is indexed as a single token, but PDF_Document_sp1 and 500 are not, so you cannot get any results when you search for the keywords PDF_Document_sp1 and 500. The word segmentation comes from the default standard analyzer. You could define a custom analyzer for the message field, or use a wildcard in your query, though that will have poorer performance:

GET test/_search
{
  "profile": true, 
  "query": {
    "query_string": {
      "query": "*PDF_Document_sp1*"
    }
  }
}

Since this is just a search query, there is no info in the logs.

Is there any specific reason or logic behind this?
Like, why are “pdf_document_sp1” and “2023” combined into a single token instead of two individual tokens?

I think that’s because the 1 and the 2023 on either side of the comma are digits, so they end up in one token, and the same goes for 106,91,500. A comma is commonly used when printing big numbers (as in 1,000), so the standard analyzer does not split on a comma that sits between digits; see more here. The standard analyzer works well for words separated by spaces. Another idea is to use an ingest pipeline to replace the commas with spaces when writing documents to the index; that would solve the problem, so you can try it if possible.
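
For example, a minimal sketch of such a pipeline using the gsub processor (the pipeline name comma_to_space is just a placeholder):

PUT _ingest/pipeline/comma_to_space
{
  "description": "Replace commas with spaces so the standard analyzer splits on them",
  "processors": [
    {
      "gsub": {
        "field": "message",
        "pattern": ",",
        "replacement": " "
      }
    }
  ]
}

Then index documents through the pipeline:

POST test/_doc?pipeline=comma_to_space
{
  "message": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500,Internal Server Error,HTTP_2,Https_2,Http_2,POST,v1"
}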

You can also change the analyzer (from standard to something else). Probably a custom analyzer with the pattern tokenizer and the lowercase token filter will do the job you’re looking for.
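
For instance, a minimal sketch of such an index definition (the index name test_comma and the analyzer/tokenizer names are just placeholders):

PUT test_comma
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "analyzer": {
        "comma_analyzer": {
          "type": "custom",
          "tokenizer": "comma_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "comma_analyzer"
      }
    }
  }
}

You can check the resulting tokens with _analyze:

POST test_comma/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "1c78-4bc1-af44-4e29b661822e,PDF_Document_ap1,PDF_Document_sp1,2023-06-15T05:30:22.667Z,106,91,500"
}

With this setup, PDF_Document_sp1 and 500 each become their own lowercased token, so the query_string searches above should return the document.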