Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
I am using this as part of an API endpoint to filter results by payload.source.
Describe the issue:
body = {
    "from": payload.start,
    "size": payload.size,
    "query": {
        "bool": {}
    },
    "sort": [
        {
            "__created__": {
                "order": "asc",
                "missing": "_last",
                "unmapped_type": "string"
            }
        }
    ]
}
match_clauses = []
# IMPORTANT SECTION
# Right now this gives no partial matching: the query "harry-pott" returns no results,
# while "harry potter" and "harry-potter" each return both documents
# ("harry potter" and "harry-potter").
if payload.source:
    match_clauses.append({
        "query_string": {
            "default_field": "source",
            "query": f'"*{payload.source.lower()}*"'
        }
    })
#####################
if match_clauses:
    body["query"]["bool"]["should"] = match_clauses
else:
    body["query"]["bool"]["must"] = [
        {"match_all": {}}
    ]
response = opensearch.client.search(body=body, index=vectorstore_id)
I want to query the field source. There are currently 2 different sources, “Harry Potter.pdf” and “Harry-Potter.pdf”. I want to be able to search with “Ha”/“ha” and get both, with “Harry Potter”/“harry potter” and get only the first, and with “Harry-Potter”/“harry-potter” and get only the second. In other words, the search should be case insensitive and support partial matching across more than one word (a phrase, with the hyphen and the space taken into account), including half a word or one and a half words.
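To make the requirement concrete, here is a plain-Python sketch of the matching behaviour described above, with no OpenSearch involved: a case-insensitive substring test against the whole file name. The names `SOURCES`, `source_matches`, and `search` are hypothetical and only illustrate the expected results.

```python
# Reference behaviour only: case-insensitive substring match on the full source name.
SOURCES = ["Harry Potter.pdf", "Harry-Potter.pdf"]


def source_matches(query: str, source: str) -> bool:
    """Return True if `query` occurs anywhere in `source`, ignoring case."""
    return query.lower() in source.lower()


def search(query: str) -> list[str]:
    """Return every known source whose name contains `query`."""
    return [s for s in SOURCES if source_matches(query, s)]


print(search("Ha"))
print(search("harry potter"))
print(search("harry-pott"))
```

Spaces and hyphens are treated as ordinary characters here, which is what distinguishes the two file names.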
# With this, the query "harry-potter" gives no results and the query "harry potter" returns
# both "harry potter" and "harry-potter". Partial matching works here.
# escaped_source = payload.source.lower().replace('-', '\\-').replace('+', '\\+').replace('&&', '\\&&').replace('||', '\\||').replace('!', '\\!')
# match_clauses.append({
#     "query_string": {
#         "default_field": "source",
#         "query": f"*{escaped_source}*"
#     }
# })
# This has the same behaviour as above
# match_clauses.append({
#     "query_string": {
#         "default_field": "source",
#         "query": f'"*{payload.source.lower()}*"'
#     }
# })
# This only works for single-word queries such as "harry" or "potter" on their own, each of which gives both results
# if payload.source:
#     match_clauses.append({
#         "wildcard": {
#             "source": f"*{payload.source.lower()}*"
#         }
#     })
# This is the closest so far: partial matching works, and the queries "harry potter" and
# "harry-potter" both give the 2 results. I still need to make it specific, so that
# "harry potter" only gives "Harry Potter In" and "harry-potter" only gives "Harry-Potter-In".
# Also, partial matching works for "harry potte", but "harry-potte" gives no results.
if payload.source:
    match_clauses.append({
        "match_phrase": {
            "source": payload.source.lower()  # Exact match for phrases like "harry potter"
        }
    })
    match_clauses.append({
        "query_string": {
            "default_field": "source",
            "query": f"*{payload.source.lower()}*"
        }
    })
if match_clauses:
    body["query"]["bool"]["should"] = match_clauses
None of these worked.
Latest update:
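if payload.source:
    # Lowercase inside the check so a missing source does not raise,
    # and avoid shadowing the built-in input().
    source = payload.source.lower()
    body["query"]["bool"]["must"] = {
        "wildcard": {
            "source": f"*{source}*"
        }
    }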
This correctly matches the queries “harry_pott” and “harry_potter” to “Harry_Potter.pdf”, but “harry-pott” and “harry pott” give no results.
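One possible explanation, assuming source is mapped as a text field with the default standard analyzer: wildcard is a term-level query, so the pattern is compared against individual indexed tokens rather than the whole string. The standard analyzer splits “Harry-Potter.pdf” and “Harry Potter.pdf” at the hyphen and the space, so no single token contains “harry-pott” or “harry pott”, whereas “Harry_Potter.pdf” likely stays one token because an underscore does not end a word. A minimal sketch of a workaround is below. It assumes an un-analyzed `source.keyword` subfield exists (default dynamic mapping for strings usually creates one, but the mapping should be checked) and that the cluster version supports the `case_insensitive` option on wildcard queries. The helper name `build_source_query` is hypothetical.

```python
def build_source_query(source, field="source.keyword"):
    """Build the `query` part of a search body for substring matching.

    Assumption: `field` is a keyword (un-analyzed) field, so spaces and
    hyphens are kept as part of the single indexed term.
    """
    if not source:
        # No filter supplied: match every document.
        return {"bool": {"must": [{"match_all": {}}]}}

    # Escape wildcard metacharacters so user input is matched literally.
    escaped = (
        source.replace("\\", "\\\\")
        .replace("*", "\\*")
        .replace("?", "\\?")
    )
    return {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        field: {
                            "value": f"*{escaped}*",
                            # Keyword fields keep the original case, so ask
                            # the query itself to ignore case.
                            "case_insensitive": True,
                        }
                    }
                }
            ]
        }
    }
```

It would be used as `body["query"] = build_source_query(payload.source)` before calling `opensearch.client.search(...)`. Leading wildcards can be slow on large indices, so this is a sketch to test the behaviour rather than a tuned solution.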