I am working on a FastAPI project that uses OpenSearch with a custom index to support both Hindi and English text. I created a custom analyzer for Hindi, but I am encountering an issue where search results seem incomplete.
Index Creation:
I created an index with a custom analyzer for the chapter_title field like this:
def create_index():
    """Create an OpenSearch index with language-specific analyzers for Hindi and English."""
    if not opensearch_client.indices.exists(INDEX_NAME):
        opensearch_client.indices.create(
            INDEX_NAME,
            body={
                "settings": {
                    "analysis": {
                        "analyzer": {
                            "custom_chapter_title_analyzer": {
                                "type": "custom",
                                "tokenizer": "standard",
                                "filter": [
                                    "hindi_normalization",
                                    "indic_normalization",
                                ],
                            }
                        }
                    }
                },
                "mappings": {
                    "properties": {
                        "book_title": {"type": "text", "analyzer": "english"},
                        "chapter_title": {
                            "type": "text",
                            "analyzer": "custom_chapter_title_analyzer",
                        },
                    }
                },
            },
        )
Search Implementation:
For searching, I am using a multi_match query on both book_title and chapter_title fields:
@app.get("/search/")
def search_books(q: str):
    print("\nSearch query:")
    print(q)
    query = {
        "size": 5,
        "query": {
            "multi_match": {
                "query": q,
                "fields": ["book_title", "chapter_title"],
            }
        },
    }
    response = opensearch_client.search_book(query)
    print("\nSearch results:")
    print(response)
    return response
Issue:
When searching for the term “संगठन”, I only get results for one of the following chapter_title values:
- संगठनः अनूठा और क्रांतिकारी
- संगठन और धर्म
I only get a match for “संगठन और धर्म” but not for “संगठनः अनूठा और क्रांतिकारी”.
I suspect this might be due to the “ः” character (the Devanagari visarga, which looks like a colon but is not the ASCII “:”) being treated differently during tokenization, but I’m unsure how to resolve this.
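To check that suspicion, here is a minimal sketch using only Python's standard unicodedata module. It compares the visarga with the ASCII colon. The helper name describe is my own, and the interpretation in the note afterwards is my understanding of the standard tokenizer, not something I have confirmed against the index.

```python
import unicodedata

VISARGA = "\u0903"  # the "ः" that ends the first word of the unmatched title


def describe(ch: str) -> tuple[str, str]:
    """Return the Unicode name and general category of a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)


visarga_info = describe(VISARGA)
colon_info = describe(":")

# The word as it appears in the title is the plain term plus the visarga,
# so it is a different string from the search term.
plain_term = "संगठन"
title_word = plain_term + VISARGA

print(visarga_info)  # name and category of the visarga
print(colon_info)    # name and category of the ASCII colon
print(title_word == plain_term)
```

The visarga is a spacing combining mark (category Mc), while the colon is punctuation (category Po). As far as I understand, the standard tokenizer follows Unicode word-boundary rules, under which combining marks stay attached to the preceding letters. If that is right, the title is indexed with a token that still ends in “ः”, which would explain why the plain term does not match it.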
Likewise:
When I search for: केंद्रः
I get these as the top results:
- ध्यान-केंद्रः
- जीवन का गुह्यतम केंद्रः मृत्यु
But when the search query is only: केंद्र
I don’t get the results above; I only get entries that match केंद्र itself.
Question:
How can I adjust my custom analyzer to ensure that search queries return all results containing the Hindi term “संगठन”, regardless of the presence of punctuation or other special characters like “ः”?
Is this even possible?
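One direction I am considering, as an untested sketch: add a mapping character filter that drops the visarga before tokenization, so that the indexed text and the query are normalized the same way. The filter name strip_visarga and the helper build_analysis_settings are hypothetical, and I am assuming the mapping char filter accepts an empty replacement. The token filters are kept in the same order as in my current analyzer.

```python
VISARGA = "\u0903"  # Devanagari sign visarga, "ः"


def build_analysis_settings() -> dict:
    """Build the "settings" part of the index body with a visarga-stripping char filter."""
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical filter name; assumes an empty replacement is allowed.
                "strip_visarga": {
                    "type": "mapping",
                    "mappings": [f"{VISARGA} => "],
                }
            },
            "analyzer": {
                "custom_chapter_title_analyzer": {
                    "type": "custom",
                    "char_filter": ["strip_visarga"],
                    "tokenizer": "standard",
                    # Same token filters, same order, as in the current index.
                    "filter": ["hindi_normalization", "indic_normalization"],
                }
            },
        }
    }


settings = build_analysis_settings()
print(settings["analysis"]["char_filter"]["strip_visarga"]["mappings"])
```

If this approach is valid, the returned dictionary would replace the "settings" value passed to indices.create. Since analyzers are applied at index time, I assume the index would also have to be recreated and the documents reindexed for the change to take effect.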