Not all Hindi matches are being returned when using custom analyzer

aakashbashyal21 · October 25, 2024, 7:33am

I am working on a FastAPI project that uses OpenSearch with a custom index to support both Hindi and English text. I created a custom analyzer for Hindi, but I am encountering an issue where search results seem incomplete.

Index Creation:

I created an index with a custom analyzer for the chapter_title field like this:

def create_index():
    """Create an OpenSearch index with language-specific analyzers for Hindi and English."""
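    # Pass the index name as a keyword argument; recent opensearch-py releases expect this.
    if not opensearch_client.indices.exists(index=INDEX_NAME):
        opensearch_client.indices.create(
            index=INDEX_NAME,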
            body={
                "settings": {
                    "analysis": {
                        "analyzer": {
                            "custom_chapter_title_analyzer": {
                                "type": "custom",
                                "tokenizer": "standard",
                                "filter": [
                                    "hindi_normalization",
                                    "indic_normalization",
                                ],
                            }
                        }
                    }
                },
                "mappings": {
                    "properties": {
                        "book_title": {"type": "text", "analyzer": "english"},
                        "chapter_title": {
                            "type": "text",
                            "analyzer": "custom_chapter_title_analyzer",
                        },
                    }
                },
            },
        )
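To see which tokens the analyzer actually produces for a given title, the analyze API can be called against the index. The following is a minimal sketch, assuming `opensearch_client` is an opensearch-py client; `build_analyze_body` and `analyzed_tokens` are hypothetical helper names, and nothing here has been verified against a live cluster.

```python
def build_analyze_body(text, analyzer="custom_chapter_title_analyzer"):
    """Build the request body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def analyzed_tokens(client, index_name, text):
    """Return the token strings the index's analyzer emits for `text`.

    `client` is assumed to be an opensearch-py OpenSearch instance.
    """
    response = client.indices.analyze(index=index_name, body=build_analyze_body(text))
    return [token["token"] for token in response["tokens"]]


# Example usage (needs a running cluster, so it is left commented out):
# print(analyzed_tokens(opensearch_client, INDEX_NAME, "संगठनः अनूठा और क्रांतिकारी"))
# print(analyzed_tokens(opensearch_client, INDEX_NAME, "संगठन"))
```

Comparing the tokens for a title with the tokens for the search term should show whether the two ever produce the same term.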

Search Implementation:

For searching, I am using a multi_match query on both book_title and chapter_title fields:

@app.get("/search/")
def search_books(q: str):
    # Log the incoming query for debugging.
    print("\nSearch query:")
    print(q)

    # Search both fields; each field is analyzed with its own mapped analyzer.
    query = {
        "size": 5,
        "query": {
            "multi_match": {
                "query": q,
                "fields": ["book_title", "chapter_title"],
            }
        },
    }
    response = opensearch_client.search_book(query)
    print("\nSearch results:")
    print(response)
    return response

Issue:

When searching for the term “संगठन”, I only get results for one of the following chapter_title values:

संगठनः अनूठा और क्रांतिकारी
संगठन और धर्म

I only get a match for “संगठन और धर्म” but not for “संगठनः अनूठा और क्रांतिकारी”.

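I suspect this is because of the trailing “ः” (the Devanagari visarga, which looks like a colon but is a different character). It seems to stay attached to the word during tokenization, so “संगठनः” ends up as a different token from “संगठन”, but I’m unsure how to resolve this.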
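The difference between the two characters can be checked locally. This small sketch uses only Python's standard `unicodedata` module; it says nothing about OpenSearch itself, but it shows that “ः” is a combining mark rather than punctuation, which would be consistent with a Unicode-aware tokenizer keeping it attached to the preceding word.

```python
import unicodedata

COLON = ":"          # U+003A, ASCII punctuation
VISARGA = "\u0903"   # "ः", the sign at the end of the first title's first word

# Print code point, Unicode name, and general category for each character.
for ch in (COLON, VISARGA):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  category={unicodedata.category(ch)}")

indexed_word = "संगठन" + VISARGA   # the word as it appears in the title
query_word = "संगठन"               # the word as typed in the search box

# The two strings differ by one code point, so an exact term lookup cannot match.
print(indexed_word == query_word)
print(indexed_word.startswith(query_word))
```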

Likewise:

When I search for: केंद्रः

I get these as the top results:

ध्यान-केंद्रः
जीवन का गुह्यतम केंद्रः मृत्यु

But when I search for only: केंद्र

I don’t get the results above; I only get entries that contain exactly केंद्र.

Question:

How can I adjust my custom analyzer to ensure that search queries return all results containing the Hindi term “संगठन”, regardless of the presence of punctuation or other special characters like “ः”?

Is this even possible?
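One possible direction, offered as an untested sketch rather than a confirmed fix: add a `pattern_replace` character filter that removes the visarga before tokenization, so that indexed titles and queries both lose it. The filter name `strip_visarga` and the helper `simulate_strip_visarga` are hypothetical. The helper only mimics the character filter locally with Python's `re` and whitespace splitting; it is not the real analyzer. Existing documents would need to be reindexed, and words that legitimately contain a visarga (for example दुःख) would be altered as well.

```python
import re

VISARGA = "\u0903"  # "ः"

# Assumed replacement for the "analysis" section of the index settings.
analysis_settings = {
    "char_filter": {
        "strip_visarga": {  # hypothetical filter name
            "type": "pattern_replace",
            "pattern": VISARGA,
            "replacement": "",
        }
    },
    "analyzer": {
        "custom_chapter_title_analyzer": {
            "type": "custom",
            "char_filter": ["strip_visarga"],
            "tokenizer": "standard",
            "filter": ["hindi_normalization", "indic_normalization"],
        }
    },
}


def simulate_strip_visarga(text):
    """Local approximation only: apply the same pattern, then split on whitespace."""
    pattern = analysis_settings["char_filter"]["strip_visarga"]["pattern"]
    return re.sub(pattern, "", text).split()


# Both titles from the question now yield the bare word as their first token.
for title in ("संगठन" + VISARGA + " अनूठा और क्रांतिकारी", "संगठन और धर्म"):
    print(simulate_strip_visarga(title))
```

Because the same analyzer would run at index time and at search time, a query typed with or without the visarga would be reduced to the same term.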
