Improving OpenSearch relevance with better normalization (beyond stemming)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

Describe the issue:

In several OpenSearch projects we’ve seen stemming introduce noise early in the pipeline, especially in multilingual setups.

For example:

  • “organization” → “organ”

  • “news” → “new”

  • “united” → “unit”

These transformations collapse unrelated terms into the same form (for example, “organization” and “organ” end up as the same index term), which lowers matching precision and adds noise to the index.
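To check how a given analysis chain treats these words, one option is the `_analyze` API. The sketch below only builds the request body and prints it; it makes no network call. It assumes the `standard` tokenizer followed by the `lowercase` and `porter_stem` token filters, which may not match the chain your index actually uses. The helper name `build_analyze_request` is made up for illustration.

```python
import json


def build_analyze_request(text, filters=("lowercase", "porter_stem")):
    """Build a request body for POST /_analyze.

    The filter chain is an assumption; swap in the filters from your own
    analyzer definition to see what your index really produces.
    """
    return {
        "tokenizer": "standard",
        "filter": list(filters),
        "text": text,
    }


# The three words from the examples above, sent as one text.
body = build_analyze_request("organization news united")

print("POST /_analyze")
print(json.dumps(body, indent=2))
```

Pasting the printed body into Dev Tools (or sending it with any HTTP client) returns the tokens the chain would index, so you can compare them against the examples above.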

In practice, this often leads to:

  • less precise retrieval

  • heavier reliance on query-side workarounds (n-grams, fuzzy matching, etc.)

  • inconsistent behavior across languages

We’ve been exploring an alternative approach using a lightweight plugin that adds proper lemmatization and decompounding before indexing. It’s simple to integrate and doesn’t require changes to query logic.
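To make the difference concrete, here is a toy sketch in plain Python. It is not the plugin and not a real stemmer or lemmatizer: the `STEM` table hard-codes the stems from the examples in this post, and the `LEMMA` table is hand-written for illustration. All names (`STEM`, `LEMMA`, `build_index`, `search`, `DOCS`) are hypothetical.

```python
from collections import defaultdict

# Stems copied from the examples in this post (hard-coded, not computed).
STEM = {
    "organization": "organ", "organ": "organ",
    "news": "new", "new": "new",
}

# Hand-written lemma table (an assumption): inflected forms map to their
# dictionary form, while unrelated words stay distinct.
LEMMA = {
    "organizations": "organization", "organization": "organization",
    "organs": "organ", "organ": "organ",
    "news": "news", "new": "new",
}


def build_index(docs, normalize):
    """Build a tiny inverted index: normalized term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Tokens missing from the table are indexed unchanged.
            index[normalize.get(token, token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Look up a single-word query using the same normalization."""
    token = query.lower()
    return sorted(index.get(normalize.get(token, token), set()))


DOCS = {
    1: "organ donation",
    2: "organization chart",
    3: "news feed",
    4: "new feed",
}

stem_index = build_index(DOCS, STEM)
lemma_index = build_index(DOCS, LEMMA)

print(search(stem_index, "organ", STEM))    # [1, 2]
print(search(lemma_index, "organ", LEMMA))  # [1]
print(search(stem_index, "news", STEM))     # [3, 4]
print(search(lemma_index, "news", LEMMA))   # [3]
```

With the stem table, a query for "organ" also returns the "organization chart" document; with the lemma table the two stay separate, and the query code itself is unchanged.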

So far, we’re seeing improvements in:

  • lexical matching

  • index quality

  • consistency (also when combined with semantic search)

More detailed examples here:
https://www.linkedin.com/pulse/how-increase-search-relevance-elasticsearch-better-text-tony-chac%C3%B3n-arkic

Curious how others are handling this kind of issue.

Configuration:

  1. OpenSearch (built-in analyzers with stemming)
  2. Multilingual datasets
  3. Combination of lexical and semantic search in some cases

Relevant Logs or Screenshots:

One thing we’re seeing is that many teams compensate with n-grams or fuzzy matching, or move to semantic search, but the normalization layer underneath is still noisy.

Has anyone tried improving normalization before indexing instead?