Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Describe the issue:
In several OpenSearch projects we’ve seen stemming introduce noise early in the pipeline, especially in multilingual setups.
For example:
- “organization” → “organ”
- “news” → “new”
- “united” → “unit”
These kinds of transformations collapse unrelated terms into the same form, which affects matching quality and introduces noise into the index.
In practice, this often leads to:
- less precise retrieval
- heavier reliance on query-side workarounds (n-grams, fuzzy matching, etc.)
- inconsistent behavior across languages
We’ve been exploring an alternative approach using a lightweight plugin that adds proper lemmatization and decompounding before indexing. It’s simple to integrate and doesn’t require changes to query logic.
So far, we’re seeing improvements in:
- lexical matching
- index quality
- consistency (also when combined with semantic search)
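To make the collapse concrete, here is a minimal Python sketch that groups a few terms by their normalized form. Everything in it is an assumption for illustration: `toy_stem` is a handful of suffix rules that only mimics the examples above (it is not the Porter algorithm or any OpenSearch filter), and `LEMMA_LEXICON` is a hypothetical mini-dictionary standing in for a real lemmatizer, not the plugin mentioned above.

```python
from collections import defaultdict

# Toy suffix rules, tried in order. NOT the Porter algorithm, just enough
# to mimic the transformations listed in this post.
SUFFIX_RULES = [("ization", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-lexicon standing in for a real lemmatizer's dictionary.
LEMMA_LEXICON = {"organizations": "organization", "units": "unit"}


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: len(term) - len(suffix)] + replacement
    return term


def toy_lemmatize(term):
    """Map a term to its dictionary form, or leave it unchanged if unknown."""
    return LEMMA_LEXICON.get(term, term)


def group_by(normalize, terms):
    """Group terms by the form that would end up in the index."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize(term)].append(term)
    return dict(groups)


TERMS = ["organization", "organ", "news", "new",
         "united", "unit", "organizations", "units"]

stemmed = group_by(toy_stem, TERMS)
lemmatized = group_by(toy_lemmatize, TERMS)

for label, groups in (("stemmed", stemmed), ("lemmatized", lemmatized)):
    print(label)
    for form, members in groups.items():
        print(f"  {form}: {members}")
```

With these toy rules, the stem-based grouping puts “organization” and “organ” into the same bucket, while the lexicon-based grouping keeps them apart and still merges “organizations” with “organization”.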
More detailed examples here:
https://www.linkedin.com/pulse/how-increase-search-relevance-elasticsearch-better-text-tony-chac%C3%B3n-arkic
Curious how others are handling this kind of issue.
Configuration:
- OpenSearch (standard analyzers with stemming)
- Multilingual datasets
- Combination of lexical and semantic search in some cases
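
For anyone who wants to check what their own cluster does with these terms, a rough Python sketch using the `_analyze` API follows. The assumptions here are mine: a local, unsecured development cluster at `http://localhost:9200`, and an ad-hoc chain of the `standard` tokenizer with the `lowercase` and `porter_stem` filters. Swap in whatever stemmer your analyzers actually use.

```python
import json
import urllib.request

# Assumption: local development cluster without auth; adjust URL and security.
OPENSEARCH_URL = "http://localhost:9200"


def build_analyze_body(text, filters=("lowercase", "porter_stem")):
    """Build an _analyze request body for an ad-hoc analysis chain."""
    return {"tokenizer": "standard", "filter": list(filters), "text": text}


def extract_tokens(response):
    """Pull the emitted token strings out of an _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze(text, base_url=OPENSEARCH_URL):
    """Run the text through the chain and return the resulting tokens."""
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(build_analyze_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))
```

Calling `analyze("organization news united")` against a running cluster returns the tokens that would be indexed for that text, which can then be compared with the examples listed under the issue description.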
Relevant Logs or Screenshots: