A customer has a use case where they have two lines of products, Super and Super+. I believe it is due to the icu_tokenizer that indexing “Super+” strips the plus sign, so it gets indexed as just “Super”. Experimenting with the /_analyze endpoint, I saw that switching to the whitespace tokenizer left the + intact. But we must stick with the icu_tokenizer to support documents in a wide range of languages.
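For anyone who wants to reproduce this, a comparison along these lines shows the difference (using the standalone /_analyze API, no index required):

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "Super+"
}
→ one token: "Super" (the + is dropped)

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Super+"
}
→ one token: "Super+"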
I’m seeking discussion/suggestions, no matter how simple or outlandish, to try and get this use case to work.
Our index is currently using this configuration on OpenSearch 2.11:
"analysis": {
"analyzer": {
"english-icu": {
"char_filter": [
"mapping_filter",
"icu_normalizer"
],
"filter": [
"english_stemmer_pos",
"lowercase",
"icu_folding",
"stop",
"english_stemmer"
],
"tokenizer": "icu_tokenizer",
"type": "custom"
}
},
"char_filter": {
"mapping_filter": {
"mappings": [
"™ => "
],
"type": "mapping"
}
},
"filter": {
"english_stemmer": {
"language": "english",
"type": "stemmer"
},
"english_stemmer_pos": {
"language": "possessive_english",
"type": "stemmer"
}
},
"normalizer": {
"case_insensitive": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
}
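For completeness, the full custom analyzer shows the same loss when run against the index (the index name “products” below is just a placeholder for ours):

GET /products/_analyze
{
  "analyzer": "english-icu",
  "text": "Super+"
}
→ one token: "super" (the + is already gone after tokenization, and the filter chain lowercases it)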