Customising ICU tokenizer to preserve specific terms

A customer has a use case where they have two lines of products, Super and Super+.
I believe the icu_tokenizer is the cause: when indexing “Super+”, the plus sign is stripped and the term is indexed as just “Super”. Experimenting with the /_analyze endpoint, I saw that switching to the whitespace tokenizer preserved the +. However, we must stick with the icu_tokenizer to support documents in a wide range of languages.
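For anyone who wants to reproduce this, here is roughly what I ran against /_analyze (the exact token output may vary by plugin version, so treat this as a sketch rather than a verified transcript):

```json
POST /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "Super+"
}
```

This returns a single token without the plus sign, whereas the same request with `"tokenizer": "whitespace"` keeps `Super+` intact.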

I’m seeking discussion/suggestions, no matter how simple or outlandish, to get this use case working :slight_smile:.

Our index is currently using this configuration on OpenSearch 2.11:

"analysis": {
  "analyzer": {
    "english-icu": {
      "char_filter": [
        "mapping_filter",
        "icu_normalizer"
      ],
      "filter": [
        "english_stemmer_pos",
        "lowercase",
        "icu_folding",
        "stop",
        "english_stemmer"
      ],
      "tokenizer": "icu_tokenizer",
      "type": "custom"
    }
  },
  "char_filter": {
    "mapping_filter": {
      "mappings": [
        "™ => "
      ],
      "type": "mapping"
    }
  },
  "filter": {
    "english_stemmer": {
      "language": "english",
      "type": "stemmer"
    },
    "english_stemmer_pos": {
      "language": "possessive_english",
      "type": "stemmer"
    }
  },
  "normalizer": {
    "case_insensitive": {
      "filter": [
        "lowercase"
      ],
      "type": "custom"
    }
  }
}

@Issun I did some testing and I couldn’t find any solution other than transforming the + sign into a word, e.g. ‘_plus’.
As you’ve already noticed, the icu_tokenizer drops the special character, and this is not configurable in OpenSearch as far as I’m aware.
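Since the char filter runs before the tokenizer, one way to sketch this is to extend the existing `mapping_filter` from your config (the `+ => _plus` mapping is my suggestion, not something you already have):

```json
"char_filter": {
  "mapping_filter": {
    "type": "mapping",
    "mappings": [
      "™ => ",
      "+ => _plus"
    ]
  }
}
```

“Super+” then becomes “Super_plus” before tokenization, and under the ICU/UAX#29 word-break rules an underscore should keep that as a single token (worth verifying with /_analyze on your version). Two caveats: this rewrites every + in your documents (e.g. “C++” would become “C_plus_plus”), and the same analyzer must be applied at both index and search time so that a query for “Super+” is transformed the same way.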