Synonym strategy for fractional, decimal, and unit variations of the same measurement

Describe the issue:

We have an industrial tools catalog on OpenSearch where users search for dimensions in wildly inconsistent formats. For example, all of these should return the same results:

  • 1/4

  • 1/4"

  • 1/4in

  • 1/4 inch

  • 0.25

  • 0.25"

  • 0.25in

  • 0.25 inches

  • .250

  • 1/4 with curly/smart quotes (1/4“ or 1/4”) instead of straight quotes

The challenge is multi-layered:

  • Fractional to decimal equivalence (1/4 = 0.25 = .250)

  • Unit symbol normalization (", in, inch, inches, and curly quote variants like “ ” and '')

  • Combinations of the above

What’s the recommended approach here? We’ve considered:

  • Synonym filters — but the combinatorial explosion of fraction/decimal/unit combos seems hard to maintain

  • Custom char filters to normalize quotes and strip unit suffixes at index and query time

  • A normalization layer before the query hits OpenSearch that converts everything to a canonical form (e.g., always decimal, no units)

Has anyone tackled this at scale? Curious whether this is better solved at the analyzer level, with synonyms, or with a pre-processing step outside OpenSearch.

@carlosjimenezdev To achieve this, you would have to handle the arithmetic conversion at the application level, before indexing and searching, so that only decimal values are sent to OpenSearch (.25 instead of 1/4). OpenSearch can then use an analyzer to strip “inch”, “in”, and the quote characters, so that both indexing and searching use only the decimal notation. See the example below:

PUT /industrial_tools4
{
  "settings": {
    "analysis": {
      "char_filter": {
        "unit_strip": {
          "type": "pattern_replace",
          "pattern": "(?i)(?<=[0-9])\\s*(?:inch(?:es)?|in)\\b|[\u2033\u0022\u0027\u201c\u201d]",
          "replacement": ""
        },
        "leading_dot_fix": {
          "type": "pattern_replace",
          "pattern": "(?<![0-9])\\.([0-9]+)",
          "replacement": "0.$1"
        }
      },
      "analyzer": {
        "dimension_analyzer": {
          "type": "custom",
          "char_filter": ["unit_strip", "leading_dot_fix"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku":                 { "type": "keyword" },
      "product_name":        { "type": "text", "analyzer": "standard" },
      "product_type":        { "type": "text", "analyzer": "standard", "fields": { "keyword": { "type": "keyword" } } },
      "dimensions_catchall": { "type": "text", "analyzer": "dimension_analyzer" },
      "popularity_score":    { "type": "float" },
      "is_featured":         { "type": "boolean" },
      "sales_rank":          { "type": "float" },
      "cutter_diameter_str": {
        "type": "text",
        "analyzer": "dimension_analyzer",
        "copy_to": "dimensions_catchall",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "shank_diameter_str": {
        "type": "text",
        "analyzer": "dimension_analyzer",
        "copy_to": "dimensions_catchall",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "overall_length_str": {
        "type": "text",
        "analyzer": "dimension_analyzer",
        "copy_to": "dimensions_catchall",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "flute_length_str": {
        "type": "text",
        "analyzer": "dimension_analyzer",
        "copy_to": "dimensions_catchall",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": ".25" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25\"" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25in" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25 inch" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25 inches" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25\u2033" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25\u201c" }

GET /industrial_tools4/_analyze
{ "analyzer": "dimension_analyzer", "text": "0.25\u201d" }
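
For the application-level conversion mentioned above, something like the following could work. This is only a rough Python sketch, not a tested production implementation: to_canonical and the helper names are made up for illustration, and rounding to four decimal places is an assumption you may need to adjust for your catalog. It also strips trailing zeros, because as far as I can tell the analyzer above turns .250 into 0.250, which would not match a token of 0.25.

```python
import re
from fractions import Fraction

# Plain fractions ("1/4") and mixed numbers ("1 1/2", "1-1/2").
# The lookarounds skip things that look like dates or longer paths ("1/4/2024").
_FRACTION = re.compile(r"(?<![\d./])(?:(\d+)[ -])?(\d+)/(\d+)(?![\d./])")

# Decimals with an optional integer part (".250", "0.250", "1.50").
_DECIMAL = re.compile(r"(?<![\d.])(\d*)\.(\d+)(?![\d.])")


def _fraction_to_decimal(match: re.Match) -> str:
    whole, numerator, denominator = match.groups()
    if int(denominator) == 0:
        return match.group(0)  # leave "1/0" untouched rather than raise
    value = Fraction(int(numerator), int(denominator)) + int(whole or 0)
    # Assumption: four decimal places is enough precision for the catalog.
    return f"{float(value):.4f}".rstrip("0").rstrip(".")


def _canonical_decimal(match: re.Match) -> str:
    integer_part, fractional_part = match.groups()
    integer_part = integer_part or "0"          # ".25"  -> "0.25"
    fractional_part = fractional_part.rstrip("0")  # "0.250" -> "0.25"
    return f"{integer_part}.{fractional_part}" if fractional_part else integer_part


def to_canonical(text: str) -> str:
    """Rewrite fractions and decimals in `text` to one decimal form.

    Unit words and quote characters are left alone on purpose, since the
    analyzer on the OpenSearch side is meant to strip those.
    """
    text = _FRACTION.sub(_fraction_to_decimal, text)
    return _DECIMAL.sub(_canonical_decimal, text)


if __name__ == "__main__":
    for raw in ['1/4', '1/4"', "1/4 inch", ".250", "1 1/2in"]:
        print(repr(raw), "->", repr(to_canonical(raw)))
```

You would call to_canonical on the dimension fields before indexing and on the user's query string before searching, so both sides agree on 0.25. One caveat: a leading number followed by a space is treated as the whole part of a mixed number, so free text like "size 10 1/4" becomes "size 10.25", which may or may not be what you want.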

Hope this helps

@Anthony Thanks for your reply. We have a related issue with spacing in text.

For example, searching for “endmill” or “end mill” should return the same results. The product name is sometimes “endmill” and sometimes “end mill”. Either way, both search terms should match both product names.

Is there a way to accomplish that other than setting up synonyms?