Avoid "_analyze has exceeded the allowed maximum of [10000]" by using a chunking pipeline?

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

"version" : {
  "distribution" : "opensearch",
  "number" : "2.15.0",
  "build_type" : "rpm",
  "build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
  "build_date" : "2024-06-20T03:27:31.591886152Z",
  "build_snapshot" : false,
  "lucene_version" : "9.10.0",
  "minimum_wire_compatibility_version" : "7.10.0",
  "minimum_index_compatibility_version" : "7.0.0"
},

Describe the issue:

When indexing documents using the standard analyzer on my text fields, sometimes the fulltext is too large and I get the expected exception:

‘analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.’

I don’t want to increase that limit though, since documents can be of unknown length. Instead I thought I could chunk the fulltext field into fulltext_chunks, where each chunk contains a maximum of 10,000 tokens and can be analyzed, and analyze only these chunks instead of the large field by setting index => false on the large field. However, that does not seem to work as I expected: the error still occurs.
Is the analyzer limit not per-field but for the whole document that gets indexed? Or do I have another misunderstanding of how it works?
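
As far as I understand, the limit can also be hit directly via the _analyze API against a single mapped field, so one way to check a field in isolation would be something like this (the index name is a placeholder and the text is shortened):

GET /my-index/_analyze
{
  "field": "fulltext_chunks",
  "text": "<the extracted fulltext of one large document>"
}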

Configuration:

I currently use this mapping for my index:

"mappings": {
  "properties": {
    "author": {
      "type": "keyword"
    },
    "date_change": {
      "type": "date"
    },
    "file_size": {
      "type": "long"
    },
    "file_type": {
      "type": "keyword"
    },
    "fulltext": {
      "type": "text",
      "index": false // This should prevent the analyzer from running over the too large field - correct?
    },
    "fulltext_chunks": {
      "type": "text"
    },
    "language": {
      "type": "keyword"
    },
    "title": {
      "type": "text"
    },
    "url": {
      "type": "keyword"
    },
    "vector_embeddings": {
      "type": "nested",
      "properties": {
        "chunk_text": {
          "type": "text"
        },
        "knn": {
          "type": "knn_vector",
          "dimension": 768,
          "method": {
            "engine": "lucene",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        }
      }
    }
  }
}

Together with the following ingest pipeline:

{
  "description": "Pipeline for generating embeddings from fulltext",
  "processors": [
    {
      "text_chunking": {
        "field_map": {
          "fulltext": "fulltext_chunks"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10000,
            "overlap_rate": 0.1
          }
        }
      }
    },
    {
      "text_chunking": {
        "field_map": {
          "fulltext_chunks": "tmp_chunks"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2
          }
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "2ggnPJQBN3iZDHu2iDI2",
        "field_map": {
          "tmp_chunks": "tmp_knn"
        }
      }
    },
    {
      "script": {
        "source": "if (ctx.tmp_chunks != null && ctx.tmp_knn != null) { ctx.vector_embeddings = []; for (int i = 0; i < ctx.tmp_chunks.size(); i++) { ctx.vector_embeddings.add(['chunk_text': ctx.tmp_chunks[i], 'knn': ctx.tmp_knn[i].knn]); } }"
      }
    },
    {
      "remove": {
        "ignore_failure": true,
        "field": ["tmp_chunks", "tmp_knn"]
      }
    },
    {
      "remove": {
        "ignore_failure": true,
        "field": "fulltext"
      }
    }
  ]
}
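
For completeness, I register the pipeline and attach it to the index roughly like this (pipeline and index names are placeholders):

PUT /_ingest/pipeline/fulltext-embedding-pipeline
<the pipeline definition shown above>

PUT /my-index/_settings
{
  "index.default_pipeline": "fulltext-embedding-pipeline"
}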

@devmoreng Have you tested text chunking for the fulltext_chunks field with a value lower than 10k? Did you notice any change? Does the error message change?

The default value of token_limit is 384 so that output passages don’t exceed the token limit constraint of the downstream text embedding models. For OpenSearch-supported pretrained models, like msmarco-distilbert-base-tas-b and opensearch-neural-sparse-encoding-v1, the input token limit is 512. The standard tokenizer tokenizes text into words. According to OpenAI, 1 token equals approximately 0.75 words of English text. The default token limit is calculated as 512 * 0.75 = 384.
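
In other words, a text_chunking processor that omits token_limit entirely, like this minimal sketch reusing your field names (all fixed_token_length parameters are optional here, as far as I know), already caps each chunk at 384 tokens:

{
  "text_chunking": {
    "field_map": {
      "fulltext": "fulltext_chunks"
    },
    "algorithm": {
      "fixed_token_length": {}
    }
  }
}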

Yes, I tried that and also experimented with the _simulate endpoint. Sadly, there is no change in the result:

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_state_exception",
            "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
          }
        ],
        "type": "illegal_state_exception",
        "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.",
        "caused_by": {
          "type": "illegal_state_exception",
          "reason": "The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
        }
      }
    }
  ]
}
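
The simulate call itself looked roughly like this (pipeline name and document content are placeholders):

POST /_ingest/pipeline/fulltext-embedding-pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-index",
      "_source": {
        "fulltext": "<long extracted text, far more than 10,000 tokens>"
      }
    }
  ]
}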

Something else just came to my mind: could it be that the analyzer treats array fields of text as a single text to analyze, and therefore my chunking does not have any effect?

Sadly, the error message does not specify which field the analyzer fails on. But since I set fulltext to not be indexed and even remove it in my pipeline, I believe it must be fulltext_chunks.
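
One way I could probe the array question would be to pass several values for the same field to _analyze and see whether the limit is applied per value or to the combined token count, e.g. (index name and texts are placeholders):

GET /my-index/_analyze
{
  "field": "fulltext_chunks",
  "text": [
    "first chunk, just under 10,000 tokens ...",
    "second chunk, just under 10,000 tokens ..."
  ]
}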

@devmoreng I suspect that could be the case. The token limit is present all the time and can’t be disabled. As per the documentation, the default is 384, so chunking happens anyway.

I am currently trying to limit the number of tokens analyzed with a custom analyzer instead of text chunking, but that does not seem to have any effect at all. Can anybody point me in the right direction on how to fix this token limit problem during analysis?

Now I am trying the following index settings & mapping, using a custom analyzer with a token limit of 10,000 on all text fields.

Index settings

    "analysis": {
      "filter": {
        "limit_token_count": {
          "type": "limit",
          "max_token_count": "10000"
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "filter": [
            "lowercase",
            "limit_token_count"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    },

Index mapping

  "mappings": {
    "properties": {
      "author": {
        "type": "keyword"
      },
      "date_change": {
        "type": "date"
      },
      "file_size": {
        "type": "long"
      },
      "file_type": {
        "type": "keyword"
      },
      "fulltext": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "is_public": {
        "type": "boolean"
      },
      "keywords": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "language": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "url": {
        "type": "keyword"
      },
      "vector_embeddings": {
        "type": "nested",
        "properties": {
          "chunk_text": {
            "type": "text",
            "analyzer": "custom_analyzer"
          },
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "engine": "lucene",
              "space_type": "l2",
              "name": "hnsw",
              "parameters": {}
            }
          }
        }
      }
    }
  }

I would think that with this custom analyzer being applied to all text fields, the error should go away. But it doesn’t.
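
As a next step I would sanity-check the custom analyzer on its own with the _analyze API, just to confirm that the limit filter is actually picked up; the index name is a placeholder, and this of course only exercises the analyzer, not the ingest pipeline:

GET /my-index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Some sample text to check that the lowercase and limit_token_count filters are applied."
}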