Avoid "_analyze has exceeded the allowed maximum of [10000]" by using a chunking pipeline?

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

"version" : {
  "distribution" : "opensearch",
  "number" : "2.15.0",
  "build_type" : "rpm",
  "build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
  "build_date" : "2024-06-20T03:27:31.591886152Z",
  "build_snapshot" : false,
  "lucene_version" : "9.10.0",
  "minimum_wire_compatibility_version" : "7.10.0",
  "minimum_index_compatibility_version" : "7.0.0"
},

Describe the issue:

When indexing documents using the standard analyzer on my text fields, sometimes the fulltext is too large and I get the expected exception:

‘analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.’

I don’t want to increase that limit though, since documents can be of unknown length. Instead I thought I could chunk the fulltext field into fulltext_chunks, where each chunk contains a maximum of 10,000 tokens and can be analyzed, and analyze only these chunks instead of the large field by setting index => false on the large field. However, that does not seem to work as I expected: the error still occurs.
Is the analyzer limit not per-field but for the whole document that gets indexed? Or do I have another misunderstanding of how it works?
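
As far as I understand, the limit can also be hit directly via the _analyze API against a single mapped field, so one way to check a field in isolation would be something like this (the index name is a placeholder and the text is shortened):

GET /my-index/_analyze
{
  "field": "fulltext_chunks",
  "text": "<the extracted fulltext of one large document>"
}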

Configuration:

I currently use this mapping for my index:

"mappings": {
  "properties": {
    "author": {
      "type": "keyword"
    },
    "date_change": {
      "type": "date"
    },
    "file_size": {
      "type": "long"
    },
    "file_type": {
      "type": "keyword"
    },
    "fulltext": {
      "type": "text",
      "index": false // This should prevent the analyzer from running over the too large field - correct?
    },
    "fulltext_chunks": {
      "type": "text"
    },
    "language": {
      "type": "keyword"
    },
    "title": {
      "type": "text"
    },
    "url": {
      "type": "keyword"
    },
    "vector_embeddings": {
      "type": "nested",
      "properties": {
        "chunk_text": {
          "type": "text"
        },
        "knn": {
          "type": "knn_vector",
          "dimension": 768,
          "method": {
            "engine": "lucene",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        }
      }
    }
  }
}

Together with the following ingest pipeline:

{
  "description": "Pipeline for generating embeddings from fulltext",
  "processors": [
    {
      "text_chunking": {
        "field_map": {
          "fulltext": "fulltext_chunks"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10000,
            "overlap_rate": 0.1
          }
        }
      }
    },
    {
      "text_chunking": {
        "field_map": {
          "fulltext_chunks": "tmp_chunks"
        },
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2
          }
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "2ggnPJQBN3iZDHu2iDI2",
        "field_map": {
          "tmp_chunks": "tmp_knn"
        }
      }
    },
    {
      "script": {
        "source": "if (ctx.tmp_chunks != null && ctx.tmp_knn != null) { ctx.vector_embeddings = []; for (int i = 0; i < ctx.tmp_chunks.size(); i++) { ctx.vector_embeddings.add(['chunk_text': ctx.tmp_chunks[i], 'knn': ctx.tmp_knn[i].knn]); } }"
      }
    },
    {
      "remove": {
        "ignore_failure": true,
        "field": ["tmp_chunks", "tmp_knn"]
      }
    },
    {
      "remove": {
        "ignore_failure": true,
        "field": "fulltext"
      }
    }
  ]
}
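
For completeness, I register the pipeline and attach it to the index roughly like this (pipeline and index names are placeholders):

PUT /_ingest/pipeline/fulltext-embedding-pipeline
<the pipeline definition shown above>

PUT /my-index/_settings
{
  "index.default_pipeline": "fulltext-embedding-pipeline"
}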

@devmoreng Have you tested text chunking for the fulltext_chunks field with a value lower than 10k? Did you notice any change? Does the error message change?

The default value of token_limit is 384 so that output passages don’t exceed the token limit constraint of the downstream text embedding models. For OpenSearch-supported pretrained models, like msmarco-distilbert-base-tas-b and opensearch-neural-sparse-encoding-v1, the input token limit is 512. The standard tokenizer tokenizes text into words. According to OpenAI, 1 token equals approximately 0.75 words of English text. The default token limit is calculated as 512 * 0.75 = 384.
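
In other words, a text_chunking processor that omits token_limit entirely, like this minimal sketch reusing your field names (all fixed_token_length parameters are optional here, as far as I know), already caps each chunk at 384 tokens:

{
  "text_chunking": {
    "field_map": {
      "fulltext": "fulltext_chunks"
    },
    "algorithm": {
      "fixed_token_length": {}
    }
  }
}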

Yes, I tried that and also experimented with the _simulate endpoint. Sadly, there is no change in the result:

{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_state_exception",
            "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
          }
        ],
        "type": "illegal_state_exception",
        "reason": "analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.",
        "caused_by": {
          "type": "illegal_state_exception",
          "reason": "The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting."
        }
      }
    }
  ]
}
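
The simulate call itself looked roughly like this (pipeline name and document content are placeholders):

POST /_ingest/pipeline/fulltext-embedding-pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-index",
      "_source": {
        "fulltext": "<long extracted text, far more than 10,000 tokens>"
      }
    }
  ]
}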

Something else just came to my mind: could it be that the analyzer treats array fields of text as a single text to analyze, and therefore my chunking does not have any effect?

Sadly, the error message does not specify which field the analyzer fails on. But since I set fulltext to not be indexed and even remove it in my pipeline, I believe it must be fulltext_chunks.
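
One way I could probe the array question would be to pass several values for the same field to _analyze and see whether the limit is applied per value or to the combined token count, e.g. (index name and texts are placeholders):

GET /my-index/_analyze
{
  "field": "fulltext_chunks",
  "text": [
    "first chunk, just under 10,000 tokens ...",
    "second chunk, just under 10,000 tokens ..."
  ]
}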

@devmoreng I suspect that could be the case. The token limit is present all the time and can’t be disabled. As per the documentation, the default is 384, so chunking happens anyway.

I am currently trying to limit the number of tokens analyzed with a custom analyzer instead of text chunking, but that does not seem to have any effect at all. Can anybody point me in the right direction on how to fix this token limit problem during analysis?

Now I am trying the following index settings & mapping, using a custom analyzer with a token limit of 10,000 on all text fields.

Index settings

    "analysis": {
      "filter": {
        "limit_token_count": {
          "type": "limit",
          "max_token_count": "10000"
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "filter": [
            "lowercase",
            "limit_token_count"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    },

Index mapping

  "mappings": {
    "properties": {
      "author": {
        "type": "keyword"
      },
      "date_change": {
        "type": "date"
      },
      "file_size": {
        "type": "long"
      },
      "file_type": {
        "type": "keyword"
      },
      "fulltext": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "is_public": {
        "type": "boolean"
      },
      "keywords": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "language": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "url": {
        "type": "keyword"
      },
      "vector_embeddings": {
        "type": "nested",
        "properties": {
          "chunk_text": {
            "type": "text",
            "analyzer": "custom_analyzer"
          },
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "engine": "lucene",
              "space_type": "l2",
              "name": "hnsw",
              "parameters": {}
            }
          }
        }
      }
    }
  }

I would think that with this custom analyzer being applied to all text fields, the error should go away. But it doesn’t.
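
As a next step I would sanity-check the custom analyzer on its own with the _analyze API, just to confirm that the limit filter is actually picked up; the index name is a placeholder, and this of course only exercises the analyzer, not the ingest pipeline:

GET /my-index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Some sample text to check that the lowercase and limit_token_count filters are applied."
}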