Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
"version" : {
"distribution" : "opensearch",
"number" : "2.15.0",
"build_type" : "rpm",
"build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
"build_date" : "2024-06-20T03:27:31.591886152Z",
"build_snapshot" : false,
"lucene_version" : "9.10.0",
"minimum_wire_compatibility_version" : "7.10.0",
"minimum_index_compatibility_version" : "7.0.0"
},
Describe the issue:
When indexing documents whose text fields use the standard analyzer, the fulltext is sometimes too large and I get the expected exception:
'analyzer standard throws exception: The number of tokens produced by calling _analyze has exceeded the allowed maximum of [10000]. This limit can be set by changing the [index.analyze.max_token_count] index level setting.'
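For reference, the setting the error message points to is a dynamic index setting and could be raised per index roughly like this (index name is a placeholder):
PUT /my-index/_settings
{
  "index.analyze.max_token_count": 20000
}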
I don't want to increase that limit, though, since documents can be of unknown length. Instead, my idea was to chunk the fulltext field into fulltext_chunks of at most 10,000 tokens each and analyze only those chunks instead of the large field, by setting "index": false
on the large field. However, that does not work as I expected: the error still occurs.
Is the analyzer limit not per field but for the whole document being indexed? Or do I have some other misunderstanding of how it works?
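One way to narrow this down could be to run the _analyze API against a single field and see which field actually trips the limit (index name and text are placeholders):
POST /my-index/_analyze
{
  "field": "fulltext_chunks",
  "text": "<one chunk of extracted document text>"
}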
Configuration:
I currently use this mapping for my index:
"mappings": {
"properties": {
"author": {
"type": "keyword"
},
"date_change": {
"type": "date"
},
"file_size": {
"type": "long"
},
"file_type": {
"type": "keyword"
},
"fulltext": {
"type": "text",
"index": false // This should prevent the analyzer from running over the too large field - correct?
},
"fulltext_chunks": {
"type": "text"
},
"language": {
"type": "keyword"
},
"title": {
"type": "text"
},
"url": {
"type": "keyword"
},
"vector_embeddings": {
"type": "nested",
"properties": {
"chunk_text": {
"type": "text"
},
"knn": {
"type": "knn_vector",
"dimension": 768,
"method": {
"engine": "lucene",
"space_type": "l2",
"name": "hnsw",
"parameters": {}
}
}
}
}
}
}
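For context, the index itself is created with this mapping roughly like this (index name is a placeholder; as far as I know, index.knn has to be enabled for the knn_vector field):
PUT /my-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": { ...mapping from above... }
}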
Together with the following ingest pipeline:
{
"description": "Pipeline for generating embeddings from fulltext",
"processors": [
{
"text_chunking": {
"field_map": {
"fulltext": "fulltext_chunks"
},
"algorithm": {
"fixed_token_length": {
"token_limit": 10000,
"overlap_rate": 0.1
}
}
}
},
{
"text_chunking": {
"field_map": {
"fulltext_chunks": "tmp_chunks"
},
"algorithm": {
"fixed_token_length": {
"token_limit": 384,
"overlap_rate": 0.2
}
}
}
},
{
"text_embedding": {
"model_id": "2ggnPJQBN3iZDHu2iDI2",
"field_map": {
"tmp_chunks": "tmp_knn"
}
}
},
{
"script": {
"source": "if (ctx.tmp_chunks != null && ctx.tmp_knn != null) { ctx.vector_embeddings = []; for (int i = 0; i < ctx.tmp_chunks.size(); i++) { ctx.vector_embeddings.add(['chunk_text': ctx.tmp_chunks[i], 'knn': ctx.tmp_knn[i].knn]); } }"
}
},
{
"remove": {
"ignore_failure": true,
"field": ["tmp_chunks", "tmp_knn"]
}
},
{
"remove": {
"ignore_failure": true,
"field": "fulltext"
}
}
]
}
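The pipeline is registered and attached to the index roughly like this (pipeline and index names are placeholders); the _simulate endpoint should also make it possible to test the chunking against a single long document without indexing it:
PUT _ingest/pipeline/fulltext-embedding-pipeline
{ ...pipeline definition from above... }

PUT /my-index/_settings
{
  "index.default_pipeline": "fulltext-embedding-pipeline"
}

POST _ingest/pipeline/fulltext-embedding-pipeline/_simulate
{
  "docs": [
    { "_source": { "fulltext": "<long extracted document text>" } }
  ]
}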