"Document contains at least one immense term in field" error when indexing large value in flat_object field

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch server v2.11.0

Describe the issue:
Hi,

I am getting a “Document contains at least one immense term in field” error when indexing a document with a largish (~100 KB) value in a ‘flat_object’ type field. I thought the purpose of the flat_object field was to allow storing (but not indexing) large structured data (from the documentation: “The flat object field type solves this problem by treating the entire JSON object as a string. Subfields within the JSON object are accessible using standard dot path notation, but they are not indexed for fast lookup.”)…

The error says that there is a 32K limit to the term size:

Document contains at least one immense term in field="foo._value" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[83, 111, 117, 114, 99, 101, 32, 69, 120, 116, 101, 110, 115, 105, 111, 110, 32, 83, 101, 116, 115, 58, 68, 97, 116, 97, 32, 81, 117, 97]...', original message: bytes can be at most 32766 in length; got 99431

Why is the term length even an issue here? I just want to store the value - not index it. Do flat_object fields get indexed as a single term? I tried explicitly configuring the mapping to tell it not to index the field:

PUT /testindex/_mapping
{
  "properties": {
    "foo": {
      "type": "flat_object",
      "index": false
    }
  }
}

but that resulted in an error:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Mapping definition for [foo] has unsupported parameters:  [index : false]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Mapping definition for [foo] has unsupported parameters:  [index : false]"
  },
  "status": 400
}

The flat_object documentation notes that “The maximum field value length in the dot notation is 2^24 − 1.”, which is ~16 MB, so why am I having trouble inserting a mere 100 KB value into a flat_object type field?

Any insight would be much appreciated.


After some experimentation, I found that the issue is the size of an individual leaf field value and not the size of the flat object overall. However, based on the documentation, it seems that this should not be a problem at all because it says that the whole object is treated as a single string and subfields are explicitly not indexed. From the documentation (emphasis added):

“The flat object field type solves this problem by treating the entire JSON object as a string. Subfields within the JSON object are accessible using standard dot path notation, but they are not indexed for fast lookup.”

The example below shows that when 32 KB of data is split between two subfields in the flat object, there is no problem, but if a single subfield contains 32 KB on its own, indexing the document fails.

It looks like subfields are in fact being indexed (perhaps as keywords), so the 32 KB term size limit is applied to each leaf subfield. As mentioned above, the documentation indicates that this should not be the case. Is this a bug? At the very least, you should be able to configure flat_object field mappings to not index subfields by specifying index: false, but as I mentioned in the original post, that configuration is currently rejected. I also tried specifying a non-keyword analyzer for the flat_object field so that, if subfields are indexed, the terms would be shorter - that too was rejected by the system.

Example:

  1. Create index with flat_object field foo

Request

PUT /testindex
{
  "mappings": {
    "properties": {
      "foo": {
        "type": "flat_object"
      }
    }
  }
}

Response

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "testindex"
}
  2. Index document with 2 subfields, each of size 16 KB (actual values truncated to fit within forum post size constraints)

Request

POST /testindex/_doc
{
  "foo": {
    "subfoo1": "word word word word word word word word word word ...",
    "subfoo2": "word word word word word word word word word word ..."
  }
}

Response

{
  "_index": "testindex",
  "_id": "8DVK2IsBdJ0ZDzEauLyA",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
  3. Index document with 2 subfields: one empty and one of size 32 KB (actual values truncated to fit within forum post size constraints)

Request

POST /testindex/_doc
{
  "foo": {
    "subfoo1": "",
    "subfoo2": "word word word word word word word word word word ..."
  }
}

Response

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"foo._value\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32]...', original message: bytes can be at most 32766 in length; got 32768"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Document contains at least one immense term in field=\"foo._value\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32, 119, 111, 114, 100, 32]...', original message: bytes can be at most 32766 in length; got 32768",
    "caused_by": {
      "type": "max_bytes_length_exceeded_exception",
      "reason": "max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 32768"
    }
  },
  "status": 400
}

Well, it looks like what I really wanted was an ‘object’ field type with ‘enabled’=false (to store and retrieve large JSON content without indexing individual fields) rather than a ‘flat_object’:

PUT /testindex
{
  "mappings": {
    "properties": {
      "foo": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
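
As a quick sanity check (document ID and value below are illustrative), a large nested value should now index without hitting the per-term limit and can be read back from _source, since nothing under foo is analyzed or indexed:

PUT /testindex/_doc/1
{
  "foo": {
    "big_subfield": "word word word word ... (single value well over 32 KB)"
  }
}

GET /testindex/_doc/1

The GET returns the original JSON in _source; the trade-off is that subfields of foo can no longer be searched or aggregated on.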

FYI, it appears that the data for a flat_object gets stored as three ‘system’ fields, which are in turn keywords. Lucene’s maximum size for a single term is 32 KB (32,766 bytes). Apparently, newer versions of OpenSearch can tokenize a keyword field into individual terms and store each one separately in Lucene (I haven’t been able to verify this yet), which greatly increases the supported size of a keyword field. However, I believe that if any one of those individual terms exceeds 32 KB, the error is still thrown, which is why a single ‘leaf’ value can trigger it.
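
That also lines up with the query behavior: leaf values under a flat_object can be matched by exact value through dot-path notation, which only makes sense if each leaf value is indexed as a single keyword-style term. A minimal sketch (index name, document, and values are hypothetical; the query shape follows the flat_object documentation):

PUT /demo_flat
{
  "mappings": {
    "properties": {
      "foo": { "type": "flat_object" }
    }
  }
}

PUT /demo_flat/_doc/1?refresh=true
{
  "foo": { "subfoo1": "hello world" }
}

GET /demo_flat/_search
{
  "query": {
    "term": { "foo.subfoo1": "hello world" }
  }
}

If the leaf values were stored but not indexed, a term query like this would find nothing; conversely, indexing each leaf as a single term is exactly where the 32,766-byte Lucene limit bites.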
