Ignore_above not Ignoring Large Terms

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
opensearch 2.11

Describe the issue:
I set the ignore_above option to 32000 for the message field in the mapping table of a specific index in OpenSearch.
However, when I tried to insert certain data into the index, I encountered the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Document contains at least one immense term in field=\"message\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[60, 49, 51, 52, 62, 49, 32, 50, 48, 50, 52, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 50, 48, 58, 49, 48, 32, 49, 55, 50]...', original message: bytes can be at most 32766 in length; got 34255"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Document contains at least one immense term in field=\"message\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[60, 49, 51, 52, 62, 49, 32, 50, 48, 50, 52, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 50, 48, 58, 49, 48, 32, 49, 55, 50]...', original message: bytes can be at most 32766 in length; got 34255",
    "caused_by": {
      "type": "max_bytes_length_exceeded_exception",
      "reason": "max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 34255"
    }
  },
  "status": 400
}

According to the error message, the UTF-8 encoding of the message field is 34255 bytes long, which exceeds the 32766-byte limit and caused the error.
However, I had clearly set ignore_above to 32000, so why wasn't the 34255-byte value ignored instead of being passed on for indexing?
The value in the "message" field is 37,695 characters long including spaces and 37,322 characters excluding spaces. Either way, the character length exceeds 32,000.

A second question: I changed the ignore_above value for the message field to 8000 and tried to insert the same log into the index again. This time, the message field was ignored and the document was successfully indexed. Why did this work correctly this time?

ignore_above is a character count, but the max length validation checks bytes. Since a UTF-8 character occupies at most 4 bytes, ignore_above can be set to at most 32766 / 4 = 8191 (rounded down) for UTF-8 text in your case. Your second setting of 8000 is below that bound.
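
To make the arithmetic concrete, here is a small Python sketch of the difference between character count and UTF-8 byte length. It is only an illustration: the helper names `safe_ignore_above` and `utf8_len` are made up for this example, and Python's `len()` on a string is used as a stand-in for the character count that ignore_above compares against, which may not match OpenSearch's internal counting exactly.

```python
# Byte limit for a single term, taken from the error message above.
MAX_TERM_BYTES = 32766
# A UTF-8 character occupies at most 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4


def safe_ignore_above(max_term_bytes: int = MAX_TERM_BYTES) -> int:
    """Largest character count whose UTF-8 encoding always fits the byte limit.

    Hypothetical helper for illustration; not part of any OpenSearch client.
    """
    return max_term_bytes // MAX_UTF8_BYTES_PER_CHAR


def utf8_len(value: str) -> int:
    """Number of bytes the value occupies once encoded as UTF-8."""
    return len(value.encode("utf-8"))


# Same character count, very different byte lengths.
ascii_value = "a" * 8191           # 1 byte per character
wide_value = "\U0001F600" * 8191   # 4 bytes per character

print(safe_ignore_above())                       # 8191
print(len(ascii_value), utf8_len(ascii_value))   # 8191 8191
print(len(wide_value), utf8_len(wide_value))     # 8191 32764
```

With 8191 characters, even the worst case of 4 bytes per character stays under 32766 bytes. One more 4-byte character would push the encoded value to 32768 bytes, over the limit.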
