Unusual error with text encoding while ingesting data

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.7.0

Describe the issue:
I recently ingested a large dataset into an OpenSearch cluster using AWS’s OpenSearch Ingestion tool (which uses Data Prepper). The source was an S3 bucket with partitioned JSON data (triggered by SQS messages), the sink was my cluster, and the only processors were “parse_json” and “delete_entries”, the latter to remove one key from every object. I’ve used the same pipeline configuration successfully in the past (with OpenSearch v2.5).

For some reason, after ingesting, all texts like “van bühl” were saved as “van b�hl”. It seems that every non-ASCII character, such as ä, ó, or an emoji like 💎, was replaced with the Unicode replacement character � (U+FFFD). This happened to text data in all fields.
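
To illustrate the symptom (not to diagnose this pipeline): U+FFFD typically appears when bytes are decoded with a charset that does not match how they were encoded. The sketch below is plain Python and makes no assumption about what Data Prepper does internally; it only shows two mismatches that produce the replacement character.

```python
# Illustration only: reproduces the symptom, not the actual ingestion pipeline.
original = "van bühl"

# Case 1: UTF-8 bytes decoded as ASCII.
# "ü" is two bytes in UTF-8 (0xC3 0xBC); each undecodable byte becomes U+FFFD.
utf8_bytes = original.encode("utf-8")
as_ascii = utf8_bytes.decode("ascii", errors="replace")

# Case 2: Latin-1 bytes decoded as UTF-8.
# "ü" is one byte in Latin-1 (0xFC), which is not valid UTF-8, so it becomes one U+FFFD.
latin1_bytes = original.encode("latin-1")
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(repr(as_ascii))
print(repr(as_utf8))
```

Counting the replacement characters per original character (one versus two for “ü”) can hint at which kind of mismatch occurred.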

Any idea why this may have happened? Help is really appreciated.

@ishan Did you check the settings of the previous index, the one you used with 2.5? Maybe it had an analyzer with a char_filter set in the field’s mapping?
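
A minimal sketch of that check, assuming you have already fetched the index settings (for example via GET /<index>/_settings) and loaded them as a Python dict. The function name and the example settings below are hypothetical and only illustrate the shape of the data.

```python
def analyzers_with_char_filter(settings: dict) -> dict:
    """Return {analyzer_name: char_filter} for analyzers that declare a char_filter."""
    analyzers = settings.get("index", {}).get("analysis", {}).get("analyzer", {})
    return {
        name: spec["char_filter"]
        for name, spec in analyzers.items()
        if spec.get("char_filter")
    }

# Hypothetical settings, shaped like the "settings" object of one index.
example_settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "ascii_only": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["my_mapping_filter"],
                },
                "plain": {"type": "custom", "tokenizer": "standard"},
            }
        }
    }
}

print(analyzers_with_char_filter(example_settings))
```

Running the same function against the settings of the old and the new index would show whether any analyzer with a char_filter exists in one but not the other.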