Unusual error with text encoding while ingesting data

ishan · July 26, 2023, 2:18am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.7.0

Describe the issue:
I recently ingested a large dataset into an OpenSearch cluster using AWS’s OpenSearch Ingestion tool (which uses Data Prepper). The source was an S3 bucket with partitioned JSON data (triggered by SQS messages), the sink was my cluster, and the only processors were “parse_json” and “delete_entries”, the latter for removing a key from every object. I’ve used the same pipeline configuration successfully in the past (with OpenSearch v2.5).

For some reason, after ingesting, all texts like “van bühl” were saved as “van b�hl”. It seems that every non-ASCII character, such as ä or ó, was replaced with the Unicode replacement character � (\uFFFD). This happened to text data in all fields.

Any idea why this may have happened? Help is really appreciated.
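For reference, the symptom above can be reproduced in a few lines. This is only a sketch of one common way U+FFFD appears, namely bytes in a non-UTF-8 encoding being decoded as UTF-8; whether that is what happened in this pipeline is an assumption and is not confirmed anywhere in this thread. The helper names `decode_as_utf8` and `is_valid_utf8` are made up for illustration and are not part of Data Prepper or OpenSearch.

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for any invalid sequence."""
    return raw.decode("utf-8", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical case: the source file was written as Latin-1 (ISO-8859-1),
# so "ü" is stored as the single byte 0xFC, which is not valid UTF-8.
latin1_bytes = "van bühl".encode("latin-1")
utf8_bytes = "van bühl".encode("utf-8")

print(is_valid_utf8(utf8_bytes))     # True
print(is_valid_utf8(latin1_bytes))   # False
print(decode_as_utf8(latin1_bytes))  # van b\ufffdhl, rendered as "van b�hl"
```

Running `is_valid_utf8` over the raw bytes of one of the S3 objects would show whether the source files are actually UTF-8 before they reach the pipeline.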

pablo · July 26, 2023, 10:38am

@ishan Did you check the index settings you used previously in 2.5? Maybe you had an analyzer with a char_filter set in the field’s mapping?
