Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Describe the issue:
I’ve configured a Data Prepper ingestion pipeline to index documents into OpenSearch. Initially, OpenSearch automatically generated the document ID for each ingested document. Later, I discovered a configuration option that allows using a specific field from the incoming JSON as the document ID. I updated the pipeline to use the id field by setting document_id: ${/id}. While this correctly assigns the document ID, the issue is that the id field is still included in the document body, resulting in duplication. This redundancy increases storage usage unnecessarily. I’m looking for a configuration or feature in Data Prepper that allows using a field as the document ID without indexing it in the document source. Is there a recommended way to achieve this?
Configuration:
```
sink:
  - opensearch:
      hosts:
        - "..."
      index: "data-prepper-0001"
      username: admin
      password: password
      document_id: ${/id}
      insecure: true
```
Relevant Logs or Screenshots:
I’m ingesting JSON documents into OpenSearch using Data Prepper. For example, the incoming JSON looks like this:
```
[
  {
    "id": "12345",
    "name": "test user",
    "action": "testing"
  }
]
```
After indexing, the document appears in OpenSearch as:
```
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "data-prepper-0001",
        "_id": "12345",
        "_score": 1,
        "_source": {
          "id": "12345",
          "name": "test user",
          "action": "testing"
        }
      }
    ]
  }
}
```
As you can see, the id field is used both as the document ID (_id) and is also stored in the document body (_source), causing duplication. I’m looking for a way to use the id field as the document ID without storing it in the _source to reduce storage overhead.
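To make the duplication concrete, here is a minimal Python sketch that scans a search response and lists the hits whose `_source` repeats the document ID. It only inspects the response; it does not change how Data Prepper indexes anything. The helper name `duplicated_id_hits` is hypothetical, and the response is trimmed to the fields shown above.

```python
import json

# Search response from the post, trimmed to the fields the check needs.
response = json.loads("""
{
  "hits": {
    "hits": [
      {
        "_index": "data-prepper-0001",
        "_id": "12345",
        "_source": {"id": "12345", "name": "test user", "action": "testing"}
      }
    ]
  }
}
""")


def duplicated_id_hits(response, field="id"):
    """Return the _id of every hit whose _source[field] equals its _id.

    Hypothetical helper for illustration; `field` is the source field
    that the pipeline uses as document_id.
    """
    return [
        hit["_id"]
        for hit in response["hits"]["hits"]
        if hit.get("_source", {}).get(field) == hit["_id"]
    ]


print(duplicated_id_hits(response))
```

For the sample document, the list contains `12345`, confirming that the same value is stored once as `_id` and again inside `_source`.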