How to Remove Document ID Field After Using It as `document_id` in opensearch sink option

Aravinth · August 31, 2025, 5:52pm

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

Describe the issue:
I’ve configured a Data Prepper ingestion pipeline to index documents into OpenSearch. Initially, OpenSearch automatically generated the document ID for each ingested document. Later, I discovered a configuration option that allows using a specific field from the incoming JSON as the document ID. I updated the pipeline to use the id field by setting document_id: ${/id}. While this correctly assigns the document ID, the issue is that the id field is still included in the document body, resulting in duplication. This redundancy increases storage usage unnecessarily. I’m looking for a configuration or feature in Data Prepper that allows using a field as the document ID without indexing it in the document source. Is there a recommended way to achieve this?

Configuration:

```
sink:
  - opensearch:
      hosts:
        - "
      index: "data-prepper-0001"
      username: admin
      password: password
      document_id: ${/id}
      insecure: true
```

Relevant Logs or Screenshots:
I’m ingesting JSON documents into OpenSearch using Data Prepper. For example, the incoming JSON looks like this:

```
[
  {
    "id": "12345",
    "name": "test user",
    "action": "testing"
  }
]
```

After indexing, the document appears in OpenSearch as:
```
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "data-prepper-0001",
        "_id": "12345",
        "_score": 1,
        "_source": {
          "id": "12345",
          "name": "test user",
          "action": "testing"
        }
      }
    ]
  }
}
```

As you can see, the id field is used both as the document ID (_id) and is also stored in the document body (_source), causing duplication. I’m looking for a way to use the id field as the document ID without storing it in the _source to reduce storage overhead.

Anthony · September 1, 2025, 11:14am

@Aravinth Thank you for posting the question. Can you please surround your configuration blocks in code blocks to make them easier to read?

Regarding the ID field duplication, have you tried removing it in the pipeline using a processor? See the example below:

processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]

Aravinth · September 1, 2025, 11:45am

version: "2"
log-pipeline:
  source:
    http:
      path: /log/ingest
      port: 2021

  processor:
    - delete_entries:
        with_keys: ["id"]

  sink:
    - opensearch:
        hosts:
          - "https://opensearch-node1:9200"
        index: "data-prepper-0001"
        username: admin
        password: password
        document_id: ${/id}
        insecure: true

@Anthony this was my configuration. I also tried removing the key, but it is removed before the JSON reaches the sink, which raises this error: `Document failed to write to OpenSearch with error code 400. Configure a DLQ to save failed documents. Error: if _id is specified it must not be empty.`
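The ordering problem described here can be illustrated with a toy model. This is only a sketch, not Data Prepper code: the functions `delete_entries` and `resolve_document_id` are hypothetical stand-ins, and the assumption that a missing key resolves to an empty string is inferred from the error message rather than confirmed.

```python
# Toy model: processors run first, then the sink resolves document_id.
# Hypothetical stand-ins only; this is not Data Prepper's implementation.

def delete_entries(event, with_keys):
    """Mimic the delete_entries processor: drop the listed keys from the event body."""
    return {k: v for k, v in event.items() if k not in with_keys}

def resolve_document_id(event, key):
    """Mimic the sink resolving ${/id}. Assumption: a missing key yields an empty string."""
    return str(event.get(key, ""))

event = {"id": "12345", "name": "test user", "action": "testing"}

# The processor stage runs before the sink stage.
processed = delete_entries(event, ["id"])
doc_id = resolve_document_id(processed, "id")

print(repr(doc_id))  # the key is already gone when the sink looks for it
print(processed)
```

In this model the sink has nothing left to read, which matches the "must not be empty" error: the value has to be kept somewhere outside the document body before the key is deleted.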

Anthony · September 1, 2025, 12:23pm

@Aravinth can you give example of document that you are trying to index?

Can you try the following processor:

  processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]

Aravinth · September 1, 2025, 1:48pm

version: "2"
log-pipeline:
  source:
    http:
      path: /log/ingest
      port: 2021

  processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]
  sink:
    - opensearch:
        hosts:
          - "https://opensearch-node1:9200"
        index: "data-prepper-0001"
        username: admin
        password: Aravinth@31
        document_id: "${getMetadata("doc_id")}"
        insecure: true

@Anthony thanks for sharing the above configuration. Initially I tried getMetadata(\"doc_id\"); I have now changed it to getMetadata("doc_id"). The final configuration works: the document ID is taken from the doc_id metadata, and the id field is no longer in the document body.
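Why the metadata approach works can be sketched with a small model of an event that carries a body and separate metadata. The `Event` class and helper functions below are hypothetical and only imitate the roles of `add_entries` (with `metadata_key`), `delete_entries`, and the sink's `getMetadata("doc_id")` lookup. They are not Data Prepper's API.

```python
# Toy model: copy the id into metadata, delete it from the body, then build the request.
# All names are hypothetical illustrations, not Data Prepper's implementation.
from dataclasses import dataclass, field

@dataclass
class Event:
    data: dict
    metadata: dict = field(default_factory=dict)

def add_metadata_entry(event, metadata_key, source_key):
    """Imitate add_entries with metadata_key: copy a body value into metadata."""
    event.metadata[metadata_key] = event.data[source_key]

def delete_entries(event, with_keys):
    """Imitate delete_entries: remove keys from the body only; metadata is untouched."""
    for key in with_keys:
        event.data.pop(key, None)

def to_index_request(event, metadata_key):
    """Imitate the sink: _id comes from metadata, _source from the remaining body."""
    return {"_id": event.metadata[metadata_key], "_source": dict(event.data)}

event = Event({"id": "12345", "name": "test user", "action": "testing"})
add_metadata_entry(event, "doc_id", "id")
delete_entries(event, ["id"])
request = to_index_request(event, "doc_id")
print(request)
```

Because metadata lives alongside the event rather than inside its body, the value survives the deletion and is not written to `_source`.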
