How to Remove the Document ID Field After Using It as `document_id` in the OpenSearch Sink

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

Describe the issue:
I’ve configured a Data Prepper ingestion pipeline to index documents into OpenSearch. Initially, OpenSearch automatically generated the document ID for each ingested document. Later, I discovered a configuration option that allows using a specific field from the incoming JSON as the document ID. I updated the pipeline to use the id field by setting document_id: ${/id}. While this correctly assigns the document ID, the issue is that the id field is still included in the document body, resulting in duplication. This redundancy increases storage usage unnecessarily. I’m looking for a configuration or feature in Data Prepper that allows using a field as the document ID without indexing it in the document source. Is there a recommended way to achieve this?

Configuration:

```
sink:
  - opensearch:
      hosts:
        - "
      index: "data-prepper-0001"
      username: admin
      password: password
      document_id: ${/id}
      insecure: true
```

Relevant Logs or Screenshots:
I’m ingesting JSON documents into OpenSearch using Data Prepper. For example, the incoming JSON looks like this:

```
[
  {
    "id": "12345",
    "name": "test user",
    "action": "testing"
  }
]
```

After indexing, the document appears in OpenSearch as:
```
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "data-prepper-0001",
        "_id": "12345",
        "_score": 1,
        "_source": {
          "id": "12345",
          "name": "test user",
          "action": "testing"
        }
      }
    ]
  }
}
```

As you can see, the id field is used both as the document ID (_id) and is also stored in the document body (_source), causing duplication. I’m looking for a way to use the id field as the document ID without storing it in the _source to reduce storage overhead.

@Aravinth Thank you for posting the question. Can you please surround your configuration in code blocks to make it easier to read?

Regarding the ID field duplication, have you tried removing the field in the pipeline with a processor? See the example below:

```
processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]
```

```
version: "2"
log-pipeline:
  source:
    http:
      path: /log/ingest
      port: 2021

  processor:
    - delete_entries:
        with_keys: ["id"]

  sink:
    - opensearch:
        hosts:
          - "https://opensearch-node1:9200"
        index: "data-prepper-0001"
        username: admin
        password: password
        document_id: ${/id}
        insecure: true
```

@Anthony this was my configuration. I also tried removing the key, but the processor removes it before the JSON reaches the sink, so the sink raises this error: `Document failed to write to OpenSearch with error code 400. Configure a DLQ to save failed documents. Error: if _id is specified it must not be empty.`

@Aravinth Can you give an example of a document that you are trying to index?

Can you try the following processor:

```
  processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]
```

```
version: "2"
log-pipeline:
  source:
    http:
      path: /log/ingest
      port: 2021

  processor:
    - add_entries:
        entries:
          - metadata_key: "doc_id"
            value_expression: "/id"
    - delete_entries:
        with_keys: ["id"]

  sink:
    - opensearch:
        hosts:
          - "https://opensearch-node1:9200"
        index: "data-prepper-0001"
        username: admin
        password: Aravinth@31
        document_id: '${getMetadata("doc_id")}'
        insecure: true
```

@Anthony Thanks for sharing the above configuration. Initially I tried `getMetadata(\"doc_id\")` with escaped quotes; I have now changed it to `getMetadata("doc_id")`. The final configuration works: the document ID is taken from the `doc_id` metadata, and the `id` field no longer appears in `_source`.
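
For anyone landing here later, the ordering problem in this thread can be sketched with a toy model. This is not Data Prepper code: `Event`, `copy_to_metadata`, `delete_key`, and `sink` are hypothetical names made up for illustration, and the only assumption is what the thread shows, namely that processors run before the sink resolves `document_id`.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a pipeline event: a body plus side-channel metadata."""
    data: dict
    metadata: dict = field(default_factory=dict)


def copy_to_metadata(event, key, metadata_key):
    # Plays the role of the add_entries step: copy a body field into metadata.
    event.metadata[metadata_key] = event.data.get(key)


def delete_key(event, key):
    # Plays the role of the delete_entries step: drop the field from the body.
    event.data.pop(key, None)


def sink(event, resolve_id):
    # The ID is resolved only here, after every processor has already run.
    doc_id = resolve_id(event)
    if not doc_id:
        raise ValueError("if _id is specified it must not be empty")
    return {"_id": doc_id, "_source": dict(event.data)}


doc = {"id": "12345", "name": "test user", "action": "testing"}

# Attempt 1: delete the field, then read the ID from the body (like ${/id}).
first = Event(dict(doc))
delete_key(first, "id")
try:
    sink(first, lambda e: e.data.get("id"))
    failed = False
except ValueError:
    failed = True

# Attempt 2: stash the ID in metadata first, then delete it from the body.
second = Event(dict(doc))
copy_to_metadata(second, "id", "doc_id")
delete_key(second, "id")
indexed = sink(second, lambda e: e.metadata.get("doc_id"))

print(failed)
print(indexed)
```

In this sketch the first attempt fails because the body no longer holds `id` when the sink looks for it, while the second succeeds because metadata travels with the event but is not part of the body that gets written.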
