Data Prepper extract record field and partition on it

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.8

Describe the issue:
I’d like to migrate data from OpenSearch to an S3 sink and hive-partition the data on a record field. The record’s fields (Avro definition):

    {
      "type" : "record",
      "namespace" : "org.opensearch.dataprepper.examples",
      "name" : "Data",
      "fields" : [
        { "name" : "created", "type" : {"type" : "string", "logicalType" : "timestamp-micros"}},
        { "name" : "resource", "type": ["null", "string"]},
        { "name" : "response_payload", "type" : ["null", "string"]}
      ]
    }

The partition should happen on the `created` field. For example, the record below:

{
  "created": "2023-01-05T12:35:24.139Z",
  "resource": "asdfsdafsdfa",
  "response_payload": "asdf"
}

should go to the S3 prefix “test/year=2023/month=1/day=5/”. And this payload:

{
  "created": "2024-05-12T13:13:13.000Z",
  "resource": "asdfsdafsdfa",
  "response_payload": "asdf"
}

should go to the S3 prefix “test/year=2024/month=5/day=12/”.
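Outside of Data Prepper, the intended mapping from `created` to a hive partition prefix can be sketched in plain Python. This is illustrative only; `hive_partition_prefix` is a hypothetical helper, not part of Data Prepper or its expression syntax:

```python
from datetime import datetime

def hive_partition_prefix(record: dict, base: str = "test") -> str:
    """Build a hive-style partition prefix from the record's 'created' field.

    Assumes 'created' is an ISO-8601 UTC timestamp with milliseconds,
    matching the example records above.
    """
    created = datetime.strptime(record["created"], "%Y-%m-%dT%H:%M:%S.%fZ")
    # Non-zero-padded month/day, matching the desired prefixes above.
    return f"{base}/year={created.year}/month={created.month}/day={created.day}/"

print(hive_partition_prefix({"created": "2023-01-05T12:35:24.139Z"}))
# → test/year=2023/month=1/day=5/
```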

Here is a link to a GitHub issue that was resolved. However, the documentation lacks a clear guide on how to properly extract a record’s field and partition data on it.

Configuration:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      # Provide an OpenSearch or Elasticsearch cluster endpoint.
      hosts: <host>
      indices:
        include:
          - index_name_regex: <regex>
      aws:
        region: <region>
        sts_role_arn: <role>
        serverless: false
  sink:
    - s3:
        aws:
          region: <region>
          sts_role_arn: <role>
        bucket: <bucket>
        object_key:
          path_prefix: test/year=${date_time_format(/created, "yyyy")}/month=${date_time_format(/created, "MM")}/day=${date_time_format(/created, "dd")}/
        codec:
          parquet:
            schema: >
              {
                "type" : "record",
                "namespace" : "org.opensearch.dataprepper.examples",
                "name" : "Data",
                "fields" : [
                  { "name" : "created", "type" : {"type" : "string", "logicalType" : "timestamp-micros"}},
                  { "name" : "resource", "type": ["null", "string"]},
                  { "name" : "response_payload", "type" : ["null", "string"]}
                ]
              }
        threshold:
          maximum_size: 10mb
          event_collect_timeout: PT1M
        compression: snappy

Relevant Logs or Screenshots:
The configuration above fails.

Never mind. This feature is not yet supported, as per this GitHub issue ticket.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.