Data Prepper: read line with JSON and text

Describe the issue:
Hi,
I am trying to read some source data where each line contains a mix of an integer, a timestamp, and JSON. For example:
213123131 2023-08-02T23:56:00.000Z ("key":{"key1" : "value1"}}
213123132 2023-08-02T23:56:00.000Z ("key2":{"key3" : "value1"}}



213123132 2023-08-02T23:56:00.000Z ("keyX":{"keyY" : "valueZ"}}

I am trying to understand which processors are the best way to extract the JSON as JSON rather than as a string. Right now it gets passed to OpenSearch as a string.

Thanks


Hello @sputmayer ,

Thank you for your interest in Data Prepper. This should be possible using the grok processor. Then, if you'd like to parse the extracted JSON, you can use the parse_json processor.

First, I noticed that your JSON starts with a parenthesis instead of a curly brace:

213123132 2023-08-02T23:56:00.000Z ("key2":{"key3" : "value1"}}

I'm going to assume this is a copy-paste error. If not, we can discuss possible solutions further.
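
(If the parenthesis really is in your data, one option would be to normalize it between the grok and parse_json steps shown below. Here is a rough sketch using the substitute_string processor; the regex and replacement are illustrative and would need tuning against your actual data.)

- substitute_string:
    entries:
      - source: json
        from: '^\('   # replace a leading '(' in the json field
        to: '{'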

OK. Let’s get into the solution.

If we just use grok, you can have a configuration similar to the following.

grok-pipeline:
  source:
    file:
      path: /usr/share/test.log
      record_type: event
  processor:
    - grok:
        match:
          message: ['%{INT:number:int} %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:json}']
  sink:
    - stdout:

The key part is this grok pattern: '%{INT:number:int} %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:json}'. It looks for an integer value (the :int suffix converts the capture to an integer rather than a string), then an ISO-8601 timestamp. Finally, it captures the rest of the line into a field named json.

If you run this, you will get output like the following:

{"message":"213123131 2023-08-02T23:56:00.000Z {\"key\":{\"key1\" : \"value1\"}}","number":213123131,"json":"{\"key\":{\"key1\" : \"value1\"}}","timestamp":"2023-08-02T23:56:00.000Z"}
{"message":"213123132 2023-08-02T23:56:00.000Z {\"key2\":{\"key3\" : \"value1\"}}","number":213123132,"json":"{\"key2\":{\"key3\" : \"value1\"}}","timestamp":"2023-08-02T23:56:00.000Z"}

You will see that the json key holds the JSON as a string. If you'd like to parse it, you can use the parse_json processor. The next example does that, and also deletes the original message field and the JSON string that grok placed in json.

grok-pipeline:
  source:
    file:
      path: /usr/share/test.log
      record_type: event
  processor:
    - grok:
        match:
          message: ['%{INT:number:int} %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:json}']
    - parse_json:
        source: json
    - delete_entries:
        with_keys:
          - message
          - json
  sink:
    - stdout:

Running this on your input yields:

{"number":213123131,"timestamp":"2023-08-02T23:56:00.000Z","key":{"key1":"value1"}}
{"number":213123132,"timestamp":"2023-08-02T23:56:00.000Z","key2":{"key3":"value1"}}

You can use this as a starting point to manipulate the data as you see necessary from here. Feel free to reach out with more questions.
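
Since your goal is to send the data to OpenSearch, the last step would be to swap the stdout sink for the opensearch sink. A minimal sketch; the host, index name, and credentials are placeholders for your environment:

sink:
  - opensearch:
      hosts: ["https://localhost:9200"]
      index: parsed-logs
      username: admin
      password: admin
      insecure: true   # skips TLS verification; for local testing only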

Thanks, I was able to make it work but this provides a much better understanding. Appreciate it.
