Data Prepper 1.5.0 - Log ingestion from S3

The Data Prepper maintainers released Data Prepper 1.5.0 last week. It adds support for ingesting log data from S3 objects, discovering those objects through an SQS queue that receives S3 Event Notifications.
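A minimal pipeline sketch is below. The queue URL, region, role ARN, and index name are placeholders, not real values; see the S3 source documentation for the full set of options.

```yaml
# Sketch of an S3-source pipeline. All ARNs, URLs, and names are placeholders.
s3-log-pipeline:
  source:
    s3:
      notification_type: "sqs"        # discover new objects via S3 Event Notifications on SQS
      compression: "gzip"             # adjust to match how your log objects are stored
      codec:
        newline:                      # treat each line of the object as one event
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/s3-notifications"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/data-prepper-s3"
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-logs"
```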

More details are available in the launch blog post.

Does Data Prepper support multiple instances reading from S3 without stepping on each other's toes? That is, by using SQS for notifications of new S3 events, will it ensure that each Data Prepper instance does not re-process the same S3 file?

@searchspark2310 , that’s a great question. SQS will help with this. Each Data Prepper node reads from SQS, and messages will generally not be duplicated. However, SQS provides at-least-once delivery, so the same message may occasionally be delivered twice. In those situations, OpenSearch would end up with duplicate records.
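If your events carry a unique identifier, one way to make a re-delivered message harmless is to index with deterministic document IDs, so a duplicate delivery overwrites the same document instead of creating a second one. This technique isn’t specific to the S3 source; it uses the OpenSearch sink’s `document_id_field` option. A sketch, assuming a hypothetical `log_id` field on each event:

```yaml
# Sketch: idempotent indexing via deterministic document IDs.
# Assumes each event carries a unique "log_id" field (hypothetical name).
# A re-delivered SQS message then overwrites the same document rather
# than creating a duplicate. Drop-in replacement for the sink above.
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: "s3-logs"
        document_id_field: "log_id"
```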

Is strict avoidance of duplicates a requirement for your use-case?

Data Prepper has an aggregate processor that can deduplicate events when there is a key that can be used for uniqueness. It currently only works in single-node deployments; the Core Peer Forwarding proposal would allow it to work across multiple nodes. Feel free to comment on those issues, or create a new issue if there is something specific you would like.
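For a single-node deployment, a deduplication sketch might look like the following. It assumes events carry a unique `request_id` field (hypothetical name) and a `remove_duplicates`-style action; verify the action name and options against the aggregate processor documentation for your Data Prepper version.

```yaml
# Sketch: single-node deduplication with the aggregate processor.
# "request_id" is a hypothetical uniqueness key; substitute your own.
  processor:
    - aggregate:
        identification_keys: ["request_id"]   # events sharing this key are grouped together
        action:
          remove_duplicates:                  # keep only the first event in each group
        group_duration: "180s"                # how long a group is tracked before it is concluded
```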

Yes, avoidance of duplicate records is unfortunately a requirement for us. I don’t think the aggregate processor will be of much help either, since you stated it must be run on a single node.

Backstory on why I’m asking: we ran into a situation where we used Logstash’s S3 input plugin (the basic one) to have one node process files from S3, but it could not keep up with the high rate at which the S3 log files were being generated. And since the Logstash S3 input plugin can only be run on one node, it could not be a solution for us.

We were hoping that this Data Prepper plugin would solve our problem and could be run on multiple nodes for scalability.

Reading more into SQS, we may give the FIFO queue type a try, since it offers “exactly-once” processing instead of “at-least-once” delivery. More to come.

@searchspark2310 , I don’t believe that S3 Event Notifications can send events to an SQS FIFO queue. Also, Data Prepper’s S3 source doesn’t support reading from FIFO queues. If you believe that adding this support would help, please let us know, and we can discuss what we might be able to do.