I learned about Data Prepper yesterday. I read its documentation and peered into the code a bit to figure out what it does and what its use case is.
At the last community meeting it was mentioned that Data Prepper is like Logstash.
From what I gather, Logstash is geared towards one data type, documents, while Data Prepper aims to support multiple data types (metric samples, documents, traces, etc.).
My main question is how it compares with existing open source products in that area, to understand the difference between them. From a brief overview, the OpenTelemetry Collector and Data Prepper are geared towards the same goal: collect (push/pull) telemetry data (the data types I mentioned above), transform it (statefully), and push it somewhere.
How do you see the differences between them, and what motivated Data Prepper's creation? Do you see these projects converging once OTLP becomes fully GA in all its aspects?
Hello @asafmesika , Thanks for raising this great question!
First, I’m not sure I would say that Data Prepper is like Logstash. The recent release of 1.2.0 introduces log ingestion features which are similar to some Logstash features. Currently Data Prepper supports trace and log ingestion. I hope to see metric data as well in the near future, and there is an open issue tracking support for it.
Data Prepper aims to support OTel standards, similar to the OTel Collector. But Data Prepper also aims to work with other data types, and we have a generic event model which is not tied to the OTel specification. We consider OTel an important standard, but do not require it. For example, with log ingestion, users can send logs straight from FluentBit to Data Prepper.
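As a sketch of that FluentBit-to-Data-Prepper path, a minimal log pipeline configuration might look like the following. This is illustrative only: the port number, grok pattern, OpenSearch host, and index name are assumptions for the example, not required values.

```yaml
# Hypothetical Data Prepper pipeline: accept logs over HTTP (e.g. from
# FluentBit's http output), parse each line with grok, and write the
# resulting documents to an OpenSearch index.
log-pipeline:
  source:
    http:
      port: 2021                      # assumed port; FluentBit would target it
  processor:
    - grok:
        match:
          log: ["%{COMMONAPACHELOG}"] # assumed field name and pattern
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        index: apache_logs
```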
I see Data Prepper and the OTel Collector as complementary tools. Indeed, many of our examples consider the scenario of an OTel Collector agent sending traces to Data Prepper. Data Prepper currently supports stateful processing of trace data and has upcoming features for stateful aggregation. It is distributed with a concept of peers to support this stateful work.
I hope this helps answer your question. I’m sure there is more to discuss, and I’m happy to continue.
Thanks for taking the time to answer, David (@dlv). A couple of follow-up questions here.
I know that OpenTelemetry Collector processors can also support stateful processing, yet judging from its architecture it wasn’t designed to scale out as you described. Can you elaborate on this and point to any references to learn more? More specifically, what algorithms and 3rd party systems does it require to do the coordination work (e.g. etcd, ZooKeeper, …)?
Data Prepper currently has a peer-forwarder processor which routes events to the node that should handle them.
Data Prepper uses consistent hashing and a hash-ring to determine which node handles which events. Nodes in Data Prepper discover each other through one of three mechanisms: 1) static IP list; 2) DNS; 3) AWS Cloud Map. The nodes are all identified by IP address. The current peer-forwarder hashes the traceId of the trace, so it is limited to trace events.
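To make the hash-ring idea concrete, here is a minimal sketch of consistent hashing that maps a key such as a traceId to one peer IP. This is an illustration of the general technique, not Data Prepper's actual implementation; the peer IPs, virtual-node count, and hash function are all assumptions for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: maps a key (e.g. a traceId) to one peer IP."""

    def __init__(self, peers, vnodes=16):
        # Place several virtual points per peer on the ring so keys spread
        # evenly and adding/removing a peer only remaps a small fraction
        # of keys to a new owner.
        self._ring = sorted(
            (self._hash(f"{peer}-{i}"), peer)
            for peer in peers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after the key's hash;
        # wrap around to the start of the ring if we fall off the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical peer IPs
ring = HashRing(peers)
# The same traceId always hashes to the same peer, so all spans of one
# trace land on one node for stateful processing.
owner = ring.node_for("3f2d9a7c1b")
```

Because routing depends only on the key's hash and the set of peers, no external coordinator (etcd, ZooKeeper, etc.) is needed; every node computes the same answer independently once service discovery has told it the peer list.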
GitHub issue #700 is an RFC for generalizing the peer-forwarding capability. It will still use the same hash-ring approach and the same service discovery options. But it will support configurable field names (beyond just traceId) and will be part of the core Data Prepper functionality. It will no longer be a processor.
One thing that is not discussed there, but that I’d like to do, is support a pluggable service-discovery mechanism. Right now, new service discovery options have to be added directly into the code through an enum.
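A pluggable mechanism like that might be sketched as an interface with one implementation per discovery option. This is purely a hypothetical shape in Python (Data Prepper itself is Java, and its real mechanism is the enum mentioned above); the class and method names are invented for illustration.

```python
import socket
from abc import ABC, abstractmethod


class PeerDiscovery(ABC):
    """Hypothetical pluggable interface: each implementation knows one way
    to find the current set of peer IP addresses."""

    @abstractmethod
    def discover(self):
        """Return the list of peer IP addresses."""


class StaticListDiscovery(PeerDiscovery):
    """Peers are fixed ahead of time in configuration."""

    def __init__(self, ips):
        self._ips = list(ips)

    def discover(self):
        return self._ips


class DnsDiscovery(PeerDiscovery):
    """Peers are whatever addresses a shared DNS name resolves to."""

    def __init__(self, hostname):
        self._hostname = hostname

    def discover(self):
        # Collect every address record returned for the hostname.
        infos = socket.getaddrinfo(self._hostname, None)
        return sorted({info[4][0] for info in infos})
```

With such an interface, a new discovery backend (say, a different cloud provider's registry) could be dropped in as another implementation instead of editing a core enum.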