Data Prepper performance

Given my ingestion rate of 250 GB per day from one source, and that I want to support 100 such sources, what buffer_size and batch_size would be most appropriate for Data Prepper? More broadly, can Data Prepper effectively process such a high volume of data?

I am thinking of setting buffer_size = 1500000 and batch_size = 31250. Is there any downside to using such a large buffer size?


If you have the memory available, setting buffer_size and batch_size to large values will give better performance, with no downside beyond the memory the buffer uses. Here is an example of performance testing that was run for log ingestion (data-prepper/ at main · opensearch-project/data-prepper · GitHub). Note that the buffer size was 200,000 in that test, but 1.5M is fine too, depending on the amount of memory available.
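To get a feel for how much memory "available" means here, a rough back-of-the-envelope calculation with the numbers from the question may help. This is only a sketch: the average event size of 1,000 bytes and the single-node deployment are assumptions made for illustration, not figures from the question, and the real heap footprint per event in the JVM will be larger than the raw payload size. It also assumes buffer_size counts events rather than bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate(gb_per_day_per_source, sources, avg_event_bytes, buffer_size, nodes=1):
    """Rough sizing numbers for an in-memory event buffer.

    avg_event_bytes and nodes are assumptions supplied by the caller;
    the result covers raw payload only, not JVM object overhead.
    """
    # Total ingest rate across all sources, in bytes per second (decimal GB).
    bytes_per_sec = gb_per_day_per_source * 1e9 * sources / SECONDS_PER_DAY
    # Event rate each node has to absorb if load is spread evenly.
    events_per_sec_per_node = bytes_per_sec / avg_event_bytes / nodes
    return {
        "total_mb_per_sec": bytes_per_sec / 1e6,
        "events_per_sec_per_node": events_per_sec_per_node,
        # Raw payload held by a completely full buffer.
        "buffer_payload_gb": buffer_size * avg_event_bytes / 1e9,
        # How long a full buffer can absorb ingest if the sink stalls.
        "seconds_of_ingest_buffered": buffer_size / events_per_sec_per_node,
    }


# 250 GB/day per source, 100 sources, assumed 1,000-byte events, buffer_size = 1500000.
result = estimate(250, 100, 1000, 1_500_000)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions a full buffer of 1.5M events holds only a few seconds of traffic on a single node, so at this volume the number of Data Prepper nodes matters at least as much as the buffer settings. Changing avg_event_bytes and nodes to match your own data gives a more realistic picture.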

