This is an older question, and the OP may already have an answer, but I thought I would share a few thoughts for those who might find it later.
There are a few things you need to know to size a cluster for logging use-cases…
- average indexed document size
This is not the same as the size of the raw log; it depends on how thoroughly the log is parsed and how many values are extracted into individual fields. It can be determined by querying the ES REST API and doing a bit of math (see the sketch after this list).
- peak ingest rate
There must be enough nodes to handle the peak ingest rate, or back pressure can result in lost data.
- average ingest rate over 24 hours
This is used to calculate the total volume of index data per day.
- retention period for searchable data
If data must be searchable for longer periods, it may be necessary to use a hot/warm architecture or a similar multi-tier strategy.
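As a rough illustration of that "bit of math" for the first bullet, here is a minimal Python sketch that pulls doc count and store size from the index stats API. The URL and the `logs-*` index pattern are assumptions; adjust them to your cluster:

```python
# Estimate the average indexed document size from the _stats API.
# Assumes an unsecured cluster on localhost and a "logs-*" index pattern.
import requests

ES_URL = "http://localhost:9200"   # assumption: adjust to your cluster
INDEX_PATTERN = "logs-*"           # assumption: adjust to your log indices

stats = requests.get(f"{ES_URL}/{INDEX_PATTERN}/_stats/docs,store").json()

primaries = stats["_all"]["primaries"]
doc_count = primaries["docs"]["count"]
store_bytes = primaries["store"]["size_in_bytes"]

print(f"{doc_count} docs, {store_bytes} bytes on disk "
      f"-> ~{store_bytes / doc_count:.0f} bytes per indexed document")
```

Note this looks at primaries only, so the figure excludes replica overhead.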
I will start with an assumption… raw logs average 250 bytes, and the indexed size averages 350 bytes per log.
100GB of logs would be 400 million logs per day, or 4630 logs/second. If we assume the peak is 50% greater than the average, the peak would be 6945/s. With appropriate hardware, in a multi-node cluster, the ingest rate per node is around 15000 logs/sec. So the peak ingest rate isn’t going to be an issue when sizing a cluster.
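The same arithmetic as a back-of-the-envelope sketch, using the assumed numbers above:

```python
# Back-of-the-envelope ingest rate math for the assumptions above.
RAW_LOG_BYTES = 250          # assumed average raw log size
DAILY_RAW_BYTES = 100e9      # 100GB of raw logs per day
PEAK_FACTOR = 1.5            # assume peak is 50% above average

logs_per_day = DAILY_RAW_BYTES / RAW_LOG_BYTES      # 400 million
logs_per_second = logs_per_day / 86_400             # ~4630/s
peak_per_second = logs_per_second * PEAK_FACTOR     # ~6945/s

print(f"{logs_per_day:,.0f} logs/day, ~{logs_per_second:,.0f}/s average, "
      f"~{peak_per_second:,.0f}/s peak")
```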
400 million logs per day at an average indexed size of 350 bytes per log results in 140GB of index data per day. Add one replica for redundancy, and that becomes 280GB per day. The maximum recommended storage volume for a node to which data is actively written is 6-8TB. However, to avoid filling the disk completely, Elasticsearch will not allocate new shards on a volume that is over 85% full. So even at the 8TB node size, the real capacity available for Elasticsearch data is about 6.8TB.
A three-node cluster would thus provide 20.4TB of storage, or roughly 73 days of retention. Additional nodes would be required to extend the retention period beyond that.
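And the storage math as a sketch, with the 85% watermark and replica count as parameters so you can plug in your own numbers:

```python
# Rough retention math using the assumptions from the text.
LOGS_PER_DAY = 400e6
INDEXED_BYTES_PER_LOG = 350
REPLICAS = 1                 # one replica copy of every primary shard
NODE_VOLUME_TB = 8           # data volume per actively written node
USABLE_FRACTION = 0.85       # ES stops allocating shards above ~85% disk use
NODES = 3

daily_gb = LOGS_PER_DAY * INDEXED_BYTES_PER_LOG * (1 + REPLICAS) / 1e9  # ~280GB/day
usable_gb = NODES * NODE_VOLUME_TB * USABLE_FRACTION * 1000             # ~20,400GB
retention_days = usable_gb / daily_gb                                   # ~73 days

print(f"{daily_gb:.0f}GB/day with replicas, {usable_gb / 1000:.1f}TB usable, "
      f"~{retention_days:.0f} days of retention")
```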
At these relatively low ingest rates, you shouldn't require dedicated master nodes, as long as you don't cut corners on the hardware. SSD storage is a MUST! 8 CPU cores and 64GB RAM is the minimum; 12 cores and 96GB will handle complex queries better, and 16 cores with 128GB better still. Beyond that there are diminishing returns, and it is better to add more nodes rather than bigger nodes.
NOTE: there are a lot of potential exceptions here, but in most cases this should serve you well.