OpenSearch cluster *very* high transport data transfer

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.11.0

Describe the issue:

We have discovered a very large amount of traffic between the nodes of our OpenSearch cluster: almost 70 TB per month is transferred between them. Since we're running in AWS and the cluster is spread across availability zones, this adds almost $700 per month in data transfer costs! That seems quite excessive for a cluster that ingests about 100 GB per day…

The cluster consists of 3 data nodes and 3 dedicated manager nodes. All incoming traffic (query, write) is load balanced across the data nodes.

Most nodes seem to send/receive at a baseline of about 2 MB/s, with spikes of around 4-6 MB/s during the day (which makes sense, as that is when most log traffic is written).

I have noticed that the manager nodes are responsible for a surprisingly large part of the traffic. The leader node is continuously sending about 2 MB/s and the other managers about 1 MB/s, at all times.
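
For anyone wanting to reproduce these per-node numbers, here is a minimal sketch that samples the transport counters from the nodes stats API twice and prints the average rate over the interval (it assumes an unauthenticated endpoint at http://localhost:9200; adjust the URL and auth/TLS for your cluster):

```python
# Minimal sketch: sample per-node transport counters twice and print the
# average tx/rx rate over the interval. Endpoint and interval are assumptions.
import time

import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS
INTERVAL = 60  # seconds between the two samples


def transport_bytes():
    stats = requests.get(f"{BASE}/_nodes/stats/transport").json()
    return {
        node["name"]: (node["transport"]["tx_size_in_bytes"],
                       node["transport"]["rx_size_in_bytes"])
        for node in stats["nodes"].values()
    }


before = transport_bytes()
time.sleep(INTERVAL)
after = transport_bytes()

for name, (tx1, rx1) in sorted(after.items()):
    tx0, rx0 = before.get(name, (tx1, rx1))
    print(f"{name}: tx {(tx1 - tx0) / INTERVAL / 1e6:.2f} MB/s, "
          f"rx {(rx1 - rx0) / INTERVAL / 1e6:.2f} MB/s")
```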

Is this normal? It seems extremely excessive that the cluster replication traffic is larger than the indexed data by a factor of 20. If not, any ideas on where to start looking to figure out what may be the cause?

I do think that some of the mappings in the cluster are far from optimal, and we have a lot more indices than we should… so the cluster state is probably larger than it needs to be, but could that really cause this much overhead? (/_cluster/state is almost 30 MB uncompressed/pretty-printed and 1.5 MB gzipped.)
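
For reference, this is roughly how the state size can be measured, a minimal sketch assuming an unauthenticated endpoint at http://localhost:9200:

```python
# Minimal sketch: measure the raw (pretty-printed) and gzip-compressed size
# of the cluster state. Endpoint is an assumption.
import gzip

import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

state = requests.get(f"{BASE}/_cluster/state", params={"pretty": "true"}).content
print(f"uncompressed/pretty: {len(state) / 1e6:.1f} MB")
print(f"gzipped:             {len(gzip.compress(state)) / 1e6:.1f} MB")
```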

Hey @albgus ,

Do you by any chance use segment replication [1] for some or all indices?
Thank you.

[1] Segment replication - OpenSearch documentation

@reta No, we have DOCUMENT as the cluster default replication.type, so that shouldn't be it.
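
For what it's worth, a minimal sketch for double-checking which replication type indices actually use (it assumes the per-index setting is exposed as index.replication.type and that an unauthenticated endpoint at http://localhost:9200 is reachable):

```python
# Minimal sketch: list the replication type per index. Indices without the
# index.replication.type key fall back to the cluster default (DOCUMENT unless
# overridden). Endpoint and the setting name are assumptions.
import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

settings = requests.get(
    f"{BASE}/_all/_settings", params={"flat_settings": "true"}
).json()
for index, body in sorted(settings.items()):
    print(f"{index}: {body['settings'].get('index.replication.type', '<cluster default>')}")
```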

I did find something after some more investigation, though: we have the Prometheus exporter plugin for OpenSearch installed in the cluster. After reading this post about Elasticsearch, I decided to investigate that plugin. It turns out that pausing scraping reduced the network traffic significantly, especially from the manager nodes. After some more digging it seems that the main culprit is the prometheus.indices setting, so I will disable that and see how the traffic is affected over the coming days.
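
In case it helps anyone else, a minimal sketch of turning that setting off via the cluster settings API, assuming the plugin exposes prometheus.indices as a dynamic cluster setting (if it is static only, it has to go into opensearch.yml with a restart instead), again against an assumed unauthenticated http://localhost:9200:

```python
# Minimal sketch: disable per-index metrics in the prometheus exporter plugin.
# Assumes prometheus.indices is accepted as a dynamic cluster setting; if the
# plugin only reads it from opensearch.yml, set it there and restart instead.
import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

resp = requests.put(
    f"{BASE}/_cluster/settings",
    json={"persistent": {"prometheus.indices": False}},
)
print(resp.json())
```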

Thanks @albgus, indeed plugins (like the Prometheus one) can be very chatty; glad you have a suspect.