Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.11.0
Describe the issue:
We have discovered that there is a very large amount of traffic between the nodes of our OpenSearch cluster: almost 70 TB per month is transferred between the nodes. Since we're running in AWS and the cluster is spread across availability zones, this data transfer costs almost $700 extra per month! That seems quite excessive for a cluster that ingests only about 100 GB per day.
The cluster consists of 3 data nodes and 3 dedicated cluster manager nodes. All incoming traffic (queries, writes) is load balanced across the data nodes.
Most nodes seem to send/receive at a baseline of about 2 MB/s, spiking to around 4-6 MB/s during the day (which makes sense, as most log traffic is written then).
I have noticed that the manager nodes are responsible for a surprisingly large part of the traffic. The leader node continuously sends about 2 MB/s, and the other managers send about 1 MB/s at all times.
Is this normal? It seems extremely excessive that the cluster's internal traffic is larger than the indexed data by a factor of roughly 20x. If it isn't normal, any ideas where to start looking to figure out what the cause may be?
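One way to narrow this down is to sample `/_nodes/stats/transport` twice and diff the `tx_size_in_bytes`/`rx_size_in_bytes` counters per node, which tells you which node is actually moving the bytes. Below is a minimal sketch of that diffing step; the snapshot values and node names are invented for illustration (in practice you'd fetch the two snapshots from the API ~60 s apart):

```python
# Sketch: compute per-node transport throughput from two snapshots of
# GET /_nodes/stats/transport. The snapshots below are fabricated examples;
# real responses have the same nesting (nodes -> <id> -> transport).

def per_node_rates(first, second, interval_s):
    """Return {node_name: (tx_MB_per_s, rx_MB_per_s)} from two snapshots."""
    rates = {}
    for node_id, stats in second["nodes"].items():
        prev = first["nodes"].get(node_id)
        if prev is None:
            continue  # node joined between the two samples; no delta available
        tx = stats["transport"]["tx_size_in_bytes"] - prev["transport"]["tx_size_in_bytes"]
        rx = stats["transport"]["rx_size_in_bytes"] - prev["transport"]["rx_size_in_bytes"]
        rates[stats["name"]] = (tx / interval_s / 1e6, rx / interval_s / 1e6)
    return rates

# Illustrative snapshots taken 60 s apart (byte counts made up):
t0 = {"nodes": {"abc": {"name": "manager-1",
                        "transport": {"tx_size_in_bytes": 1_000_000_000,
                                      "rx_size_in_bytes": 500_000_000}}}}
t1 = {"nodes": {"abc": {"name": "manager-1",
                        "transport": {"tx_size_in_bytes": 1_120_000_000,
                                      "rx_size_in_bytes": 530_000_000}}}}

print(per_node_rates(t0, t1, 60))  # → {'manager-1': (2.0, 0.5)}
```

Running this against each node every minute and graphing the results would at least confirm whether the managers really dominate, and whether the rate correlates with indexing, shard movement, or cluster state changes.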
I do think that some of the mappings in the cluster are suboptimal, and we have a lot more indices than we should, so the cluster state is larger than it could be. But could that really cause this much overhead? (/_cluster/state is almost 30 MB uncompressed/pretty-printed and 1.5 MB gzipped.)
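As a rough sanity check on the cluster-state theory, here is a back-of-envelope calculation using the numbers above. It assumes the worst case where the leader ships the full gzipped state to every other node on each update; normally only diffs are published, with the full state sent only when a node can't apply a diff, so if the arithmetic works out it would suggest frequent full publications rather than expected behavior:

```python
# Back-of-envelope: how often would the leader need to publish the FULL
# 1.5 MB gzipped cluster state to explain its ~2 MB/s outbound traffic?
# (Worst-case assumption; normally diffs, not full states, are published.)
state_mb = 1.5         # gzipped cluster state size, from the measurement above
other_nodes = 5        # 6 nodes total minus the leader itself
leader_tx_mbps = 2.0   # observed outbound rate on the leader

per_full_publish_mb = state_mb * other_nodes          # 7.5 MB per full publish
full_publishes_per_s = leader_tx_mbps / per_full_publish_mb
print(round(full_publishes_per_s, 3))                 # → 0.267
```

So roughly one full-state publication every ~4 seconds would account for all of the leader's traffic, which is why it's worth checking how often the cluster state actually changes (e.g. frequent index creation, mapping updates, or shard relocations).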