OpenSearch cluster *very* high transport data transfer

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.11.0

Describe the issue:

We have discovered a very large amount of traffic between the nodes of our OpenSearch cluster: almost 70 TB per month is transferred between them. Since we're running in AWS and the cluster is spread across availability zones, this adds almost $700 per month in data transfer costs! That seems quite excessive for a cluster that ingests about 100 GB per day…

The cluster consists of 3 data nodes and 3 dedicated manager nodes. All incoming traffic (query, write) is load balanced across the data nodes.

Most nodes seem to send/receive at a baseline of about 2 MB/s, with spikes of around 4-6 MB/s during the day (which makes sense, as that is when most log traffic is written).

I have noticed that the manager nodes are responsible for a surprisingly large part of the traffic. The leader node is continuously sending about 2 MB/s and the other managers about 1 MB/s, at all times.
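
For anyone wanting to reproduce these per-node numbers, here is a minimal sketch that samples the transport counters from the nodes stats API twice and prints the average rate over the interval (it assumes an unauthenticated endpoint at http://localhost:9200; adjust the URL and auth/TLS for your cluster):

```python
# Minimal sketch: sample per-node transport counters twice and print the
# average tx/rx rate over the interval. Endpoint and interval are assumptions.
import time

import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS
INTERVAL = 60  # seconds between the two samples


def transport_bytes():
    stats = requests.get(f"{BASE}/_nodes/stats/transport").json()
    return {
        node["name"]: (node["transport"]["tx_size_in_bytes"],
                       node["transport"]["rx_size_in_bytes"])
        for node in stats["nodes"].values()
    }


before = transport_bytes()
time.sleep(INTERVAL)
after = transport_bytes()

for name, (tx1, rx1) in sorted(after.items()):
    tx0, rx0 = before.get(name, (tx1, rx1))
    print(f"{name}: tx {(tx1 - tx0) / INTERVAL / 1e6:.2f} MB/s, "
          f"rx {(rx1 - rx0) / INTERVAL / 1e6:.2f} MB/s")
```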

Is this normal? It seems extremely excessive that the cluster replication traffic is larger than the indexed data by a factor of 20. If not, any ideas on where to start looking to figure out what may be the cause?

I do think that some of the mappings in the cluster are far from optimal, and we have a lot more indices than we should… so the cluster state is probably larger than it needs to be, but could that really cause this much overhead? (/_cluster/state is almost 30 MB uncompressed/pretty-printed and 1.5 MB gzipped.)
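
For reference, this is roughly how the state size can be measured, a minimal sketch assuming an unauthenticated endpoint at http://localhost:9200:

```python
# Minimal sketch: measure the raw (pretty-printed) and gzip-compressed size
# of the cluster state. Endpoint is an assumption.
import gzip

import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

state = requests.get(f"{BASE}/_cluster/state", params={"pretty": "true"}).content
print(f"uncompressed/pretty: {len(state) / 1e6:.1f} MB")
print(f"gzipped:             {len(gzip.compress(state)) / 1e6:.1f} MB")
```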

Hey @albgus ,

Do you by any chance use segment replication [1] for some or all indices?
Thank you.

[1] Segment replication - OpenSearch documentation

@reta No, we have DOCUMENT as the cluster default replication.type, so that shouldn't be it.
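
For what it's worth, a minimal sketch for double-checking which replication type indices actually use (it assumes the per-index setting is exposed as index.replication.type and that an unauthenticated endpoint at http://localhost:9200 is reachable):

```python
# Minimal sketch: list the replication type per index. Indices without the
# index.replication.type key fall back to the cluster default (DOCUMENT unless
# overridden). Endpoint and the setting name are assumptions.
import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

settings = requests.get(
    f"{BASE}/_all/_settings", params={"flat_settings": "true"}
).json()
for index, body in sorted(settings.items()):
    print(f"{index}: {body['settings'].get('index.replication.type', '<cluster default>')}")
```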

I did find something after some more investigation, though: we have the Prometheus exporter plugin for OpenSearch installed in the cluster. After reading this post about Elasticsearch, I decided to investigate that plugin. It turns out that pausing scraping reduced the network traffic significantly, especially from the manager nodes. After some more digging it seems that the main culprit is the prometheus.indices setting, so I will disable that and see how the traffic is affected over the coming days.
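
In case it helps anyone else, a minimal sketch of turning that setting off via the cluster settings API, assuming the plugin exposes prometheus.indices as a dynamic cluster setting (if it is static only, it has to go into opensearch.yml with a restart instead), again against an assumed unauthenticated http://localhost:9200:

```python
# Minimal sketch: disable per-index metrics in the prometheus exporter plugin.
# Assumes prometheus.indices is accepted as a dynamic cluster setting; if the
# plugin only reads it from opensearch.yml, set it there and restart instead.
import requests

BASE = "http://localhost:9200"  # assumed endpoint, adjust for auth/TLS

resp = requests.put(
    f"{BASE}/_cluster/settings",
    json={"persistent": {"prometheus.indices": False}},
)
print(resp.json())
```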

Thanks @albgus, indeed plugins (like the Prometheus one) can be very chatty; glad you have a suspect.