Cluster was running fine and randomly nodes losing connection

Anthony · June 24, 2025, 12:53pm

@abarocco from what I have seen everything is configured correctly.

There are a few more things you might want to check:

The configuration in opensearch.yml on each node is correct.
The hostnames are all different.

If all of this looks correct, I would recommend posting an update on the GitHub issue with as much detail as possible. Also, if possible, rebooting the nodes might be a good idea.

abarocco · June 24, 2025, 12:55pm

OK @Anthony, thank you so much for your patience and help. One last thing: when you say rebooting the nodes, do you mean the OpenSearch services (already tried) or the virtual machines?

Anthony · June 24, 2025, 1:09pm

I would recommend rebooting the virtual machines.

abarocco · June 25, 2025, 12:26pm

@Anthony Hello Anthony. I’ve tried the reboot, and the problem persists.

In my case this is a POC cluster, so I’ve tried deleting many old indices. Previously there were 9 data nodes with around 1300 shards each and 2-3 TB of data per node.

After reducing the number of shards by deleting indices, there are now about 750 shards per node, and the problem no longer seems to occur. Do you think the problem was caused by too many shards on the nodes?

But if that was the reason, don’t you think we would have seen some sign of it in the GC logs?

abarocco · June 25, 2025, 1:11pm

@Anthony how many shards per node do you recommend, and what size should they be?
My difficulty is that I have 9 data nodes with 12 TB of storage each, and I have to handle 15 days of retention with 5 TB of daily logs.

Anthony · June 25, 2025, 2:51pm

The recommendation is to have no more than 50 GB per shard (preferably less) and a maximum of 20 shards per 1 GB of heap on the node. If the indices are not read-intensive, I would also suggest reducing the replicas to 1, if not done so already.
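
As a rough illustration of how these two limits combine with the numbers mentioned in this thread (9 data nodes, 15 days of retention, 5 TB of daily logs), here is a minimal Python sketch. Several things in it are assumptions and not facts from the thread: the helper name `shard_plan` is made up for this example, the 5 TB per day is treated as the on-disk size of the primary shards, and the 31 GB heap per node is only a placeholder, since the actual heap size was never stated. Treat the result as a sanity check, not as a sizing guarantee.

```python
import math


def shard_plan(daily_tb, retention_days, data_nodes, heap_gb_per_node,
               replicas=1, max_shard_gb=50, shards_per_heap_gb=20):
    """Rough shard-count estimate based on the two rules of thumb above.

    Hypothetical helper for illustration only; all inputs are assumptions
    that you should replace with your own measured values.
    """
    # Total primary data kept on disk over the retention window, in GB.
    primary_gb = daily_tb * 1024 * retention_days
    # Fewest primary shards that keep every shard at or below max_shard_gb.
    min_primary_shards = math.ceil(primary_gb / max_shard_gb)
    # Each replica adds one full copy of every primary shard.
    total_shards = min_primary_shards * (1 + replicas)
    return {
        "min_primary_shards": min_primary_shards,
        "total_shards": total_shards,
        # Shards each node holds if they are spread evenly.
        "shards_per_node": math.ceil(total_shards / data_nodes),
        # Ceiling from the "20 shards per 1 GB of heap" guideline.
        "shard_limit_per_node": heap_gb_per_node * shards_per_heap_gb,
        # Disk needed per node, in TB, including replica copies.
        "storage_tb_per_node": primary_gb * (1 + replicas) / 1024 / data_nodes,
    }


# Values from the thread, plus an assumed 31 GB heap per node.
plan = shard_plan(daily_tb=5, retention_days=15, data_nodes=9,
                  heap_gb_per_node=31)
for key, value in plan.items():
    print(f"{key}: {value}")
```

Comparing `shards_per_node` with `shard_limit_per_node`, and `storage_tb_per_node` with the 12 TB available on each node, shows which constraint is hit first. Changing `replicas` or `max_shard_gb` makes it easy to see how each choice moves the totals.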
