Cluster was running fine, then nodes randomly started losing connection

@abarocco from what I have seen, everything is configured correctly.

There are a few more things you might want to check, such as

  • The settings in opensearch.yml are correct on every node.
  • The hostnames are all different.

If all of this looks correct, I would recommend posting an update on the GitHub issue with as much detail as possible. Also, if possible, rebooting the nodes might be a good idea.

OK @Anthony, thank you so much for your patience and help. One last thing: when you say rebooting the nodes, do you mean the OpenSearch services (already tried) or the virtual machines?

I would recommend rebooting the virtual machines.

@Anthony Hello Anthony. I’ve tried the reboot and the problem still persists.

In my case this is a POC cluster, so I’ve tried deleting many old indices. Previously there were 9 data nodes with around 1300 shards each and 2/3TB of data on each node.

After reducing the number of shards by deleting indices, there are now about 750 shards per node and the problem no longer seems to occur. Do you think the problem was caused by too many shards on the nodes?

But if this was the reason, don’t you think we would have seen something like this in the GC logs?

@Anthony how many shards per node do you recommend, and what size should each shard be?
My difficulty is that I have 9 data nodes with 12TB of storage each, and I have to handle 15 days of retention with 5TB of logs per day.

The recommendation is to have no more than 50GB per shard (preferably less), and a maximum of 20 shards per 1 GB of heap on the node. If the indices are not read intensive, I would also suggest reducing the replicas to 1, if not done so already.
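
To put rough numbers on that for the cluster described above, here is a minimal Python sketch of the arithmetic. It is only an estimate under stated assumptions: the function name `shard_budget` is made up for illustration, the 31 GB heap is a guess (the heap size was never mentioned in this thread), 1 TB is taken as 1000 GB, and the 5TB of daily logs is treated as primary data before replicas.

```python
import math


def shard_budget(daily_tb, retention_days, data_nodes, heap_gb,
                 replicas=1, max_shard_gb=50, shards_per_heap_gb=20):
    """Rough shard sizing from the two rules of thumb quoted above.

    Assumptions (not from the thread): 1 TB = 1000 GB, daily_tb is
    primary data only, and shards are spread evenly across data nodes.
    """
    primary_gb = daily_tb * 1000 * retention_days
    total_gb = primary_gb * (1 + replicas)

    # Fewest shards that keeps every shard at or below max_shard_gb.
    min_shards = math.ceil(total_gb / max_shard_gb)

    return {
        "total_gb": total_gb,
        "min_shards": min_shards,
        "shards_per_node": math.ceil(min_shards / data_nodes),
        # Ceiling from "max 20 shards per 1 GB of heap".
        "heap_shard_limit": heap_gb * shards_per_heap_gb,
        "storage_per_node_gb": total_gb / data_nodes,
    }


if __name__ == "__main__":
    # 5TB/day, 15 days, 9 data nodes; heap_gb=31 is an assumed value.
    for replicas in (0, 1):
        print(replicas, shard_budget(5, 15, 9, heap_gb=31, replicas=replicas))
```

Under these assumptions, with no replicas the data fits in about 1500 shards of 50GB, or roughly 167 shards per node, well under both the 620-shard heap ceiling and the 750 shards per node currently in place. With 1 replica the estimated storage per node comes out above 12TB, so it is worth checking whether the 5TB per day figure already includes replicas.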