Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.19.1 and 3.1.0
Describe the issue:
After adding a new server to the cluster, we eventually see a huge dropoff in the indexing rate. The queues on the Logstash servers writing to OpenSearch fill up completely, and our available logs start falling behind. Removing the new server(s) immediately resolves the situation.
We first noticed this when adding 3 new nodes, and it happened again when adding a different single node. All of these servers have the same specs and match the other nodes in the cluster. The cluster is stable with 56 data nodes, and we have another cluster on similar hardware with 63 nodes that has never had this issue.
All 4 of the cursed nodes are in the same rack and on the same switch, but other servers in that rack and switch aren’t having problems.
Upgrading from 2.19.1 to 3.1.0 didn’t help. We’re using segment replication on 600+ indices, so the bug in 3.2.0 and 3.3.1 prevents us from upgrading to those.
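For context, segment replication on these indices is enabled at index creation via the standard `index.replication.type` setting. A sketch of our setup (the index name is illustrative, not one of our real indices):

```json
PUT /logs-example
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT"
    }
  }
}
```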
We finally got the Prometheus exporter installed, and after the most recent test of adding 1 node, the opensearch_fs_io_total_write_operations metric jumped on the new node, as expected, but it then stayed higher than on any other node in the cluster.
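For anyone trying to reproduce: the same per-node write-operation counter is also visible in the `GET _nodes/stats/fs` response under `fs.io_stats.total.write_operations`, so the outlier node can be spotted without the exporter. A minimal sketch, assuming the standard node-stats response shape (the node names and numbers below are made up for illustration):

```python
import json

# Hypothetical, heavily trimmed sample of a GET _nodes/stats/fs response;
# real responses contain many more fields per node.
sample = json.loads("""
{
  "nodes": {
    "aaa": {"name": "data-01",  "fs": {"io_stats": {"total": {"write_operations": 120000}}}},
    "bbb": {"name": "data-02",  "fs": {"io_stats": {"total": {"write_operations": 125000}}}},
    "ccc": {"name": "data-new", "fs": {"io_stats": {"total": {"write_operations": 910000}}}}
  }
}
""")

def write_ops_by_node(stats: dict) -> dict:
    """Map node name -> cumulative fs write operations from _nodes/stats/fs."""
    return {
        node["name"]: node["fs"]["io_stats"]["total"]["write_operations"]
        for node in stats["nodes"].values()
    }

ops = write_ops_by_node(sample)
outlier = max(ops, key=ops.get)
print(outlier, ops[outlier])  # -> data-new 910000
```

In our case the equivalent of `data-new` stays at the top of this ranking long after the post-join shard relocation should have settled.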
Just to add to the mystery, the dropoff isn’t predictable: it has happened after ~45 minutes, ~4 hours, ~16 hours, and ~38 hours.
