Adding a new data node craters indexing rate

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.19.1 and 3.1.0

Describe the issue:

After adding a new server to the cluster, we eventually see a huge dropoff in the indexing rate. The queue for Logstash servers writing to OpenSearch completely fills up and our available logs start falling behind. Removing the new server(s) immediately fixes the situation.

We first noticed this when adding 3 new nodes. It happened again when adding a different node. All servers have the same specs and match other nodes in the cluster. It’s stable with 56 data nodes. We have another cluster with similar hardware with 63 nodes that’s never had this issue.
All 4 of the cursed nodes are in the same rack and on the same switch, but other servers in that rack and switch aren’t having problems.

Upgrading from 2.19.1 to 3.1.0 didn’t help. We’re using segment replication on 600+ indices, so the bug in 3.2.0 and 3.3.1 prevents us from upgrading to those.

We finally got the Prometheus exporter installed. After the most recent test of adding 1 node, the opensearch_fs_io_total_write_operations metric jumped on the new node, as expected, but it stayed higher than on any other node in the cluster.
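If that metric is being scraped by Prometheus, the per-node write rate can be compared directly. A sketch (the exact label set, e.g. a `node` label, depends on the exporter version and scrape config):

```
# Disk write operations per second, per node, over 5m windows
rate(opensearch_fs_io_total_write_operations[5m])

# The same rate relative to the cluster average, to make the outlier obvious
rate(opensearch_fs_io_total_write_operations[5m])
  / scalar(avg(rate(opensearch_fs_io_total_write_operations[5m])))
```

A healthy node should hover around 1.0 in the second query; the problem node should stand out well above it.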

Just to add to the mystery, the dropoff isn’t predictable. It’s happened after ~45 minutes, ~4 hours, ~16 hours, and ~38 hours.

@reshippie To get more insight, when this happens again could you run the commands below:

GET _cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,master,disk.used_percent,disk.avail
GET _nodes/stats/fs,indices,thread_pool,os,jvm?pretty
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed
GET _nodes/hot_threads?threads=3&ignore_idle_threads=true

Can you also check for any shards being relocated using the command below:

GET _cat/shards?h=index,shard,prirep,state,node

And run the allocation explain API on the shards in question:

POST _cluster/allocation/explain
{
  "index": "index_name",
  "shard": 0,
  "primary": true
}

How much data is being ingested into this cluster when you add a node?

Usually when a node is added there can be a lot of reallocation taking place, and if heavy indexing is happening at the same time this can cause issues. But since in your case the problem can take up to 38 hours to surface, this seems to point to a problem with the disks on those four nodes, as opposed to the actual operation of OpenSearch.
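If relocation plus heavy indexing does turn out to be the trigger, one thing to try while a new node joins is throttling rebalancing and recovery. The values below are placeholders for illustration, not tuned recommendations:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 1,
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}
```

Transient settings revert on a full cluster restart, so this is safe to experiment with and remove afterwards.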

I’m not sure this is an OpenSearch issue, but the fact that it’s not limited to a single server has me considering everything. I’ve run out of good ideas, so I’m coming up with bad ones. I’ll try your commands during our next test.

Normally we’d be doing 150k/sec at that time of day. The graph for the 4 hours after things went bad became very spiky. A few peaks hit over 200k, but there were troughs down to 20k.

It turns out it was a hardware/OS issue. MD RAID on Debian 12 isn’t playing nice. All of our other servers with MD RAID are on Debian 11 and our servers on Debian 12 have HW RAID.
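For anyone hitting something similar, the md array state can be inspected directly on the affected nodes. A few standard diagnostics (md0 is just an example device name):

```
cat /proc/mdstat                   # array state and any ongoing resync/check
cat /sys/block/md0/md/sync_action  # idle, check, resync, etc.
iostat -x 5                        # per-device utilization and write latency (sysstat package)
```

A background check or resync showing up in sync_action while the indexing rate craters would be a strong hint, and could also explain why the dropoff appears at unpredictable times.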