OpenDistro cluster becomes unstable after losing a node

I’ve got a 22-node cluster spanning 2 data centers, with a high-bandwidth (100+ Gbps), low-latency (<1 ms ping time) connection between the two. We’re moving servers from one DC to the other, so we have to shut them down to accomplish this. I’ve only been stopping Elasticsearch on one node at a time and waiting for the cluster to rebalance before moving on to the next one. When we were using Elastic’s version 6.8 with ReadOnlyRest there were no problems with this procedure. Since we switched to the latest release from OpenDistro, the cluster becomes unstable when we take a node out of service.

While it’s rebalancing, other nodes will leave the cluster for some reason. I’m seeing messages in the logs like

[2021-11-30T21:05:21,187][WARN ][o.e.c.c.C.CoordinatorPublication] [HOSTNAME]1 after [30s] publication of cluster state version [676169] is still waiting

Instead of the cluster simply staying Yellow while the missing shards are recovered, it bounces back and forth between Yellow and Red as nodes leave and rejoin the cluster. This means that instead of taking less than an hour to recover, it takes several hours.

Any ideas on why this would be happening?


IIRC, this isn’t about OpenSearch specifically, but about changes made in Elasticsearch 7.
We encountered the same issue a while back, so forgive me if I make a few mistakes here. Elasticsearch 7 changed the way nodes coordinate and leave the cluster, and added a lag timeout: when a node can’t apply a cluster state update within a set time, it is removed from the cluster and has to rejoin.
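If I’m remembering the knobs right, the relevant Elasticsearch 7 settings are `cluster.publish.timeout` (default 30s, which matches the publication warning above) and `cluster.follower_lag.timeout` (default 90s, after which a lagging node is removed). Both are static, so changing them means editing elasticsearch.yml and restarting, but you can at least confirm the values in effect — assuming here the cluster listens on localhost:9200 with no auth:

```shell
# Print the publication / lag timeouts currently in effect.
# localhost:9200 and the absence of auth are assumptions - adjust for your setup.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep -E '"cluster\.(publish|follower_lag)\.timeout"'
```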

It tripped me up quite a bit at first, until further testing made me realize I had non-responding nodes before too — it’s just that nothing was done about them.

Will try to find all the sources again if I have the time; it took quite a bit of digging when I last debugged this, and I believe we also ended up reading the Elasticsearch source.

Ah, so the silent failure is now a reported failure. Got it.


Not so much reported as handled.
Before, you would have finished the operation in an hour, followed by a day of rebalancing.
Now the nodes leave and rejoin, which is… a way to handle it. Not the best one, but a way. I guess that since Elasticsearch can’t cancel write tasks, this is how they chose to handle a node not responding due to write overload (in essence, completely stopping the affected node, which does actually cancel those tasks).

I would suggest trying to increase a few values in the cluster settings if you do have high bandwidth between nodes, using transient settings so you can test it out:
indices.recovery.max_bytes_per_sec (remember to leave some spare headroom on this one)
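For example, the recovery throttle can be raised transiently, so it resets if the node restarts. The endpoint assumes localhost:9200 with no auth, and the 200mb value is only a placeholder — size it below your real per-node disk and network capacity:

```shell
# Transiently raise the recovery throttle from its 40mb default.
# 200mb is a placeholder - leave headroom below actual disk/network capacity.
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.recovery.max_bytes_per_sec": "200mb"}}'
```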

Same here, and I don’t know why it happens after running smoothly for a long time :cry:

From a data node:

[2022-01-11T11:12:03,656][WARN ][o.e.c.s.ClusterApplierService] [data-2-102] cluster state applier task [ApplyCommitRequest{term=28, version=2451480, sourceNode={master-49-254}{DM1C14E9TzapBYMfa9s15w}{UfNP6laXRrqp9ft31SSn9g}{}{}{mr}}] took [39.6s] which is above the warn threshold of [30s]..............

From the master node:

[2022-01-11T11:21:33,226][INFO ][o.e.c.c.C.CoordinatorPublication] [master-49-254] after [10s] publication of cluster state version [2451500] is still waiting for {-data-2-21}{h_c2M14OTsyutTlPoOe0hA}{d20RC6nNQJm4rP7fsZFAMA}{}{}{dr} [SENT_APPLY_COMMIT]

I also see nodes being removed and re-added from time to time, which makes the cluster unstable:

[2022-01-11T10:23:28,903][INFO ][o.e.c.s.ClusterApplierService] [master-49-211] removed {{data-2-102}{e7D07IGtTZmAqsoAKzZBlQ}{KAnJ_0XWQ4SGMpKi0KljBA}{}{}{dr}}, term: 28, version: 2451250, reason: ApplyCommitRequest{term=28...

[2022-01-11T10:23:32,818][INFO ][o.e.c.s.ClusterApplierService] [master-49-211] added {{data-2-102}{e7D07IGtTZmAqsoAKzZBlQ}{KAnJ_0XWQ4SGMpKi0KljBA}{}{}{dr}}, term: 28, version: 2451252, reason: ApplyCommitRequest{term=28........

which leads the cluster to go Yellow/Red; sometimes it happens to several nodes at once. :crying_cat_face:

From the little detail I can gather from these warnings, it seems data is arriving faster than your nodes can write it to disk — investigate and benchmark your nodes’ IOPS and disk throughput.
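As a rough first pass (fio will give far better numbers), you can separate small-block synchronous-write IOPS from large-block sequential throughput with dd. `TARGET` here is a placeholder and should point at the actual Elasticsearch data path:

```shell
# Crude probe only - use fio for serious benchmarking.
# Point TARGET at the real data disk; /tmp is just a stand-in here.
TARGET=/tmp
# Many small synchronous writes approximate an IOPS-bound workload:
dd if=/dev/zero of="$TARGET/iops_test" bs=4k count=2000 oflag=dsync
# Large sequential blocks approximate throughput (what iotop mostly shows):
dd if=/dev/zero of="$TARGET/tput_test" bs=1M count=256 conv=fdatasync
rm -f "$TARGET/iops_test" "$TARGET/tput_test"
```

A big gap between the two — decent MB/s on large blocks but terrible rates on the dsync’d 4k writes — would point at exactly the kind of IOPS bottleneck described above.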

I used the iotop command to watch; it seems random, anywhere from 2 MB/s → 40 MB/s.
That is why I don’t think it can be related to IOPS.

Also, my data tier is about 20 nodes with a 16-core / 32 GB spec.
Each node contains 3 disks, and each disk is 2 TB.

Average data per day (with codec: best_compression) is about 4 TB.

I think I have to upgrade to OpenSearch. (I still use the latest ODFE version.) :crying_cat_face:

That’s throughput, not IOPS. Benchmarking writes is really hard, and scaling them is hard as well.
We personally tuned ZFS under the hood to batch our requests in RAM and write them out at once in big blocks, in order to bring IOPS up to the level we needed — it has as much to do with your disks/filesystem/caches/RAID/etc. as with OpenSearch/ODFE. I doubt upgrading will actually solve the issue, and it’ll be the same with Elastic Co.’s offerings.
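To give a flavor of what “tuned ZFS” can mean here: the property names below are real, but the dataset name is hypothetical and the values are purely illustrative — the right numbers depend entirely on the hardware:

```shell
# Illustrative ZFS tuning sketch - tank/esdata is a hypothetical dataset.
# ZFS already batches async writes in RAM as transaction groups (txgs);
# these knobs trade a little latency for fewer, larger disk writes.
zfs set recordsize=128K tank/esdata     # align with large segment writes
zfs set logbias=throughput tank/esdata  # prefer batched writes over low-latency ZIL
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout  # flush txgs every 10s (default 5)
```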


Thanks for sharing. Nice to know bulk requests can be handled this way :kissing_heart: :kissing_heart: :kissing_heart: