OpenDistro cluster becomes unstable after losing a node

I’ve got a 22-node cluster spanning 2 data centers, with a high-bandwidth (100+ Gbps), low-latency (<1 ms ping time) connection between the two. We’re moving servers from one DC to the other, so we have to shut them down to accomplish this. I’ve only been stopping Elasticsearch on one node at a time and waiting for the cluster to rebalance before moving on to the next one. When we were using Elastic’s version 6.8 with ReadOnlyRest there were no problems with this procedure. Since we switched to the latest release from OpenDistro, the cluster becomes unstable when we take a node out of service.

While it’s rebalancing, other nodes will leave the cluster for some reason. I’m seeing messages in the logs like

[2021-11-30T21:05:21,187][WARN ][o.e.c.c.C.CoordinatorPublication] [HOSTNAME]1 after [30s] publication of cluster state version [676169] is still waiting

Instead of the cluster simply staying Yellow while the missing shards are recovered, it bounces back and forth between Yellow and Red as nodes leave and rejoin the cluster. This means that instead of taking less than an hour to recover, it takes several hours.

Any ideas on why this would be happening?


IIRC, this isn’t about OpenSearch specifically, but about changes made in Elasticsearch 7.
We encountered the same issue a while back, so forgive me if I make a few mistakes here. Elasticsearch 7 changed the way nodes coordinate and leave the cluster, and added a lag timeout: when a node can’t apply a cluster state update within a set time, it is removed from the cluster and has to rejoin.
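If I’m remembering the knobs right, the relevant Elasticsearch 7 settings are `cluster.publish.timeout` (default 30s, which matches the publication warning above) and `cluster.follower_lag.timeout` (default 90s, after which a lagging node is removed). Both are static, so changing them means editing elasticsearch.yml and restarting, but you can at least confirm the values in effect — assuming here the cluster listens on localhost:9200 with no auth:

```shell
# Print the publication / lag timeouts currently in effect.
# localhost:9200 and the absence of auth are assumptions - adjust for your setup.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep -E '"cluster\.(publish|follower_lag)\.timeout"'
```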

It tripped me up quite a bit at first, until further testing made me realize I had non-responding nodes before too — it’s just that nothing was done about them.

Will try to find all the sources again if I have the time; it took quite a bit of digging when I last debugged this, and I believe we also ended up reading the Elasticsearch source.

Ah, so the silent failure is now a reported failure. Got it.


Not so much reported as handled.
Before, you would have finished the operation in an hour, followed by a day of rebalancing.
Now the nodes leave and rejoin, which is… a way to handle it. Not the best one, but a way. I guess that since Elasticsearch can’t cancel write tasks, this is how they chose to handle a node not responding due to write overload (in essence, completely stopping the affected node, which does actually cancel those tasks).

I would suggest trying to increase a few values in the cluster settings if you do have high bandwidth between nodes, using transient settings so you can test it out:
indices.recovery.max_bytes_per_sec (remember to leave some spare headroom on this one)
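For example, the recovery throttle can be raised transiently, so it resets if the node restarts. The endpoint assumes localhost:9200 with no auth, and the 200mb value is only a placeholder — size it below your real per-node disk and network capacity:

```shell
# Transiently raise the recovery throttle from its 40mb default.
# 200mb is a placeholder - leave headroom below actual disk/network capacity.
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.recovery.max_bytes_per_sec": "200mb"}}'
```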

Same here, and I don’t know why it happens after running smoothly for a long time :cry:

From a data node:

[2022-01-11T11:12:03,656][WARN ][o.e.c.s.ClusterApplierService] [data-2-102] cluster state applier task [ApplyCommitRequest{term=28, version=2451480, sourceNode={master-49-254}{DM1C14E9TzapBYMfa9s15w}{UfNP6laXRrqp9ft31SSn9g}{}{}{mr}}] took [39.6s] which is above the warn threshold of [30s]..............

From the master node:

[2022-01-11T11:21:33,226][INFO ][o.e.c.c.C.CoordinatorPublication] [master-49-254] after [10s] publication of cluster state version [2451500] is still waiting for {-data-2-21}{h_c2M14OTsyutTlPoOe0hA}{d20RC6nNQJm4rP7fsZFAMA}{}{}{dr} [SENT_APPLY_COMMIT]

I also see nodes being removed and re-added from time to time, which makes the cluster unstable:

[2022-01-11T10:23:28,903][INFO ][o.e.c.s.ClusterApplierService] [master-49-211] removed {{data-2-102}{e7D07IGtTZmAqsoAKzZBlQ}{KAnJ_0XWQ4SGMpKi0KljBA}{}{}{dr}}, term: 28, version: 2451250, reason: ApplyCommitRequest{term=28...

[2022-01-11T10:23:32,818][INFO ][o.e.c.s.ClusterApplierService] [master-49-211] added {{data-2-102}{e7D07IGtTZmAqsoAKzZBlQ}{KAnJ_0XWQ4SGMpKi0KljBA}{}{}{dr}}, term: 28, version: 2451252, reason: ApplyCommitRequest{term=28........

which leads the cluster to go Yellow/Red; sometimes it happens to several nodes at once. :crying_cat_face:

From the little detail I can gather from these warnings, it seems data is arriving faster than your nodes can write it to disk — investigate and benchmark your nodes’ IOPS and disk throughput.
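As a rough first pass (fio will give far better numbers), you can separate small-block synchronous-write IOPS from large-block sequential throughput with dd. `TARGET` here is a placeholder and should point at the actual Elasticsearch data path:

```shell
# Crude probe only - use fio for serious benchmarking.
# Point TARGET at the real data disk; /tmp is just a stand-in here.
TARGET=/tmp
# Many small synchronous writes approximate an IOPS-bound workload:
dd if=/dev/zero of="$TARGET/iops_test" bs=4k count=2000 oflag=dsync
# Large sequential blocks approximate throughput (what iotop mostly shows):
dd if=/dev/zero of="$TARGET/tput_test" bs=1M count=256 conv=fdatasync
rm -f "$TARGET/iops_test" "$TARGET/tput_test"
```

A big gap between the two — decent MB/s on large blocks but terrible rates on the dsync’d 4k writes — would point at exactly the kind of IOPS bottleneck described above.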

I used the iotop command to watch; it seems random, anywhere from 2 MB/s → 40 MB/s.
That is why I don’t think it can be related to IOPS.

Also, my data tier is about 20 nodes with a 16-core / 32 GB spec.
Each node contains 3 disks, and each disk is 2 TB.

Average data per day (with codec: best_compression) is about 4 TB.

I think I have to upgrade to OpenSearch. (I still use the latest ODFE version.) :crying_cat_face:

That’s throughput, not IOPS. Benchmarking writes is really hard, and scaling them is hard as well.
We personally tuned ZFS under the hood to batch our requests in RAM and write them out at once in big blocks, in order to bring IOPS up to the level we needed — it has as much to do with your disks/filesystem/caches/RAID/etc. as with OpenSearch/ODFE. I doubt upgrading will actually solve the issue, and it’ll be the same with Elastic Co.’s offerings.
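To give a flavor of what “tuned ZFS” can mean here: the property names below are real, but the dataset name is hypothetical and the values are purely illustrative — the right numbers depend entirely on the hardware:

```shell
# Illustrative ZFS tuning sketch - tank/esdata is a hypothetical dataset.
# ZFS already batches async writes in RAM as transaction groups (txgs);
# these knobs trade a little latency for fewer, larger disk writes.
zfs set recordsize=128K tank/esdata     # align with large segment writes
zfs set logbias=throughput tank/esdata  # prefer batched writes over low-latency ZIL
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout  # flush txgs every 10s (default 5)
```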


Thanks for sharing. Nice to know bulk requests can be handled this way :kissing_heart: :kissing_heart: :kissing_heart: