I’ve got a 22 node cluster spanning 2 data centers, with a high bandwidth (100+Gbps) and low latency (<1ms ping time) connection between the 2. We’re moving servers from 1 DC to the other, so we have to shut them down to accomplish this. I’ve only been stopping Elasticsearch on 1 node and waiting for the cluster to rebalance before moving on to the next one. When we were using Elastic’s version 6.8 with ReadOnlyRest there were no problems with this procedure. Since we’ve switched to the latest release from OpenDistro, the cluster becomes unstable when we take a node out of service.
While it’s rebalancing, other nodes will leave the cluster for some reason. I’m seeing messages in the logs like
[2021-11-30T21:05:21,187][WARN ][o.e.c.c.C.CoordinatorPublication] [HOSTNAME1] after [30s] publication of cluster state version [676169] is still waiting
for {HOSTNAME2}{ID}{ID}{IP2}{IP2:9300}{dimr} [SENT_APPLY_COMMIT], {HOSTNAME3}{ID}{ID}{IP3}{IP3:9300}{dimr} [SENT_PUBLISH_REQUEST]
Instead of the cluster simply staying Yellow while the missing shards are recreated, it goes back and forth between Yellow and Red as nodes leave and rejoin the cluster. This means that instead of taking less than an hour to recover, it takes several hours.
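For reference, these are the standard APIs for watching that rebalancing (localhost:9200 is just a placeholder for any node in the cluster):

# overall status plus relocating/unassigned shard counts
curl -s 'localhost:9200/_cluster/health?pretty'
# per-shard recovery progress, showing only the recoveries currently running
curl -s 'localhost:9200/_cat/recovery?active_only=true&v'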
IIRC, this isn’t about OpenSearch specifically, but about changes made in Elasticsearch 7.
We encountered the same issue a while back, so forgive me if I make a few mistakes here - Elasticsearch 7 changed the way nodes coordinate and leave the cluster in a few ways, and added a lag timeout: when a node can’t apply a cluster state update within a set time, it is removed from the cluster and then rejoins.
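To put rough numbers on that, these are (as far as I know) the two timeouts involved and their stock defaults - worth double-checking against your own version before relying on them:

# both appear to be static node settings, so changing them means editing the yml config and restarting
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep -E 'cluster\.(publish\.timeout|follower_lag\.timeout)'
# cluster.publish.timeout       30s  <- the "after [30s] publication ... is still waiting" warning
# cluster.follower_lag.timeout  90s  <- nodes that haven't applied the new state by then are removed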
It tripped me up quite a bit at first, until further testing made me realize that I’d had non-responding nodes before as well - it’s just that nothing was done about them.
I’ll try to track down all the sources again if I have the time; it took quite a bit of digging when I last debugged this, and I believe we also ended up reading the Elasticsearch source.
Not so much reported as handled.
Before, you would have finished the operation in an hour, followed by a day of rebalancing.
Now the nodes leave and rejoin, which is… a way to handle it. Not the best one, but a way. I guess that since Elasticsearch can’t cancel write tasks, this is how they chose to handle a node that stops responding due to write overload (in essence, completely stopping the affected node, which does actually cancel those tasks).
I would suggest trying to increase a few values in the cluster settings if you do have high node bandwidth, using transient settings to test it out (a sample request is sketched after this list):
cluster.routing.allocation.node_concurrent_recoveries
cluster.routing.allocation.node_initial_primary_recoveries
indices.recovery.max_bytes_per_sec (remember to leave some spare overhead on this one)
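Something like the following - the values are only there to illustrate the request shape, not recommendations, and localhost:9200 is a placeholder:

# bump recovery concurrency and bandwidth as transient settings (example values, tune to your hardware)
curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "cluster.routing.allocation.node_initial_primary_recoveries": 8,
    "indices.recovery.max_bytes_per_sec": "500mb"
  }
}'

Transient settings don’t survive a full cluster restart, which makes them convenient for this kind of experiment.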
Same here; I don’t know why it happens after running smoothly for a long time.
From a data node:
[2022-01-11T11:12:03,656][WARN ][o.e.c.s.ClusterApplierService] [data-2-102] cluster state applier task [ApplyCommitRequest{term=28, version=2451480, sourceNode={master-49-254}{DM1C14E9TzapBYMfa9s15w}{UfNP6laXRrqp9ft31SSn9g}{10.3.49.254}{10.3.49.254:9300}{mr}}] took [39.6s] which is above the warn threshold of [30s]
From the master node:
[2022-01-11T11:21:33,226][INFO ][o.e.c.c.C.CoordinatorPublication] [master-49-254] after [10s] publication of cluster state version [2451500] is still waiting for {-data-2-21}{h_c2M14OTsyutTlPoOe0hA}{d20RC6nNQJm4rP7fsZFAMA}{10.5.2.21}{10.5.2.21:9300}{dr} [SENT_APPLY_COMMIT]
I also see occasional log entries about nodes being removed and added, which makes for an unstable cluster.
From the little detail I can gather from these warnings, it seems your nodes are sending data faster than the receiving nodes can write it to disk - investigate and benchmark your nodes’ IOPS / disk throughput.
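fio is the usual tool for putting numbers on that; a rough sketch (the --filename path is a placeholder - point it at the disk the data nodes actually write to, and note that the test file gets created there):

# random-write IOPS test (illustrative sizes and queue depths)
fio --name=randwrite-test --filename=/var/lib/opensearch/fio.test \
    --rw=randwrite --bs=4k --size=2G --iodepth=32 --ioengine=libaio --direct=1
# sequential-write throughput test on the same disk
fio --name=seqwrite-test --filename=/var/lib/opensearch/fio.test \
    --rw=write --bs=1M --size=2G --iodepth=8 --ioengine=libaio --direct=1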
That’s throughput, not IOPS. Benchmarking writes is really hard, and scaling them is hard as well.
We personally tuned ZFS under the hood to batch our requests in RAM and write them out at once in big blocks in order to get IOPS to the level we needed - it has as much to do with your disks/filesystem/caches/RAID/etc. as with OpenSearch/ODFE. I doubt upgrading will actually solve the issue, and it’ll be the same with Elastic Co.’s solutions.
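For what it’s worth, a sketch of the sort of ZFS knobs that control that batching - the dataset name and values are purely illustrative, and the module parameters are OpenZFS-on-Linux specific, so treat it as a starting point rather than a recipe:

# bigger records so flushes hit the disks in larger blocks (tank/opensearch is a placeholder dataset)
zfs set recordsize=1M tank/opensearch
zfs set logbias=throughput tank/opensearch
# let ZFS accumulate more dirty data in RAM between transaction-group flushes
# (defaults are 5 seconds and a fraction of RAM; these are just example values)
echo 15 > /sys/module/zfs/parameters/zfs_txg_timeout
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max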