Failure capacity

coredump17 · January 19, 2024, 10:32am

Hi all,

this is not an issue but a question :). I have read through the documentation but have a gap in my understanding, which I hope someone can help with.

Scenario: Let's say I have 4 servers/OS instances, with each shard having 1 replica. Node 1 dies a sudden death.

What I think will happen once the node timeout has expired:

1. Any replica whose primary was on node 1 will be promoted to primary.
2. The master node will assign new replicas of those shards to nodes 2, 3 and 4.
3. Replica shards that existed on node 1 will be recreated on nodes 2, 3 and 4.

This assumes you have enough storage on nodes 2, 3 and 4 to accommodate the ask.

Can this be changed so that only step 1 is completed and the cluster remains in a yellow state until node 1 has recovered, so that I don't run out of storage? Possibly using cluster.routing.allocation.enable: primary?

If, for instance, node 1 will be down for a long period and I decide to allow the replicas to be pushed onto nodes 2, 3 and 4, can that be forced?

thanks in advance!

gaobinlong · January 20, 2024, 2:41am

There is an index-level setting that can be used to delay the allocation of replica shards when a node leaves:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

The default value is 1m, which means that if a node is gone for more than 1 minute, its replica shards will be allocated to other nodes. You can increase the value based on your needs.

Setting cluster.routing.allocation.enable to primaries also works, but even after the departed node returns, the unassigned replica shards will not be allocated; you have to reset the setting to all for the unassigned replica shards to be allocated normally.
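
For illustration, the toggle described above might look like the following pair of requests. This is a minimal sketch: the choice of "persistent" rather than "transient" is an assumption, not something stated in this thread, so pick whichever fits how you manage cluster settings.

```
# Allow only primary shards to be allocated; replicas stay unassigned (cluster stays yellow)
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Later, once the node is back (or you decide to rebuild the replicas elsewhere),
# re-enable allocation for all shard types
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
```

Switching back to all is also the way to "force" the replicas onto the remaining nodes if the failed node is going to stay down.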

coredump17 · January 20, 2024, 12:54pm

Thanks @gaobinlong. I forgot to mention that I already have this value set. If node_left.delayed_timeout is set and expires, I assume all replicas will get reassigned until the cluster hits its high watermark, likely causing the cluster to stop ingesting data?

I am asking these questions to understand the correct level of redundancy required for a small cluster, i.e. 4 nodes with 1 replica, monitoring storage so that 3/4 of total capacity is the maximum (to allow for one node to fail).
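
As a rough check on the 3/4 figure, assuming identical nodes with data spread evenly and the default low watermark of 85%: if every node is at fraction u of its disk before the failure, the three survivors end up at roughly u × 4/3 once the lost copies are rebuilt. Keeping that below 85% means u ≤ 0.85 × 3/4 ≈ 64% per node, which is somewhat tighter than 75%. Per-node usage can be watched with the cat allocation API, for example (the column selection here is just one possible choice):

```
# Per-node shard count and disk usage; compare disk.percent against your own threshold
GET _cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.total
```

Real clusters are rarely perfectly balanced, so treat the 64% as an estimate rather than a hard limit.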

gaobinlong · February 5, 2024, 4:35am

Yes, after node_left.delayed_timeout expires, the replicas will be assigned to other nodes, but at that point the setting cluster.routing.allocation.disk.watermark.low takes effect. It defaults to 85%: if disk usage on a node is above 85%, the replicas stay unassigned. After 5 failed attempts to allocate the replicas, they remain unassigned even once disk usage drops below 85%. At that point we need to retry the allocation manually with POST _cluster/reroute?retry_failed=true.
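
A possible sequence for diagnosing and retrying, sketched with standard cluster APIs (the column list for _cat/shards is only a suggestion):

```
# Ask the cluster why a shard is unassigned (with no body, it explains the first unassigned shard it finds)
GET _cluster/allocation/explain

# After freeing disk space, retry shards whose allocation attempts were exhausted
POST _cluster/reroute?retry_failed=true

# Confirm that the replicas are now STARTED
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node
```

Note that the explain call returns an error when there are no unassigned shards, which is itself a quick sign that the cluster has recovered.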
