Failure capacity

Hi all,

This is not an issue but a question :). I have read through the documentation but have a gap in my understanding that I hope someone can help with.

Scenario: Let's say I have 4 servers/OS instances, with each shard having 1 replica. Node 1 dies a sudden death.

What I think will happen once the node timeout has expired:

  1. Any replicas whose primary was on node 1 will be promoted to primary.
  2. The master node will assign new replicas of those shards to nodes 2, 3 and 4.
  3. Replicas that existed on node 1 will be recreated on nodes 2, 3 and 4.

This assumes you have enough storage on nodes 2,3 & 4 to accommodate the ask.

Can this be changed so that only step 1 is completed and the cluster remains in a yellow state until node 1 has recovered, preventing me from running out of storage? Possibly using cluster.routing.allocation.enable: primary?

If, for instance, node 1 will be down for a long period of time and I decide to allow the replicas to be pushed onto nodes 2, 3 and 4, can that be forced?

Thanks in advance!

There is an index-level setting that can be used to delay the allocation of replica shards when a node leaves:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

The default value is 1m, which means that if a node is gone for more than 1 minute, its replica shards will be reallocated to other nodes. You can increase the value based on your needs.
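For the other half of the question (forcing the replicas onto the remaining nodes once you know node 1 will be down for a long time), one option is to drop the same timeout to zero. This is a sketch that assumes you want it applied to every index, hence `_all`; narrow the index pattern if that is not the case:

```
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "0"
  }
}
```

Once the missing replicas have been allocated you can set the timeout back to the longer value you normally use.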

Setting cluster.routing.allocation.enable to primaries also works, but even after the departed node returns, the unassigned replica shards will not be allocated. You have to set it back to all for the unassigned replica shards to be allocated normally.
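As a rough sketch, the two steps could look like this. Using `persistent` rather than `transient` is my assumption here; pick whichever suits how long you want the setting to stick:

```
# Allow only primary shards to be allocated
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Later, once node 1 is back, allow replicas to be allocated again
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
```

While the setting is at primaries, the cluster stays yellow because the replicas remain unassigned.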

Thanks @gaobinlong. I forgot to mention that I already have this value set. If node_left.delayed_timeout is set and expires, I assume all replicas will get reassigned until the cluster hits its high watermark, likely causing the cluster to stop ingesting data?

I am asking these questions to understand the correct level of redundancy required for a small 4-node cluster with 1 replica. My idea is to monitor storage so that 3/4 of total capacity is the maximum (to allow for one node to fail).

Yes, after node_left.delayed_timeout expires, the replicas will be assigned to other nodes, but at that point the setting cluster.routing.allocation.disk.watermark.low takes effect. It defaults to 85%, so if a node's disk usage is above 85%, the replicas will stay unassigned. After 5 failed attempts to allocate a replica, it remains unassigned even if disk usage later drops below 85%. At that point you need to retry the allocation manually with POST _cluster/reroute?retry_failed=true.
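To put a number on the "3/4 = max" rule of thumb, here is a back-of-the-envelope estimate. It assumes N equal-sized nodes with capacity C each, a total data volume D (primaries plus replicas) spread evenly, and it treats the low watermark w as the only limit, which is a simplification:

```latex
% After one node fails, all data must fit on the remaining N-1 nodes
% while each of them stays below the low watermark w:
\frac{D}{(N-1)\,C} \le w
\quad\Longrightarrow\quad
\frac{D}{N\,C} \le w \cdot \frac{N-1}{N}
% With N = 4 and w = 0.85:
\frac{D}{4C} \le 0.85 \times \frac{3}{4} = 0.6375
```

So under these assumptions the alerting threshold for cluster-wide disk usage would be closer to 64% than 75% if all replicas are to be reallocated after losing one node. Uneven shard sizes would push the safe figure lower still.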