Versions (OpenSearch/Server OS): 3.4.0/Rocky Linux 9.4
Describe the issue:
I’d like to restart a node without data being shuffled around in the cluster. I tried:
- index.unassigned.node_left.delayed_timeout: 5m (doesn’t work at all, I get unassigned shards immediately with unassigned.reason=NODE_LEFT)
- cluster.routing.allocation.cluster_concurrent_rebalance: 0 (triggers shard initializations)
- cluster.routing.allocation.enable: none (triggers shard relocations when re-enabled)
Note: my settings are
cluster.routing.allocation.balance.index: 1.0
cluster.routing.allocation.balance.threshold: 1.0
cluster.routing.allocation.balance.shard: 0
I also tried with the default of 0.55 for index.
Configuration: 115 di, 5m
All my tests were done with zero reads and zero writes on the cluster.
Is it even possible? I can work with that but I’d like to understand why it’s so hard…
Thank you!
Related Q: Is there a difference between a relocation and an initialization? Both show up in the _cat/recovery endpoint and consume the same amount of network/CPU…
@Camusensei thank you for the question, but perhaps I misunderstood the issue. Your first attempt, using index.unassigned.node_left.delayed_timeout, was correct.
The shards will of course go into unassigned state, but they will not be moved around until after the time delay has expired. The initialisation after restart is also necessary and should be very quick, as no data is actually being moved around.
Can you elaborate on the issue you are seeing if the above is not addressing the question?
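For reference, here is a minimal sketch of how the delayed timeout could be applied to all indices through the index settings API. The base URL http://localhost:9200 and the helper name build_delayed_timeout_request are assumptions for illustration, and the sketch only builds the request; authentication and TLS are left out.

```python
import json
import urllib.request

# Index-level setting discussed in this thread.
SETTING = "index.unassigned.node_left.delayed_timeout"


def build_delayed_timeout_request(base_url, timeout="5m", index="_all"):
    """Build (but do not send) a PUT <index>/_settings request.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    body = json.dumps({"settings": {SETTING: timeout}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_delayed_timeout_request("http://localhost:9200")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the setting against a reachable cluster:
    # urllib.request.urlopen(req)
```

The timeout should be longer than the time the node needs to restart and rejoin the cluster, otherwise replacement replicas start initializing elsewhere before the node is back.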
Thank you for your reply, Anthony!
I think the misunderstanding was on my side, as I was confusing relocation with initialization. Thanks to your input, I finally understood the following; please let me know if I’m wrong.
- An initialization is technically the same as a relocation except that the initialization does not remove the shard on the source node
- An initialization will make use of the data stored on the node if any, avoiding a full data transfer if a smaller update is enough
- When a node is stopped, all primary shards on that node have one of their replicas become the primary shard immediately (regardless of delayed_timeout)
- When a node restarts, it will get reassigned the same shards it used to have
- Only after delayed_timeout will the cluster start to fix the replica shortage. It does so by initializing new shards on other nodes, which can then trigger relocations if those nodes become overloaded
If I’m correct, it means that having a node restart faster than delayed_timeout allows the master to reassign the same shards to the node, so no data moves around needlessly, and this is why you told me my first attempt was correct.
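The sequence described in the list above can be written down as a toy model. This is only a sketch of my understanding as stated in this thread, not OpenSearch code; the function name outcome_after_node_left and the step descriptions are made up for illustration.

```python
def outcome_after_node_left(downtime_seconds, delayed_timeout_seconds):
    """Return the steps the cluster is expected to take, per this thread.

    Toy model only: it compares how long the node was away with
    index.unassigned.node_left.delayed_timeout (both in seconds).
    """
    # Promotion of replicas to primaries does not wait for the timeout.
    steps = ["promote replicas of the lost primaries immediately"]
    if downtime_seconds < delayed_timeout_seconds:
        # The node came back in time, so its shards are reassigned to it
        # and initialized from the data it already has on disk.
        steps.append("reassign the same shards to the restarted node")
    else:
        # The timeout expired first, so the replica shortage is fixed elsewhere.
        steps.append("initialize replacement replicas on other nodes")
        steps.append("possibly relocate shards if those nodes become overloaded")
    return steps


if __name__ == "__main__":
    # A 2-minute restart with a 5-minute delayed_timeout.
    for step in outcome_after_node_left(120, 300):
        print(step)
```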
@Camusensei yes, that is correct. I would only add a slight clarification on the very first point you mentioned: by the time an initialization happens, the source is already gone (UNASSIGNED). Therefore I would frame it as follows: during relocation, the source stays online and keeps serving reads until the target is ready; during initialization, there is no source copy at all.