Hi @asafmesika,
Nice to hear that you’re looking at improving shard allocation. While I’m not actively working on Shard Allocator these days, I’ve spent considerable time working in this space, and I’m happy to brainstorm.
I think both the options you presented are viable; I was considering the same two routes for persisting data. This data can grow very quickly, and you may soon realize you want to add more signals, which is why I’m a little skeptical about the cluster state approach. OpenSearch will distribute the delta in cluster state with every leader-follower check, and observability data like this will inevitably have some diffs to propagate. There is a non-trivial cost to sharing this data across nodes in a large cluster (which, interestingly, is also where these optimized shard allocators would shine).
I like the idea of storing it in a local index. I think this is already done for some plugins, possibly the alerting and index management ones (someone on the forum can confirm). We can weigh the pros and cons of each approach as you design this in more detail.
While we’re on the topic, there are a few other things we should think about…
- Using actual resource footprint signals instead of proxy signals - things like actual CPU, RAM, JVM heap, and network consumption for a shard may be strong indicators of the right node to place it on.
- How frequently do we capture these signals, and how? I think there is an existing NodeStats mechanism that could be leveraged…
- Seeding new shards with initial values of these signals - we could use exponential moving averages to seed initial weight values for shards.
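To make the EMA idea concrete, here is a minimal sketch of a smoothed signal tracker. This is illustrative only; the class and field names are hypothetical and not part of OpenSearch:

```java
// Sketch: exponential moving average (EMA) for smoothing a shard's
// resource signal. All names here are hypothetical, not OpenSearch APIs.
public class SignalEma {
    private final double alpha;   // smoothing factor, 0 < alpha <= 1
    private Double value;         // null until the first sample arrives

    public SignalEma(double alpha) {
        this.alpha = alpha;
    }

    // Fold in a new sample; the first sample seeds the average.
    public double update(double sample) {
        value = (value == null) ? sample : alpha * sample + (1 - alpha) * value;
        return value;
    }

    public double current() {
        return value == null ? 0.0 : value;
    }
}
```

A new shard could then be seeded with the current EMA of comparable shards (say, shards of the same index) instead of starting from zero.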
- How frequently and when do we change allocations or rebalance? Rebalancing is a disruptive operation: it requires moving data around and replaying the translog while writes are still in progress. This can take a toll on your cluster.
- To help with the previous point, people generally use throttling. OpenSearch also has throttling limits on concurrent recoveries. But things get interesting when throttling needs to play well with our overall allocation plan…
- Say our algo realizes it needs to move 10 shards to different nodes, but throttling only lets us move 2 at a time. How will those decisions change after we’ve moved a subset of shards? Suppose we move 4 shards and the signals then suggest a different plan.
- Do we go with a greedy approach? Or do we execute the full initial plan for those 10 shards?
- Where do we store the plan for those 10 shards, so we can come back to it as more shards open up to move based on throttling limits?
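To illustrate the "full plan" option, here is a rough sketch of a stored plan drained in throttle-sized batches. The names (`RebalancePlan`, `ShardMove`) are hypothetical, not existing OpenSearch types:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: a persisted rebalance plan drained under a throttle limit.
// Names (RebalancePlan, ShardMove) are hypothetical, not OpenSearch APIs.
public class RebalancePlan {
    public record ShardMove(String shardId, String fromNode, String toNode) {}

    private final Deque<ShardMove> pending = new ArrayDeque<>();

    public RebalancePlan(List<ShardMove> moves) {
        pending.addAll(moves);
    }

    // Take up to `limit` moves this cycle; the rest stay queued so we
    // can come back to them when recovery slots free up.
    public List<ShardMove> nextBatch(int limit) {
        List<ShardMove> batch = new ArrayList<>();
        while (batch.size() < limit && !pending.isEmpty()) {
            batch.add(pending.poll());
        }
        return batch;
    }

    public int remaining() {
        return pending.size();
    }
}
```

A greedy variant would discard `pending` after each batch and recompute from fresh signals; the trade-off is responsiveness to new data versus stability and explainability of the plan.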
- It is incredibly useful to have a deterministic algo for shard allocation - to be able to work backwards and see why shards were placed where they were, or to look at the state of the system and know which shards will move next. Something to consider while designing this.
- Also, it would help to add some commands, APIs, or tools that help visualize past and future allocation/rebalancing decisions.
- Ensure there is no flip-flopping - we don’t want to land in a place where shards keep bouncing from one node to another.
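One common way to avoid flip-flopping is hysteresis: only move a shard when the expected improvement clears a margin. A tiny sketch, with an illustrative threshold and hypothetical names:

```java
// Sketch: hysteresis guard so a shard only moves when the expected
// improvement clears a threshold. Names and numbers are illustrative.
public class FlipFlopGuard {
    private final double minImprovement; // e.g. 0.15 = require a 15% better score

    public FlipFlopGuard(double minImprovement) {
        this.minImprovement = minImprovement;
    }

    // Lower score = better placement. Only move if the target node is
    // better by a clear margin, so small signal noise can't bounce the
    // shard back and forth between near-equal nodes.
    public boolean shouldMove(double currentNodeScore, double targetNodeScore) {
        return targetNodeScore < currentNodeScore * (1 - minImprovement);
    }
}
```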
Finally, the algo should be able to say when it is best to scale the cluster instead of attempting to rebalance shards, and whether that scaling should be vertical or horizontal.
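One possible shape for that decision, as a rough sketch: if the cluster is hot on average, no rebalance can help, because rebalancing moves load around rather than reducing it. The thresholds and the vertical-vs-horizontal rule below are assumptions for illustration only:

```java
// Sketch: deciding between rebalancing and scaling. The 0.8 threshold
// and the vertical-vs-horizontal rule are illustrative assumptions.
public class ScaleAdvisor {
    public enum Action { REBALANCE, SCALE_HORIZONTALLY, SCALE_VERTICALLY }

    // utilizations: per-node fraction of capacity in use (0.0 - 1.0).
    public static Action advise(double[] utilizations, int shardCount) {
        double total = 0;
        for (double u : utilizations) total += u;
        double mean = total / utilizations.length;

        // If the cluster is hot on average, rebalancing can't help:
        // it moves load around, it doesn't reduce it.
        if (mean > 0.8) {
            // With fewer shards than nodes, adding nodes can't spread
            // load any further; bigger nodes are the only option.
            return shardCount > utilizations.length
                    ? Action.SCALE_HORIZONTALLY
                    : Action.SCALE_VERTICALLY;
        }
        return Action.REBALANCE;
    }
}
```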
All of these are hard and interesting aspects of this problem. I’d like to hear more on this from the community…