Improve the data nodes and shards configuration for performance

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 1.2.3

Describe the issue: I have been tasked with improving the performance and stability of an OpenSearch cluster and I am trying to understand the following:

  1. Is it detrimental to cluster performance if the data nodes do not hold roughly equal amounts of data, i.e. some are at 70-80%+ disk usage while others are at 20-30%?
  2. Why are the data nodes at such a high RAM percentage (95%+ or 100%) the whole time, and is that detrimental to performance?
  3. There are currently 23k+ shards on the cluster, with sizes ranging from a few KB to 1100 MB+. Are the number of shards and this huge difference in their sizes detrimental to performance?
  4. Do you see anything else in the configuration and screenshots that seems wrong or could be problematic?

Configuration:
data node:
  replicas: 28
  heap_size: "10240m"
  storage: "200Gi"
  resources:
    limits:
      cpu: "1"
      memory: 12960Mi
    requests:
      cpu: 500m
      memory: 10240Mi

max shards per node: 2000

Relevant Logs or Screenshots:

  1. Usually it is, just because more data typically means queries on those shards are more expensive, so you'll end up with imbalanced load. (A quick way to check the per-node balance is shown in the API calls after this list.)
  2. It's because OpenSearch will (by default) memory-map the indices. So if your indices are larger than your RAM, you'd normally see this behavior and it's OK; it's just the OS using RAM as cache.
  3. The difference in shard sizes isn’t a problem in itself, it’s just that:
  • if shards of different sizes get allocated to different nodes (see 1), you typically have imbalanced load
  • if you have tons of shards in total (you’re not there, but you seem to be getting there), the cluster will have to deal with a big cluster state, which will slow down non-data operations such as adding an index or recovering from a full cluster restart. I would try to keep the number of shards under 10K if possible, as a rule of thumb
  4. Yes, OpenSearch will likely need more "reserved" memory than your heap size. So if you request 10GB and allocate 10GB as heap, you're not likely to bump into the 12GB limit, but you might exceed the 10GB you requested and maybe the node won't have it.
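If it helps, here are a few read-only API calls that show the things discussed above: disk and shard balance per node, JVM heap vs. total RAM usage, and the overall shard/segment counts. These are standard OpenSearch APIs; the column selection is just an example.

    # Disk usage and shard count per data node (to spot imbalance)
    GET _cat/allocation?v

    # JVM heap vs. total RAM per node; a high ram.percent with a normal
    # heap.percent usually just means the OS page cache is full, which is fine
    GET _cat/nodes?v&h=name,node.role,heap.percent,ram.percent

    # Total shard and segment counts for the whole cluster
    GET _cluster/stats?filter_path=indices.shards.total,indices.segments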

If you’re looking for other tunables, especially for logs and other time-series data, I think you’ll find this presentation useful (oldie but hopefully goldie): Elasticsearch for logs and metrics: A deep dive – Velocity 2016, O’REILLY CONFERENCES - YouTube

Thank you for the thorough answer with helpful insights. I have a few follow-up questions.

As I learn more about how OpenSearch works and about the current cluster, I have found out that the sharding strategy applied right now is 1:1 (one primary shard per index) with 1 replica. So these almost 26k shards represent ~13k indices plus their replicas.

  1. Let's say I try to keep the shards under 10k (which in our case would mean 5k indices plus replicas); that would mean the incoming data should be restructured somewhere in Filebeat or Logstash, if I am understanding correctly?

  2. Or, if we abide by the 10k max-shards rule, should we just add another OpenSearch cluster?

  3. As I understand it, the number of shards affects the number of requests the allocation system makes, and that in turn affects the RAM usage of the client nodes, correct?

You’re welcome! To answer your follow-up questions:

  1. You don't have to restructure your data; you'll just want to use Index State Management to roll over your indices by size. There's some info in the video above about that, but since you say you're using Kubernetes, maybe this post (there's a video there, too) will be more useful: Autoscaling Elasticsearch for Logs with a Kubernetes Operator

The main idea there is to spread your indices (those that have significant traffic) across all your nodes. This also implies that fewer, bigger nodes will be easier to manage than many small nodes. (A sketch of an index template for spreading a high-traffic index is included after this list.)

  2. You can also have multiple clusters if that works for your use-case (e.g. different "classes" of logs are searched separately). Judging by the total size, I don't think that's something you have to do in order to stay under 10K (you'll likely be able to achieve that just by rotating indices by size). But if you can afford to do it, it might make things easier to manage and more independent. And certainly easier to scale.

  3. There is a RAM (heap) overhead for each shard. And, given the same merge policy, more shards will imply more segments, so data will be less "compacted": it will occupy more disk space and cache memory, and it will be slower to search (just the full-text search part; aggregation performance shouldn't suffer). So I think the short answer is "no, not significantly", but I usually recommend limiting the number of shards because:
    A) Once you have too many, the cluster will become unstable (e.g. some master-related operations will time out because the master may be too slow to replicate cluster state changes).
    B) It's easier to balance the number of shards across your data nodes if you rotate indices by size. And really, the balance of load across your data nodes is, in my experience, the biggest factor when it comes to indexing and query performance.
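To make the index template reference above concrete, here is a minimal sketch of a composable index template that gives a high-traffic index series more primary shards and wires it up for ISM rollover. The template name, index pattern, shard count and alias are assumptions for illustration, not recommendations:

    PUT _index_template/high-traffic-logs
    {
      "index_patterns": ["logs-big-app-*"],
      "priority": 200,
      "template": {
        "settings": {
          "index.number_of_shards": 4,
          "index.number_of_replicas": 1,
          "plugins.index_state_management.rollover_alias": "logs-big-app"
        }
      }
    }

For rollover to work, the first index in the series (e.g. logs-big-app-000001) still has to be created with the logs-big-app alias as its write index.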

Thank you!

From all the information I was able to gather, I devised the following strategy.

  1. About 1000 containers send data to Filebeat.
  2. Logstash then filters the data by time and creates a monthly index for each container. The main reason is that a huge portion of those indices contain small amounts of data: between 5-10 MB and less than 1 GB for an entire month. (A sketch of the output configuration is shown below.)
  3. OpenSearch then keeps a 1:1 sharding strategy (one primary shard per index). Having a single shard for the majority of indices would help with performance.

Result: ~2000 shards.
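For illustration only, here is a minimal sketch of what step 2 could look like with the logstash-output-opensearch plugin. The [container][name] field, the logs- prefix and the credentials are assumptions; the key part is the monthly date pattern in the index name:

    output {
      opensearch {
        hosts    => ["https://opensearch:9200"]
        user     => "logstash"
        password => "changeme"
        # One monthly index per container; [container][name] is a hypothetical
        # field carrying the container name from Filebeat
        index    => "logs-%{[container][name]}-%{+YYYY.MM}"
      }
    }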

However, shards would still be very unbalanced because some of the indices receive up to 450GB of data per month. Not to mention the CPU limits.

Is it possible to apply a policy to all indices that would say something like:

  • if an index goes over 10GB → roll over (so that large indices are basically broken down into more shards automatically)
  • and if the data in an index is older than 7 days → delete it

If possible, what would it look like?

Yes, it's possible. You can roll over on min_index_age and min_primary_shard_size, and whichever condition is met first will apply.
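Here is a minimal sketch of such an ISM policy; the policy name, the logs-* index pattern and the rollover alias setup are placeholders for your own. It rolls an index over at 10GB per primary shard or at 7 days (whichever comes first) and deletes it once it is 14 days old, i.e. up to 7 days of writes plus 7 days of retention, since min_index_age counts from index creation. If min_primary_shard_size isn't available on your OpenSearch version, min_size (total primary store size) works similarly.

    PUT _plugins/_ism/policies/rollover-and-expire
    {
      "policy": {
        "description": "Roll over at 10gb per primary shard or 7 days, delete old indices",
        "default_state": "hot",
        "states": [
          {
            "name": "hot",
            "actions": [
              {
                "rollover": {
                  "min_primary_shard_size": "10gb",
                  "min_index_age": "7d"
                }
              }
            ],
            "transitions": [
              {
                "state_name": "delete",
                "conditions": { "min_index_age": "14d" }
              }
            ]
          },
          {
            "name": "delete",
            "actions": [ { "delete": {} } ]
          }
        ],
        "ism_template": {
          "index_patterns": ["logs-*"],
          "priority": 100
        }
      }
    }

Each index series also needs the plugins.index_state_management.rollover_alias setting and a write alias pointing at its newest index (as in the template sketch above) for the rollover action to succeed.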

For indices that are very large, you can spread them across your cluster. But this is problematic with 28 nodes, because you'd need 14 primary shards to be balanced (14 primaries + 14 replicas = 28 shards, one per node). And if you have e.g. 14GB/day, it means you'd rotate only after 10 days in order to get 10GB per shard (14 GB/day × 10 days = 140 GB across 14 primaries). That's too much, IMO, so you can either rotate at 5GB per shard, or live with some imbalance. Or maybe you can have fewer, bigger nodes; then it will be easier to balance (you'd need fewer shards).


Mulțumesc (thank you) for all of the suggestions!

Hehe, you're welcome 🙂