Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Describe the issue:
OpenSearch ingestion is very slow at times. I read through the performance tuning page, but it's not clear which commands to run to apply some of the recommendations there.
I'd appreciate if someone could share the commands used to apply recommendations like:
- Increase the number of indexing threads.
- Reduce segment count.
- Use mmap file I/O.
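For reference, these recommendations map roughly to the following settings and API calls. This is a sketch only, assuming a cluster on localhost:9200; the index names and values are placeholders, not recommendations, so test on your own cluster before applying:

```shell
# 1) More indexing threads: thread_pool.write.size is a static node
#    setting, so it goes in opensearch.yml (node restart required):
#      thread_pool.write.size: 10

# 2) Fewer segments: raise refresh_interval so segments are created
#    less often (example index name "logs-example")...
curl -XPUT "localhost:9200/logs-example/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'

# ...and/or force-merge an index that is no longer written to:
curl -XPOST "localhost:9200/logs-example/_forcemerge?max_num_segments=1"

# 3) mmap file I/O: index.store.type is set when the index is created:
curl -XPUT "localhost:9200/logs-example-new" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index": {"store": {"type": "mmapfs"}}}}'
```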
The cluster is made of 3 VMs, each with 10 CPUs / 60 GB memory / 3 TB disk (SSD). The JVM heap is 31 GB.
Relevant Logs or Screenshots:
I upgraded to 2.10.0 and still have slowness issue.
Kubernetes clusters → Fluentd → 4 Logstash VMs → OpenSearch cluster.
I think it’s down to monitoring OpenSearch. My quick rule of thumb would be:
- Do you max out the CPU? If you do, but only on one node, the cluster is imbalanced. If you max it out on all nodes, there are some optimizations you can make, most notably checking whether you really need to store/index/doc_values all those fields, or tuning the merge policy (yes, the link is for Solr, but you'll find the same options in OpenSearch).
- If you don't max out the CPU, is disk (latency?) or network the bottleneck? Maybe all the bulk requests hit one node and it chokes, or something like that.
- If it's not CPU, nor disk/network, then it's probably not OpenSearch. The best way to check is the thread pool metrics: if you don't have any (significant) queue on writes and you don't use all your write threads, you're probably not pushing hard enough. Maybe you need more Logstash instances, or more threads on the existing pipelines if you don't push them hard enough.
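To make the checks above concrete, here is a sketch of the relevant stats APIs, assuming a cluster reachable on localhost:9200:

```shell
# Per-node CPU/load and filesystem stats (disk bottleneck check)
curl -s "localhost:9200/_nodes/stats/os,fs?pretty"

# What the busiest threads are doing right now (CPU bottleneck check)
curl -s "localhost:9200/_nodes/hot_threads"

# Write thread pool: active threads, queue depth, rejections
curl -s "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"
```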
FluentD can also push data directly to OpenSearch, and should be able to do most of what Logstash does.
@radu.gheorghe Thanks for the detailed response. I tried fluentd → OpenSearch directly and hit the bulk circuit breaker several times.
On the OS nodes: CPU utilization is fine, the underlying disk is SSD, and memory utilization hovers around 34GB most of the time.
How do I collect thread pool metrics over time? Is there a way to monitor OpenSearch metrics?
You’re welcome. Yes, we have a tool that monitors OpenSearch: OpenSearch Monitoring Integration
For thread pool metrics in particular, you can hit the _cat/thread_pool endpoint (which is what we do as well; we just collect these metrics over time).
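For example, sampling that endpoint periodically could look like this sketch (assumes localhost:9200 and that `watch` is available):

```shell
# Re-run _cat/thread_pool every 10 seconds. A growing "queue" or a
# nonzero "rejected" column points at OpenSearch as the bottleneck;
# mostly-idle "active" threads point away from it.
watch -n 10 'curl -s "localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected,size"'
```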
As for fluentd → OpenSearch, which circuit breaker did you hit exactly? I'm thinking that maybe you can change the fluentd settings to make it send fewer (and bigger) batches.
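One way to see which breaker is tripping, as a sketch (assumes localhost:9200): the node stats API reports, per breaker, the configured limit, the estimated size, and how many times it has tripped.

```shell
# Per-breaker limit, estimated usage, and trip count for every node;
# look for a nonzero "tripped" value (e.g. on in_flight_requests or parent).
curl -s "localhost:9200/_nodes/stats/breaker?pretty"
```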
If CPU is fine and you have local SSDs (not over some network), then I'm pretty sure the bottleneck is not OpenSearch.
@radu.gheorghe thanks for the response. The fluentd config I'm currently using is listed below.
Sounds like you're sending from Fluentd to Logstash, not directly to OpenSearch. Maybe that would be another (better?) option. The buffer options of your forward output look good to me; I'd be surprised if this part is the bottleneck. Not sure about the filters, though. Maybe you can check Logstash and see where you stand with regard to threads, CPU usage, etc.
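Logstash exposes its own monitoring API (default port 9600) for exactly this kind of check; as a sketch:

```shell
# Per-pipeline stats: worker counts, events in/out, and per-plugin timings
curl -s "localhost:9600/_node/stats/pipelines?pretty"

# JVM heap and process CPU for the Logstash instance itself
curl -s "localhost:9600/_node/stats/jvm?pretty"
```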