OpenSearch performance tuning

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.9.0

Describe the issue:

OpenSearch ingestion is very slow at times. I read through the performance tuning page, but it’s not clear what commands to run to apply some of the recommendations there.

I’d appreciate it if you could share the commands used to apply recommendations like the ones below (my rough guesses follow the list, but I’m not sure they’re right):

  • Increase the number of indexing threads.
  • Reduce segment count.
  • Use mmap file I/O.
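
From what I can tell from the performance tuning page, these map to something like the settings and calls below, but I’m not sure I have them right (hostnames and index names are placeholders, and the thread count just matches my 10-CPU nodes):

# opensearch.yml (static setting, needs a node restart) - guessing one write thread per CPU
thread_pool.write.size: 10

# Reduce segment count by force-merging an index that is no longer being written to
curl -XPOST "http://localhost:9200/my-index/_forcemerge?max_num_segments=1"

# Use mmap file I/O for a new index (index-level store setting)
curl -XPUT "http://localhost:9200/my-index" -H 'Content-Type: application/json' -d '{"settings":{"index.store.type":"mmapfs"}}'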

Thanks

Configuration:

Cluster made of 3 VMs, each with 10 CPUs / 60 GB memory / 3 TB disk (SSD). JVM heap is 31 GB.

Relevant Logs or Screenshots:

I upgraded to 2.10.0 and still have the slowness issue.

My setup:

Kubernetes clusters (Fluentd) → 4 Logstash VMs → OpenSearch cluster.

I think it comes down to monitoring OpenSearch. My quick rules of thumb would be:

  • Do you max out the CPU? If you do, but only on one node, the cluster is imbalanced. If you do on all nodes, then there are some optimizations you can do, most notably checking whether you really need to store/index/doc_values all those fields, or tuning the merge policy (yes, the link is for Solr, but you’ll find the same options in OpenSearch). A quick way to check per-node CPU is shown right after this list.
  • If you don’t max out the CPU, is disk (latency?) or network the bottleneck? Maybe you only hit one node with all the bulk requests and it chokes, or something like that.
  • If it’s not CPU, nor disk/network, then it’s probably not OpenSearch :slight_smile: The best way to check is the thread pool metrics: if you don’t have any (significant) queue on writes and you don’t use all your write threads, you’re probably not pushing OpenSearch hard enough. Maybe you need more Logstash instances, or more threads on the existing pipelines if they aren’t fully utilized.
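
For the CPU part, a quick way to eyeball it per node (assuming the default port 9200 and nothing like a security proxy in front) is the _cat/nodes endpoint:

curl "http://localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,heap.percent,disk.used_percent"

If one node shows much higher cpu/load_1m than the others, that points to an imbalance rather than a capacity problem.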

Also, Fluentd can push data directly to OpenSearch, and should be able to do most of what Logstash does.
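
If you want to try that route, a rough sketch of a Fluentd output using the fluent-plugin-opensearch plugin would look something like this (host, index name and buffer values are placeholders to adapt):

<match **>
  @type opensearch              # needs the fluent-plugin-opensearch gem
  host opensearch-node1         # placeholder: an OpenSearch node or a load balancer
  port 9200
  scheme https
  index_name k8s-logs           # placeholder index name
  <buffer>
    @type file
    path /buffers/opensearch
    chunk_limit_size 16M
    flush_interval 5s
    flush_thread_count 4
  </buffer>
</match>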

@radu.gheorghe Thanks for the detailed response. I tried fluentd → OpenSearch directly and hit the bulk circuit breaker several times.
On the OpenSearch nodes: CPU utilization is fine, the underlying disk is SSD, and memory utilization hovers around 34GB most of the time.
How do I collect thread pool metrics over time? Is there a way to monitor OpenSearch metrics?

Thanks

You’re welcome. Yes, we have a tool that monitors OpenSearch: OpenSearch Monitoring Integration

For thread pool metrics in particular, you can hit the _cat/thread_pool endpoint (which is what we do as well, we just collect these metrics over time).
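
For example (assuming the default port 9200), this shows the write pool on every node:

curl "http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed"

If active is constantly at the pool size, or queue/rejected keeps growing, OpenSearch is the bottleneck; if not, the limit is upstream.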

As for fluentd → OpenSearch, which circuit breaker did you hit exactly? I’m thinking that maybe you can change the fluentd settings and make it send fewer (and bigger) batches.
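
The knobs for that are in the buffer section of the Fluentd output, something along these lines (the exact values are guesses you’d have to tune):

<buffer>
  @type file
  path /buffers/opensearch
  chunk_limit_size 32M        # bigger chunks mean bigger, fewer bulk requests
  flush_interval 15s          # flush less often
  flush_mode interval
  flush_thread_count 4        # limits how many flushes hit the cluster at once
</buffer>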

If CPU is fine and you have local SSDs (not over some network), then I’m pretty sure the bottleneck is not OpenSearch.

@radu.gheorghe thanks for the response. The current fluentd config I’m using is listed below.

<source>
  @type forward
  @id main_forward
  bind 0.0.0.0
  port 24240
</source>
<match **>
  @type label_router
  @id main
  metrics true
  <route>
    @label @83cf0df031f8ae5d521498c7a7ee8ff9
    metrics_labels {"id":"clusterflow:cattle-logging-system:cluster-wide-logs"}
    <match>
      namespaces
      negate false
    </match>
  </route>
</match>
<label @83cf0df031f8ae5d521498c7a7ee8ff9>
  <filter **>
    @type dedot
    @id clusterflow:cattle-logging-system:cluster-wide-logs:0
    de_dot_nested true
    de_dot_separator -
  </filter>
  <filter **>
    @type record_modifier
    @id clusterflow:cattle-logging-system:cluster-wide-logs:1
    <record>
      cluster xxxxxxx
    </record>
    <record>
      group yyyy
    </record>
  </filter>
  <match **>
    @type forward
    @id clusterflow:cattle-logging-system:cluster-wide-logs:clusteroutput:cattle-logging-system:cluster-logs-to-logstash
    slow_flush_log_threshold 150
    transport tcp
    <buffer tag,time>
      @type file
      chunk_limit_size 16M
      flush_interval 5s
      flush_mode interval
      flush_thread_count 10
      path /buffers/clusterflow:cattle-logging-system:cluster-logs:clusteroutput:cattle-logging-system:cluster-logs-to-logstash.*.buffer
      retry_forever true
      timekey 10m
      timekey_use_utc true
      timekey_wait 1m
    </buffer>
    <server>
      host logstash_server1
      port 7000
    </server>
    <server>
      host logstash_server2
      port 7000
    </server>
    <server>
      host logstash_server3
      port 7000
    </server>
  </match>
</label>
<label @ERROR>
  <match **>
    @type null
    @id main-fluentd-error
  </match>
</label>

Sounds like you’re sending from Fluentd to Logstash, not directly to OpenSearch. Maybe sending directly would be another (better?) option. The buffer options of your forward output look good to me; I would be surprised if this part is the bottleneck. Not sure about the filters, though. Maybe you can check Logstash and see where you are with regards to threads, CPU usage, etc.
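
For the Logstash side, it exposes a monitoring API on port 9600 by default, so something like this (run on each Logstash VM) shows per-pipeline event counts and queue stats:

curl "http://localhost:9600/_node/stats/pipelines?pretty"

If events in/out keep up with what Fluentd sends and the pipeline queues aren’t backing up, Logstash is probably not the bottleneck either.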