OpenSearch cluster is running high CPU and response time is also high

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
version" : {
“distribution” : “opensearch”,
“number” : “1.2.2”,
“build_type” : “tar”,
“build_hash” : “123d41ce4fad54529acd7a290efed848e707b624”,
“build_date” : “2021-12-15T18:03:07.761961Z”,
“build_snapshot” : false,
“lucene_version” : “8.10.1”,
“minimum_wire_compatibility_version” : “6.8.0”,
“minimum_index_compatibility_version” : “6.0.0-beta1”
},

Describe the issue:
Hello All,
We recently migrated our search cluster from Elasticsearch

"version" : {
  "number" : "6.8.16",
  "build_flavor" : "default",
  "build_type" : "deb",
  "build_hash" : "1f62092",
  "build_date" : "2021-05-21T19:27:57.985321Z",
  "build_snapshot" : false,
  "lucene_version" : "7.7.3",
  "minimum_wire_compatibility_version" : "5.6.0",
  "minimum_index_compatibility_version" : "5.0.0"
},

to OpenSearch. Hardware-wise both clusters are the same, but we are seeing high CPU spikes on the OpenSearch cluster and are not sure what the next steps are. Any help would be appreciated.

Configuration:
Number of shards: 24 on both OpenSearch and Elasticsearch (replica count is 1 per primary shard).
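
For reference, the per-index shard and replica counts show up as index.number_of_shards and index.number_of_replicas in the index settings (index name below is a placeholder):

  GET /my-index/_settings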

Relevant Logs or Screenshots:

I think the next step is to monitor OpenSearch and see what’s taking CPU. It could be GC, it could be indexing, queries or maybe some other thread pool that’s doing work. We wrote a metrics guide for Elasticsearch a while ago that mostly applies to OpenSearch.
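
For example, the hot threads API plus thread pool and JVM stats give a first cut of where CPU is going (run against any node or the cluster endpoint):

  GET _nodes/hot_threads
  GET _cat/thread_pool?v
  GET _nodes/stats/jvm,thread_pool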


Thanks for the kind reply. After analyzing the profile output, we are seeing most of the calls spending more time in TermQuery. Any suggestions to fix this?
Thanks
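
For reference, that kind of profile output comes from running the search with "profile": true; a minimal sketch, with a placeholder index and field:

  GET /my-index/_search
  {
    "profile": true,
    "query": {
      "term": { "status": "active" }
    }
  }

The response then breaks the time down per query component and per shard, which is where the TermQuery timings come from.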

TermQuery is the most basic query that can run; you can't really optimize that.

Actually you can (with a more aggressive merge policy - the blog post is about Solr, but you have similar options in OpenSearch) but usually the problem is higher up. For example, the number of TermQuery clauses, the layout of your data, number of shards, how well they’re balanced, etc.
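
As a rough sketch of the merge-policy angle (values are illustrative, not recommendations, and a force merge only makes sense on an index that is no longer being written to heavily):

  POST /my-index/_forcemerge?max_num_segments=1

  PUT /my-index/_settings
  {
    "index.merge.policy.segments_per_tier": 5,
    "index.merge.policy.max_merged_segment": "10gb"
  }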


Thanks for the kind response. One thing we are seeing when comparing ES to OS: from the hardware perspective both are the same machines, but the OS profiler data shows the term query section taking ~240 ms, while ES takes ~92 ms on the same data. Hence we are not sure what the problem is. Both have the same number of shards (16) and 8 data nodes.

I don’t know why you’d see that difference besides:

  • the "random" distribution of documents between shards, if you're letting OS/ES choose IDs
  • the “random” nature of merges, when they kick in
  • the Lucene version

And you can’t do much about any of the above. Which is why I’d generally suggest concentrating on optimizing what you have vs comparing to what you had before. Unless you’re still deciding whether to make the upgrade or not.
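
For the shard balance point, a quick check is to compare per-shard document counts and sizes (index name is a placeholder):

  GET _cat/shards/my-index?v&h=index,shard,prirep,docs,store,node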


Thanks for the reply. I have increased the shard count to 32 and am seeing better performance. Any idea how the shard count plays a major role here? Still not able to connect the dots.

With more shards you’re parallelizing queries more. But there’s also more overhead in merging per-shard results.
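
For reference, changing the primary shard count means creating a new index and reindexing into it; a minimal sketch with placeholder names:

  PUT /my-index-v2
  {
    "settings": { "number_of_shards": 32, "number_of_replicas": 1 }
  }

  POST _reindex
  {
    "source": { "index": "my-index" },
    "dest": { "index": "my-index-v2" }
  }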

Maybe concurrent segment search will help you? Introducing concurrent segment search in OpenSearch · OpenSearch
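
Note that concurrent segment search needs a much newer OpenSearch than 1.2.2. On 2.x releases that ship it, it is toggled with a dynamic cluster setting along these lines (the exact setting name and default have changed between releases, so check the docs for your version):

  PUT _cluster/settings
  {
    "persistent": {
      "search.concurrent_segment_search.enabled": true
    }
  }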


An interesting fact: I created a new cluster with the same data (created a new index and back-filled the data), but now I am seeing high response times. Are any specific warm-ups / a warm-up period required? It's really super surprising.

Yeah, if you run a query right after ingesting a lot of data, it might be that the operating system's page cache doesn't yet have everything the query needs. Plus, all OpenSearch-specific query-related caches (query cache, request cache) will be cold.
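
A simple warm-up, as a sketch, is to replay a few representative production queries after the back-fill; for size-0 aggregations the shard request cache can also be used explicitly (index name and query are placeholders):

  GET /my-index/_search?request_cache=true
  {
    "size": 0,
    "aggs": {
      "by_status": { "terms": { "field": "status" } }
    }
  }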


Thanks. Is there any way we can improve things? We didn't use the k-NN option while creating the index, since it's a prod index.
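
For context, the k-NN option has to be enabled at index creation time; a minimal sketch of such a mapping (field name and dimension are placeholders, and the method options depend on the k-NN plugin version):

  PUT /my-knn-index
  {
    "settings": { "index.knn": true },
    "mappings": {
      "properties": {
        "my_vector": { "type": "knn_vector", "dimension": 128 }
      }
    }
  }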

Adding more updates:
Here we increased the query cache to 20%, which helped a little, but the cluster CPU is still not coming down. On the other side, the Elasticsearch cluster is handling the same load very efficiently. Any more recommendations would be great.
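
For reference, that 20% change corresponds to the node-level query cache setting below (set in opensearch.yml and picked up on restart); the hit rate can then be watched from node stats:

  indices.queries.cache.size: 20%

  GET _nodes/stats/indices/query_cache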