Hi,
We're seeing roughly a 2x drop in indexing rate after upgrading OpenSearch 1.2.4 → 1.3.1, from ~100k index calls/s down to 40-50k.
Our cluster consists of 10 hot ingestion/data nodes (24h retention) + 49 warm data nodes and handles around 100k events per second during peak hours.
Configuration, hardware, log shippers, etc. all remain the same; the only change is the OpenSearch version.
Any ideas where to look and what to check? Logs are empty, nothing suspicious there.
Or is it possible to downgrade from 1.3.x to 1.2.x?
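For reference, the rate numbers above can be sampled from the node stats API. Here's a rough sketch of how such a number could be computed (placeholder host/credentials, plain `requests` rather than our actual tooling; the `index_total` counter path is as I understand the node stats response):

```python
# Hypothetical sketch: sample cluster-wide indexing throughput from node stats.
# Host, auth and TLS settings are placeholders; adjust for your cluster.
import time
import requests

OPENSEARCH = "https://localhost:9200"   # assumption: local node with security plugin
AUTH = ("admin", "admin")               # assumption: placeholder credentials

def total_index_ops():
    # Node stats filtered to the indexing metric of the indices section.
    r = requests.get(f"{OPENSEARCH}/_nodes/stats/indices/indexing",
                     auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    nodes = r.json()["nodes"]
    return sum(n["indices"]["indexing"]["index_total"] for n in nodes.values())

INTERVAL = 10  # seconds between the two samples
before = total_index_ops()
time.sleep(INTERVAL)
after = total_index_ops()
print(f"~{(after - before) / INTERVAL:.0f} index ops/s across the cluster")
```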
Thank you, I just migrated a project from Elasticsearch 7.16.2 to OpenSearch 1.3.0 and noticed a huge indexing performance difference as well! Now I need to try updating the JDK in the Docker image…
I don't think that's my case.
We upgraded from 1.2.4 to 1.3.1, and if I'm not mistaken, both ship the same Lucene v8.10.1:
opensearch-1.2.4/lib/lucene-core-8.10.1.jar
opensearch-1.3.1/lib/lucene-core-8.10.1.jar
Although the JDK update made things much better, we are still observing indexing performance degradation compared to the numbers we had with OpenSearch v1.2.4, and the situation is frustrating, tbh: we can't stay on 1.3.1 because it can't keep up with our data flow, and we can't downgrade to 1.2.4 without losing data…
@faust93 there are not many high-impact changes between 1.2.x and 1.3.x (at first glance), but it certainly seems like something is holding the ingestion back. Would it be possible to fetch some stats on hot threads [1] and indexing backpressure [2] from the busy ingestion nodes? Thank you.
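In case it helps, here's a rough sketch of pulling both diagnostics from a single node (host and credentials are placeholders; the endpoints are the hot threads API and the shard indexing pressure stats API as I understand them):

```python
# Hedged sketch: fetch hot threads [1] and shard indexing backpressure stats [2]
# from one busy ingestion node. Adjust host, auth and TLS for your setup.
import requests

NODE = "https://node-03:9200"   # assumption: querying the busy node directly
AUTH = ("admin", "admin")       # assumption: placeholder credentials

# [1] Hot threads: plain-text report of the busiest threads on this node.
hot = requests.get(f"{NODE}/_nodes/_local/hot_threads",
                   params={"threads": 3, "interval": "500ms"},
                   auth=AUTH, verify=False, timeout=30)
print(hot.text)

# [2] Shard indexing backpressure stats for this node.
pressure = requests.get(f"{NODE}/_nodes/_local/stats/shard_indexing_pressure",
                        params={"include_all": "true"},
                        auth=AUTH, verify=False, timeout=30)
print(pressure.json())
```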
I see, but something has definitely changed.
As for hot threads, every ingestion node reports [transport_worker] as the top CPU consumer:
::: {node-03}{BvNSbAEWRf6huodWBI3Niw}{YIb9vOCZQOG9lx1AvCfokQ}{10.202.108.12}{10.202.108.12:9300}{dir}{temp=hot, shard_indexing_pressure_enabled=true}
Hot threads at 2022-04-15T17:37:03.784Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
90.0% (450ms out of 500ms) cpu usage by thread 'opensearch[node-03][transport_worker][T#1]'
9/10 snapshots sharing following 144 elements
Items marked red:
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
app//org.opensearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:637)
app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
app//org.opensearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:784)
app//org.opensearch.action.bulk.TransportBulkAction.doInternalExecute(TransportBulkAction.java:308)
app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:219)
app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:116)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:194)
org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:120)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:319)
org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
app//org.opensearch.client.support.AbstractClient.bulk(AbstractClient.java:514)
app//org.opensearch.rest.action.document.RestBulkAction.lambda$prepareRequest$0(RestBulkAction.java:129)
app//org.opensearch.rest.action.document.RestBulkAction$$Lambda$4469/0x0000000801a83110.accept(Unknown Source)
app//org.opensearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:128)
org.opensearch.security.filter.SecurityRestFilter$1.handleRequest(SecurityRestFilter.java:126)
app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:306)
app//org.opensearch.rest.RestController.tryAllHandlers(RestController.java:392)
app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:235)
app//org.opensearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:361)
app//org.opensearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:440)
app//org.opensearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:351)
As for backpressure, it's explicitly disabled via "shard_indexing_pressure.enabled: false", so all the related stats are zero.
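A minimal sketch of how one could confirm the effective value from the cluster settings API (same placeholder host/credentials as above; the key filtering is just an illustration):

```python
# Read cluster settings (defaults included, flattened keys) and print anything
# related to indexing pressure, to verify what is actually in effect.
import requests

OPENSEARCH = "https://localhost:9200"   # assumption: placeholder endpoint
AUTH = ("admin", "admin")               # assumption: placeholder credentials

r = requests.get(f"{OPENSEARCH}/_cluster/settings",
                 params={"include_defaults": "true", "flat_settings": "true"},
                 auth=AUTH, verify=False, timeout=10)
r.raise_for_status()

settings = r.json()
for scope in ("persistent", "transient", "defaults"):
    for key, value in settings.get(scope, {}).items():
        if "indexing_pressure" in key:
            print(f"{scope}: {key} = {value}")
```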
Below are some observations and experiments.
Briefly about configuration first:
10 ingestion data nodes. Rolling index with 10 shards.
Every OpenSearch ingestion node runs fluent-bit with a 'forward' input. The client nodes run fluent-bit as well; their output is configured as an upstream listing all the OpenSearch target nodes, so the traffic is effectively round-robined.
With the configuration above, before the 1.3.1 update we had a steady indexing rate that more or less matched the ingestion flow: