Massive indexing performance degradation after 1.2.4 → 1.3.1 update

Hi,
We saw our indexing rate drop by almost half after the OpenSearch 1.2.4 → 1.3.1 upgrade, from ~100k index calls per second to 40-50k.
Our cluster consists of 10 hot ingestion/data nodes (24h retention) plus 49 warm data nodes, and handles a flow of around 100k events/s during spike hours.
Configuration, hardware, log shippers, etc. all stayed the same; only the OpenSearch version changed.

Any ideas where to look and what to check? The logs are empty; nothing suspicious there.
Or is it possible to downgrade from 1.3.x to 1.2.x?


Is your CPU spiking, or is the disk busy?
There are several possible reasons for a performance drop, so it is hard to tell what exactly caused it.

And a downgrade is not possible, I guess xD

Yes, the cluster hardware resources are quite busy, but on version 1.2.4 it was able to cope with the incoming data flow.

But it seems we found the culprit: switching from the default JDK v11.0.14.1 shipped with OpenSearch 1.3.1 to JDK v15.0.1 made everything great again! :wink:
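
For anyone who wants to try the same swap, here is a minimal sketch of one way to do it on a tarball install, assuming bin/opensearch-env picks up OPENSEARCH_JAVA_HOME before falling back to the bundled ./jdk; the /opt/jdk-15.0.1 path is just a placeholder (for package installs the variable would go into the service environment instead):

    # Sketch only: point a tarball install at an external JDK.
    # Assumes bin/opensearch-env checks OPENSEARCH_JAVA_HOME before the bundled ./jdk.
    # /opt/jdk-15.0.1 is a placeholder for wherever JDK 15 is unpacked.
    export OPENSEARCH_JAVA_HOME=/opt/jdk-15.0.1

    # sanity check: which JVM will actually be used
    "$OPENSEARCH_JAVA_HOME/bin/java" -version

    # restart the node so it comes up on the new JVM
    ./bin/opensearch -d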


That is good news! Can you share more information about it, like how the JDK can cause such a performance drop? xD

Ok I’ll share some graphs below.

How it usually looks (the day before the version upgrade; it's more or less the same for all previous days):

1.2.4 → 1.3.1 version up (around 7:00 UTC):

Next day, JDK switch 11.0.14 → 15.0.1 (around 16:00 UTC):


Good share :kissing:

I’m curious if you’re running into what I’m reading about here:

https://github.com/opensearch-project/OpenSearch/issues/2820

Does this match your use case?

Thank you, I just migrated a project from Elasticsearch 7.16.2 to OpenSearch 1.3.0 and noticed a huge indexing performance difference as well! Now I need to try to update the JDK in the Docker image…
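
If it helps anyone else on Docker, a rough sketch of one way to try a different JDK without rebuilding the image is to mount a Linux JDK from the host and point OPENSEARCH_JAVA_HOME at it (paths and the image tag are placeholders; baking the JDK into a derived image would be the cleaner long-term fix):

    # Rough sketch: mount a host JDK (must be a Linux x64 build) into the official
    # image and point OPENSEARCH_JAVA_HOME at it. Paths and the tag are placeholders.
    docker run -d --name os-node \
      -p 9200:9200 -p 9600:9600 \
      -e "discovery.type=single-node" \
      -e "OPENSEARCH_JAVA_HOME=/usr/share/opensearch/custom-jdk" \
      -v /opt/jdk-15.0.1:/usr/share/opensearch/custom-jdk:ro \
      opensearchproject/opensearch:1.3.0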

I don't think that's my case.
We were upgrading from 1.2.4 to 1.3.1, and if I'm not wrong both ship the same Lucene v8.10.1:
opensearch-1.2.4/lib/lucene-core-8.10.1.jar
opensearch-1.3.1/lib/lucene-core-8.10.1.jar

Although the JDK update made things much better, we are still observing indexing performance degradation compared to the numbers we had with OpenSearch v1.2.4, and the situation is ridiculous tbh: we can't stay on 1.3.1 because it's incapable of coping with our data flow, and we can't downgrade to 1.2.4 without losing the data… :frowning:


@faust93 at first glance there are not many high-impact changes between 1.2.x and 1.3.x, but certainly it seems like something is holding the ingestion back. Is it possible to fetch some stats regarding hot threads [1] and indexing backpressure [2] from the busy ingestion nodes? A curl sketch follows the links below. Thank you.

[1] Nodes hot threads API | Elasticsearch Guide [7.10] | Elastic
[2] Shard Indexing Backpressure in OpenSearch · OpenSearch
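
Something like this from one of the busy nodes would already help (host and credentials are placeholders; with the security plugin you will likely need -k and basic auth, and I believe the backpressure stats live under _nodes/stats/shard_indexing_pressure):

    # Sketch: grab hot threads and shard indexing backpressure stats from a node.
    # admin:admin and localhost are placeholders for your own security setup.
    curl -sk -u admin:admin \
      "https://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms"

    curl -sk -u admin:admin \
      "https://localhost:9200/_nodes/stats/shard_indexing_pressure?include_all&pretty"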


I see, but something has definitely changed.
As for the hot threads, every ingesting node reports [transport_worker] as the top CPU consumer:

::: {node-03}{BvNSbAEWRf6huodWBI3Niw}{YIb9vOCZQOG9lx1AvCfokQ}{10.202.108.12}{10.202.108.12:9300}{dir}{temp=hot, shard_indexing_pressure_enabled=true}
   Hot threads at 2022-04-15T17:37:03.784Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   90.0% (450ms out of 500ms) cpu usage by thread 'opensearch[node-03][transport_worker][T#1]'
     9/10 snapshots sharing following 144 elements
   Items marked red:
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
       org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
       app//org.opensearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:637)
       app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
       app//org.opensearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:784)
       app//org.opensearch.action.bulk.TransportBulkAction.doInternalExecute(TransportBulkAction.java:308)
       app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:219)
       app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:116)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:194)
       org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:120)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
       org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:319)
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
       org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
       app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
       app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
       app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
       app//org.opensearch.client.support.AbstractClient.bulk(AbstractClient.java:514)
       app//org.opensearch.rest.action.document.RestBulkAction.lambda$prepareRequest$0(RestBulkAction.java:129)
       app//org.opensearch.rest.action.document.RestBulkAction$$Lambda$4469/0x0000000801a83110.accept(Unknown Source)
       app//org.opensearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:128)
       org.opensearch.security.filter.SecurityRestFilter$1.handleRequest(SecurityRestFilter.java:126)
       app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:306)
       app//org.opensearch.rest.RestController.tryAllHandlers(RestController.java:392)
       app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:235)
       app//org.opensearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:361)
       app//org.opensearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:440)
       app//org.opensearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:351)

As for backpressure: it's explicitly disabled by setting "shard_indexing_pressure.enabled: false", so all of its stats are zero.
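
(If it turns out to be useful, it should be possible to flip it back on temporarily to collect those stats, assuming the setting is dynamic; a sketch with placeholder credentials:)

    # Sketch: check the current value, then temporarily re-enable shard indexing
    # backpressure so its stats get populated (assumes the setting is dynamic).
    curl -sk -u admin:admin \
      "https://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
      | grep shard_indexing_pressure.enabled

    curl -sk -u admin:admin -X PUT -H 'Content-Type: application/json' \
      "https://localhost:9200/_cluster/settings" \
      -d '{"transient": {"shard_indexing_pressure.enabled": true}}'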

Below are some observations and experiments.
Briefly about the configuration first:
10 ingestion data nodes, and a rolling index with 10 shards.
Every OpenSearch ingestion node runs fluent-bit with a 'forward' input. The client nodes run fluent-bit as well; their output is configured as an upstream listing all the OpenSearch target nodes, so there's a kind of round-robin.

With the configuration above, before the 1.3.1 update we had a constant indexing rate that more or less matched the ingestion flow:


So every node performed around 8-10k indexing operations/s

After upgrading to 1.3.1 and replacing the JDK, we got this:


The incoming flow is still the same (100k +/-), but per-node indexing performance has dropped to 6.5-7k.

I tried experimenting with the number of shards (5, 8, 3) and got this:


So I'm slowly going mad, because it turns out that just 3 nodes are able to outperform the whole set of 10? Or what's wrong with my cluster? :slight_smile:
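
For anyone who wants to repeat the shard-count experiment: changing the shard count for the next rollover is just an index-template tweak, roughly like this (template name and index pattern are placeholders, not our real ones):

    # Sketch: set the shard count that newly created/rolled-over indices will get.
    # Template name and index pattern are placeholders.
    curl -sk -u admin:admin -X PUT -H 'Content-Type: application/json' \
      "https://localhost:9200/_index_template/logs-rolling" \
      -d '{
            "index_patterns": ["logs-*"],
            "template": {
              "settings": {
                "index.number_of_shards": 3,
                "index.number_of_replicas": 1
              }
            }
          }'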
