Hi,
We're seeing roughly a 2x drop in indexing rate after upgrading OpenSearch 1.2.4 → 1.3.1, from ~100k index calls/s down to 40-50k.
Our cluster consists of 10 hot ingestion/data nodes (24h retention) + 49 warm data nodes and handles around 100k events per second during peak hours.
Configuration, hardware, log shippers, etc. all remain the same; the only change is the OpenSearch version.
Any ideas where to look and what to check? Logs are empty, nothing suspicious there.
Or is it possible to downgrade from 1.3.x to 1.2.x?
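For reference, the rate numbers above can be sampled from the node stats API. Here's a rough sketch of how such a number could be computed (placeholder host/credentials, plain `requests` rather than our actual tooling; the `index_total` counter path is as I understand the node stats response):

```python
# Hypothetical sketch: sample cluster-wide indexing throughput from node stats.
# Host, auth and TLS settings are placeholders; adjust for your cluster.
import time
import requests

OPENSEARCH = "https://localhost:9200"   # assumption: local node with security plugin
AUTH = ("admin", "admin")               # assumption: placeholder credentials

def total_index_ops():
    # Node stats filtered to the indexing metric of the indices section.
    r = requests.get(f"{OPENSEARCH}/_nodes/stats/indices/indexing",
                     auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    nodes = r.json()["nodes"]
    return sum(n["indices"]["indexing"]["index_total"] for n in nodes.values())

INTERVAL = 10  # seconds between the two samples
before = total_index_ops()
time.sleep(INTERVAL)
after = total_index_ops()
print(f"~{(after - before) / INTERVAL:.0f} index ops/s across the cluster")
```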
Thank you, I just migrated a project from Elasticsearch 7.16.2 to OpenSearch 1.3.0 and noticed a huge indexing performance difference as well! Now I need to try updating the JDK in the Docker image…
I don't think that's my case.
We upgraded from 1.2.4 to 1.3.1, and if I'm not mistaken, both ship the same Lucene v8.10.1:
opensearch-1.2.4/lib/lucene-core-8.10.1.jar
opensearch-1.3.1/lib/lucene-core-8.10.1.jar
Although the JDK update made things much better, we are still observing indexing performance degradation compared to the numbers we had with OpenSearch v1.2.4, and the situation is frustrating, tbh: we can't stay on 1.3.1 because it can't keep up with our data flow, and we can't downgrade to 1.2.4 without losing data…
@faust93 there are not many high-impact changes between 1.2.x and 1.3.x (at first glance), but it certainly seems like something is holding the ingestion back. Would it be possible to fetch some stats on hot threads [1] and indexing backpressure [2] from the busy ingestion nodes? Thank you.
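In case it helps, here's a rough sketch of pulling both diagnostics from a single node (host and credentials are placeholders; the endpoints are the hot threads API and the shard indexing pressure stats API as I understand them):

```python
# Hedged sketch: fetch hot threads [1] and shard indexing backpressure stats [2]
# from one busy ingestion node. Adjust host, auth and TLS for your setup.
import requests

NODE = "https://node-03:9200"   # assumption: querying the busy node directly
AUTH = ("admin", "admin")       # assumption: placeholder credentials

# [1] Hot threads: plain-text report of the busiest threads on this node.
hot = requests.get(f"{NODE}/_nodes/_local/hot_threads",
                   params={"threads": 3, "interval": "500ms"},
                   auth=AUTH, verify=False, timeout=30)
print(hot.text)

# [2] Shard indexing backpressure stats for this node.
pressure = requests.get(f"{NODE}/_nodes/_local/stats/shard_indexing_pressure",
                        params={"include_all": "true"},
                        auth=AUTH, verify=False, timeout=30)
print(pressure.json())
```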
I see, but something has definitely changed.
As for hot threads, every ingestion node reports [transport_worker] as the top CPU consumer:
::: {node-03}{BvNSbAEWRf6huodWBI3Niw}{YIb9vOCZQOG9lx1AvCfokQ}{10.202.108.12}{10.202.108.12:9300}{dir}{temp=hot, shard_indexing_pressure_enabled=true}
Hot threads at 2022-04-15T17:37:03.784Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
90.0% (450ms out of 500ms) cpu usage by thread 'opensearch[node-03][transport_worker][T#1]'
9/10 snapshots sharing following 144 elements
Items marked red:
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
app//org.opensearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:637)
app//org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
app//org.opensearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:784)
app//org.opensearch.action.bulk.TransportBulkAction.doInternalExecute(TransportBulkAction.java:308)
app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:219)
app//org.opensearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:116)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:194)
org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:120)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:319)
org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:78)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:192)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:169)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:97)
app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
app//org.opensearch.client.support.AbstractClient.bulk(AbstractClient.java:514)
app//org.opensearch.rest.action.document.RestBulkAction.lambda$prepareRequest$0(RestBulkAction.java:129)
app//org.opensearch.rest.action.document.RestBulkAction$$Lambda$4469/0x0000000801a83110.accept(Unknown Source)
app//org.opensearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:128)
org.opensearch.security.filter.SecurityRestFilter$1.handleRequest(SecurityRestFilter.java:126)
app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:306)
app//org.opensearch.rest.RestController.tryAllHandlers(RestController.java:392)
app//org.opensearch.rest.RestController.dispatchRequest(RestController.java:235)
app//org.opensearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:361)
app//org.opensearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:440)
app//org.opensearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:351)
As for backpressure, it's explicitly disabled via "shard_indexing_pressure.enabled: false", so all the related stats are zero.
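A minimal sketch of how one could confirm the effective value from the cluster settings API (same placeholder host/credentials as above; the key filtering is just an illustration):

```python
# Read cluster settings (defaults included, flattened keys) and print anything
# related to indexing pressure, to verify what is actually in effect.
import requests

OPENSEARCH = "https://localhost:9200"   # assumption: placeholder endpoint
AUTH = ("admin", "admin")               # assumption: placeholder credentials

r = requests.get(f"{OPENSEARCH}/_cluster/settings",
                 params={"include_defaults": "true", "flat_settings": "true"},
                 auth=AUTH, verify=False, timeout=10)
r.raise_for_status()

settings = r.json()
for scope in ("persistent", "transient", "defaults"):
    for key, value in settings.get(scope, {}).items():
        if "indexing_pressure" in key:
            print(f"{scope}: {key} = {value}")
```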
Below are some observations and experiments.
Briefly about configuration first:
10 ingestion data nodes. Rolling index with 10 shards.
Every OpenSearch ingestion node runs fluent-bit with a 'forward' input. The client nodes run fluent-bit as well; their output is configured as an upstream listing all the OpenSearch target nodes, so the traffic is effectively round-robined.
With the configuration above, before the 1.3.1 update we had a steady indexing rate that more or less matched the ingestion flow: