OpenSearch Version 2.12
TLDR;
- set ‘search_backpressure.cancellation_burst’ to ‘10’ (instead of '10.0) crashes Cluster
Question: How to recover the whole instance?
Configuration:
Cluster of 7 nodes (docker setup on dedicated VMware nodes)
1x master+data
2x master
4x data
What we’ve done:
We carefully started to tweak backpressure configuration by
- geting cluster settings with all defaults.
- set backpressure default settings to:
PUT /_cluster/settings
{
"persistent": {
"search_backpressure": {
"mode": "monitor_only",
"cancellation_burst": "10",
"cancellation_ratio": "0.1",
"cancellation_rate": "0.003",
"search_task.elapsed_time_millis_threshold": "45000",
"search_task.heap_variance": "2.0",
"search_task.heap_percent_threshold": "0.02",
"search_task.cancellation_burst": "5.0",
"search_task.cpu_time_millis_threshold": "30000",
"search_task.cancellation_ratio": "0.1",
"search_task.cancellation_rate": "0.003",
"search_task.total_heap_percent_threshold": "0.05",
"search_task.heap_moving_average_window_size": "100",
"node_duress.cpu_threshold": "0.9",
"node_duress.heap_threshold": "0.7",
"node_duress.num_successive_breaches": "3",
"search_shard_task.elapsed_time_millis_threshold": "30000",
"search_shard_task.heap_variance": "2.0",
"search_shard_task.heap_percent_threshold": "0.005",
"search_shard_task.cancellation_burst": "10.0",
"search_shard_task.cpu_time_millis_threshold": "15000",
"search_shard_task.cancellation_ratio": "0.1",
"search_shard_task.cancellation_rate": "0.003",
"search_shard_task.total_heap_percent_threshold": "0.05",
"search_shard_task.heap_moving_average_window_size": "100"
}
}
}
this crashes the cluster due to:
"cancellation_burst": "10",
vs:
"cancellation_burst": "10.0",
The docker logs shows the following error:
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
→ logs can’t be processed no longer from our graylog to Opensearch
→ ‘GET’ can be accessed within Dasboards dev console
→ ‘PUT’ can’t be processed within Dasboards dev console
First Panic reaction:
- stop the whole cluster
- start the whole cluster
result:
Cluster does not come up. The following log entries were fired immediately again and again:
[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
... 13 more
What we’ve done:
- try to set the settings via opensearch.yml (no luck)
- try to connect via Opensearch Dashboards (no connection to cluster)
Any help would be appreciated
Thanks
Marcus