Unable to start OpenSearch: looping 'failed to apply settings' and 'rate must be greater than zero'

OpenSearch Version 2.12

TL;DR:

  • Setting ‘search_backpressure.cancellation_burst’ to ‘10’ (instead of ‘10.0’) crashes the cluster

Question: How to recover the whole instance?

Configuration:
Cluster of 7 nodes (docker setup on dedicated VMware nodes)
1x master+data
2x master
4x data

What we’ve done:
We carefully started to tweak the backpressure configuration by

  • getting the cluster settings with all defaults.
  • setting the backpressure default settings to:
PUT /_cluster/settings
{
  "persistent": {
    "search_backpressure": {
      "mode": "monitor_only",
      "cancellation_burst": "10",
      "cancellation_ratio": "0.1",
      "cancellation_rate": "0.003",

      "search_task.elapsed_time_millis_threshold": "45000",
      "search_task.heap_variance": "2.0",
      "search_task.heap_percent_threshold": "0.02",
      "search_task.cancellation_burst": "5.0",
      "search_task.cpu_time_millis_threshold": "30000",
      "search_task.cancellation_ratio": "0.1",
      "search_task.cancellation_rate": "0.003",
      "search_task.total_heap_percent_threshold": "0.05",
      "search_task.heap_moving_average_window_size": "100",

      "node_duress.cpu_threshold": "0.9",
      "node_duress.heap_threshold": "0.7",
      "node_duress.num_successive_breaches": "3",
      
      "search_shard_task.elapsed_time_millis_threshold": "30000",
      "search_shard_task.heap_variance": "2.0",
      "search_shard_task.heap_percent_threshold": "0.005",
      "search_shard_task.cancellation_burst": "10.0",
      "search_shard_task.cpu_time_millis_threshold": "15000",
      "search_shard_task.cancellation_ratio": "0.1",
      "search_shard_task.cancellation_rate": "0.003",
      "search_shard_task.total_heap_percent_threshold": "0.05",
      "search_shard_task.heap_moving_average_window_size": "100"
      
    }
  }
}

This crashes the cluster because of:

      "cancellation_burst": "10",
vs:
      "cancellation_burst": "10.0",
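For reference, the nested request body above is stored as flat, dotted setting names, which is why the key shows up in the log as search_backpressure.cancellation_burst. Below is a minimal Python sketch of that flattening. The helper name flatten_settings is made up for illustration, and the body is abbreviated to a few keys from the request. It shows that the top-level cancellation_burst is a separate key from the search_task and search_shard_task ones.

```python
def flatten_settings(node, prefix=""):
    """Flatten a nested settings object into dotted setting names."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested objects contribute their key as a name prefix.
            flat.update(flatten_settings(value, name))
        else:
            flat[name] = value
    return flat


# Abbreviated copy of the request body from the post (illustration only).
body = {
    "persistent": {
        "search_backpressure": {
            "mode": "monitor_only",
            "cancellation_burst": "10",
            "search_task.cancellation_burst": "5.0",
            "search_shard_task.cancellation_burst": "10.0",
        }
    }
}

for name, value in flatten_settings(body["persistent"]).items():
    print(f"{name} = {value}")
```

Printing the flattened names before sending a large PUT makes it easier to spot a key that was set at the wrong level.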

The Docker logs show the following error:

[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero

→ Logs can no longer be shipped from our Graylog to OpenSearch
→ ‘GET’ requests still work in the Dashboards dev console
→ ‘PUT’ requests can’t be processed in the Dashboards dev console

First panic reaction:

  • stop the whole cluster
  • start the whole cluster

Result:
The cluster does not come up. The following log entries are emitted immediately, again and again:

[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
	at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
	... 13 more
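To make a log like the one above easier to read, here is a small Python sketch that lists the settings changed in the failing batch and pulls the first setter method out of the stack trace. It is a heuristic that assumes the log format shown above, and the function name summarize_failure is invented for this example. On this log it reports search_backpressure.cancellation_burst as the last changed key and SearchShardTaskSettings.setCancellationBurst as the setter that threw. That narrows down where to look, but it does not by itself explain why the rate ended up non-positive.

```python
import re

# "updating [key] from [old] to [new]" lines written by ClusterSettings.
UPDATE_RE = re.compile(
    r"updating \[(?P<key>[^\]]+)\] from \[(?P<old>[^\]]*)\] to \[(?P<new>[^\]]*)\]"
)
# First stack frame whose method name starts with "set".
FRAME_RE = re.compile(r"^\s*at (?P<cls>[\w.$]+)\.(?P<method>set\w+)\(")


def summarize_failure(log_text):
    """Return (changed settings, failing setter) for the first failed batch."""
    updates, setter, failed = [], None, False
    for line in log_text.splitlines():
        if not failed:
            match = UPDATE_RE.search(line)
            if match:
                updates.append((match["key"], match["old"], match["new"]))
            elif "failed to apply settings" in line:
                failed = True
        else:
            match = FRAME_RE.match(line)
            if match:
                simple_class = match["cls"].rsplit(".", 1)[-1]
                setter = f"{simple_class}.{match['method']}"
                break
    return updates, setter


# Shortened excerpt of the log from the post.
SAMPLE = """\
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
\tat org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209)
\tat org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275)
\tat org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257)
"""

updates, setter = summarize_failure(SAMPLE)
for key, old, new in updates:
    print(f"{key}: {old!r} -> {new!r}")
print("failing setter:", setter)
```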

What we’ve tried since:

  • tried to set the settings via opensearch.yml (no luck)
  • tried to connect via OpenSearch Dashboards (no connection to the cluster)

Any help would be appreciated

Thanks
Marcus

This seems to be a bug. I'm not sure why it happens; I'll dig deeper into it.

@mawen222222 You can use the opensearch-node tool to remove the invalid cluster setting from the cluster state, after which the cluster should be able to recover. Run it on each cluster-manager-eligible node while that node is stopped:

./bin/opensearch-node remove-settings search_backpressure.cancellation_burst
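Once the cluster is back up, it may be worth confirming that the setting is really gone. The sketch below assumes you have fetched the response of GET /_cluster/settings?flat_settings=true and parsed it into a Python dict. The function name lingering_settings and the sample response are made up for illustration.

```python
def lingering_settings(settings_response, prefix):
    """Return persistent/transient settings whose name starts with prefix."""
    found = {}
    for scope in ("persistent", "transient"):
        for name, value in settings_response.get(scope, {}).items():
            if name.startswith(prefix):
                found[f"{scope}.{name}"] = value
    return found


# Hypothetical response after the offending key has been removed.
response = {
    "persistent": {
        "cluster.max_shards_per_node": "3000",
        "search_backpressure.mode": "monitor_only",
    },
    "transient": {},
}

leftover = lingering_settings(response, "search_backpressure.cancellation_burst")
print("still set:" if leftover else "setting removed", leftover or "")
```

Passing the broader prefix search_backpressure. lists every remaining backpressure override, which helps when deciding whether to re-apply the rest of the configuration one key at a time.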