Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.9.0
Describe the issue:
We ran some search operations and noticed that logs appear in OpenSearch with a delay.
In the logs of one of the data containers we observed messages like the following:
"message":"[monitor_only mode] cancelling task [165525] due to high resource consumption [heap usage exceeded [257mb >= 550kb]]"
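To compare many of these entries, the task id and the two heap figures can be pulled out of each line. The following is only a minimal sketch: the regular expression is based solely on the single message quoted above, and the names `CANCEL_RE` and `parse_cancellation` are made up for illustration, so other variants of the message may not match.

```python
import re

# Pattern derived from the one cancellation message quoted in this post.
# Assumption: other cancellation reasons may be worded differently.
CANCEL_RE = re.compile(
    r"\[(?P<mode>\w+) mode\] cancelling task \[(?P<task_id>\d+)\] "
    r"due to high resource consumption "
    r"\[heap usage exceeded \[(?P<used>[\d.]+\w+) >= (?P<allowed>[\d.]+\w+)\]"
)


def parse_cancellation(line):
    """Return mode, task id, and both heap figures, or None if no match."""
    m = CANCEL_RE.search(line)
    if m is None:
        return None
    return {
        "mode": m.group("mode"),
        "task_id": int(m.group("task_id")),
        "used": m.group("used"),
        "allowed": m.group("allowed"),
    }


sample = ('"message":"[monitor_only mode] cancelling task [165525] due to high '
          'resource consumption [heap usage exceeded [257mb >= 550kb]]"')
print(parse_cancellation(sample))
```

Lines that do not contain a cancellation message, such as "Running full sweep", simply return `None`.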
We also observed other log messages, such as “Running full sweep” and “Will delay 20535 miliseconds for next execution”:
"message":"Finished executing attempt_transition_step for .opendistro-job-scheduler-lock"
"message":"Will delay 20535 miliseconds for next execution of job adp-app-logs-2024.04.01"
"message":"Executing attempt_transition_step for adp-app-logs-2024.04.01"
"message":"Finished executing attempt_transition_step for adp-app-logs-2024.04.01"
"message":"Running full sweep"
I understand that these cancellation messages come from search backpressure, but according to the OpenSearch documentation, search backpressure runs in monitor_only mode by default, and “monitor_only” mode doesn’t actually reject requests:
Search backpressure modes
Search backpressure runs in monitor_only (default), enforced, or disabled mode. In the enforced mode, the server rejects search requests. In the monitor_only mode, the server does not actually cancel search requests but tracks statistics about them.
According to the logs above and my node statistics, the search backpressure mode is set to “monitor_only”, so the requests should not have been cancelled or rejected.
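For anyone who wants to verify the mode the same way, here is a rough sketch of reading `search_backpressure.mode` from the cluster settings API. It assumes an endpoint reachable over plain HTTP without authentication, which will not hold for a secured cluster; `fetch_cluster_settings` and `effective_mode` are names invented for this example, and the sample dictionary is hand-written, not real cluster output.

```python
import json
import urllib.request

SETTING = "search_backpressure.mode"


def fetch_cluster_settings(base_url):
    """GET the cluster settings, including defaults, with flat setting keys."""
    url = base_url + "/_cluster/settings?include_defaults=true&flat_settings=true"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def effective_mode(settings):
    """Transient settings override persistent ones, which override defaults."""
    for scope in ("transient", "persistent", "defaults"):
        value = settings.get(scope, {}).get(SETTING)
        if value is not None:
            return value
    return None


# Offline example with a hand-written response shape (an assumption,
# not output captured from a real cluster).
example = {"persistent": {}, "transient": {}, "defaults": {SETTING: "monitor_only"}}
print(effective_mode(example))
```

Against a live cluster this would be called as `effective_mode(fetch_cluster_settings("http://localhost:9200"))`, where the URL is a placeholder.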
Can someone please help me understand the following:
1) Even though search backpressure is in monitor_only mode, why does OpenSearch log “cancelling task [165525] due to high resource consumption [heap usage exceeded [257mb >= 550kb]]” messages? Is there a way to avoid similar issues in the future?
2) What is the reason for the constant “Will delay 20535 miliseconds for next execution of” messages on the OpenSearch data nodes?
3) Which tasks does OpenSearch keep cancelling?
4) The logs say heap usage exceeded [257mb >= 550kb]. What does the 550kb refer to, and is it configured somewhere?
5) Is it possible to determine when a particular log entry was saved into the corresponding index?
Configuration:
jvmHeap:
  data: 4096m
  ingest: 640m
  master: 640m
replicaCount:
  data: 3
  ingest: 2
  master: 3
resources:
  data:
    limits:
      cpu: 3
      memory: 8Gi
    requests:
      cpu: 2.5
      memory: 8Gi