Securing resources for Anomaly Detection

tom_dom · June 29, 2021, 4:12pm

Hello, I have a question about resource management when Anomaly Detection jobs are running on the cluster.
We have a standard scenario:

The cluster is used for monitoring (ETL jobs deliver the data, users consume dashboards).
Some users do additional ad-hoc data analysis.
Some users define new ML jobs using the Anomaly Detection module.

As I understand it, AD jobs run on data nodes by default, so every node in the cluster can run into problems if somebody uses too many resources for AD.
On “standard Elastic” there is a way to dedicate nodes to ML jobs by assigning them the ml role.

So is there a way to dedicate nodes to AD, similar to the ml role?
Or how do you limit AD resource usage, or make sure a resource pool is reserved for core cluster operations, so that AD jobs do not cause problems such as excessive load on data nodes or out-of-memory errors?

BR
TD

ylwu · August 25, 2021, 7:28am

Currently the AD plugin runs on data nodes. We are planning to add an ML role, but we are still researching it. Any suggestions are welcome.

AD has several protection mechanisms to avoid using too many resources:

Circuit breaker: AD stops running if JVM heap usage exceeds 85%.
Dedicated thread pool: AD runs in its own thread pool.
Memory tracking: AD tracks the memory used by its models and limits it.
Detector limit: a dynamic setting caps how many detectors can run per cluster; the default is 1000.
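
To make the first and last mechanisms concrete, here is a minimal Python sketch of a heap-based circuit breaker combined with a detector cap. It is only an illustration of the idea, not the plugin's actual code: the names HeapCircuitBreaker, LimitExceededError and run_detector are hypothetical, and only the 85% threshold and the default of 1000 come from the list above.

```python
from dataclasses import dataclass


class LimitExceededError(RuntimeError):
    """Raised when a guard refuses to start new work (hypothetical name)."""


@dataclass
class HeapCircuitBreaker:
    # 85% is the threshold quoted in the reply above.
    threshold_percent: float = 85.0

    def is_open(self, heap_used_bytes: int, heap_max_bytes: int) -> bool:
        """Return True when heap usage exceeds the threshold."""
        used_percent = 100.0 * heap_used_bytes / heap_max_bytes
        return used_percent > self.threshold_percent


def run_detector(breaker, heap_used_bytes, heap_max_bytes,
                 running_detectors, max_detectors=1000):
    """Refuse to run when either guard trips; otherwise pretend to run."""
    if running_detectors >= max_detectors:
        raise LimitExceededError("detector limit reached")
    if breaker.is_open(heap_used_bytes, heap_max_bytes):
        raise LimitExceededError("memory circuit is open")
    return "ran"


if __name__ == "__main__":
    breaker = HeapCircuitBreaker()
    print(run_detector(breaker, 50, 100, running_detectors=3))
```

Note that a guard of this shape looks at the heap of the whole JVM, so it can trip because of load that has nothing to do with AD itself.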

rlevitsky · November 16, 2023, 8:17am

I just found this topic while running OpenSearch 2.9.0.
I am hitting the circuit breaker while running just three AD jobs on a four-node cluster, even after raising -Xmx from 16GB to 28GB:

[date][INFO ][o.o.a.AnomalyDetectorJobRunner] [v480-myco.com] Start to run AD job jr4fuIsBpstUsi_fgPA2
[date][INFO ][o.o.a.t.AnomalyResultTransportAction] [v480-myco.com] Sending RCF request to LAOZnyk0R0y56B9-kGGdlw for model jr4fuIsBpstUsi_fgPA2_model_rcf_0
[date][ERROR][o.o.a.t.AnomalyResultTransportAction] [v480-myco.com] Received an error from node LAOZnyk0R0y56B9-kGGdlw while doing model inference for jr4fuIsBpstUsi_fgPA2
org.opensearch.transport.RemoteTransportException: [v483-myco.com][10.XXX.ZZ.13:9300][cluster:admin/opendistro/adinternal/rcf/result]
Caused by: org.opensearch.common.io.stream.NotSerializableExceptionWrapper: limit_exceeded_exception: AD memory circuit is broken.
        at org.opensearch.ad.transport.RCFResultTransportAction.doExecute(RCFResultTransportAction.java:63) ~[?:?]

Could you please help me keep our AD jobs from being stopped by the circuit breaker?
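
One way to narrow this down is to see which nodes are actually above the 85% heap threshold mentioned earlier in the thread. The sketch below assumes you have already fetched the JSON body of GET _nodes/stats/jvm and parsed it into a dict; it reads the heap_used_percent field that the node stats API reports per node. The function name and the sample data are hypothetical and only illustrate the shape of the check.

```python
def nodes_over_heap_threshold(node_stats, threshold_percent=85):
    """Return (node name, heap %) pairs above the threshold, highest first.

    node_stats is the parsed response of GET _nodes/stats/jvm.
    """
    over = []
    for node_id, node in node_stats.get("nodes", {}).items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used > threshold_percent:
            # Fall back to the node id if the name is missing.
            over.append((node.get("name", node_id), used))
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample, trimmed to the fields the function reads.
    sample = {
        "nodes": {
            "id-1": {"name": "node-a", "jvm": {"mem": {"heap_used_percent": 62}}},
            "id-2": {"name": "node-b", "jvm": {"mem": {"heap_used_percent": 91}}},
        }
    }
    for name, used in nodes_over_heap_threshold(sample):
        print(f"{name}: heap at {used}%")
```

If the node named in the RemoteTransportException shows up in this list, the breaker is reacting to overall heap pressure on that node rather than to the number of AD jobs.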
