Question: Machine Learning - Node Heap Usage

Hello! My team and I are using Anomaly Detection as a SIEM tool, but we have encountered several problems with our platform.

GitHub issue reference

We are wondering why our coordinator nodes keep going down periodically. Here is a sample of the logs we saw when one of the nodes went down:

  • [2021-03-02T14:44:42,574][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [KBN_0] Exception during establishing a SSL connection: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
  • [2021-03-02T14:45:02,191][WARN ][o.e.m.j.JvmGcMonitorService] [KBN_0] [gc][18279] overhead, spent [3.2s] collecting in the last [3.2s]
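
For context, this is roughly how we watch per-node heap pressure while the detectors run. It is only a sketch assuming the `_cat/nodes` API and the security plugin's basic auth; the host, credentials, and TLS settings below are placeholders, not our real environment:

```python
# Sketch: watch per-node heap pressure while detectors run.
# Assumes the cluster is reachable at https://localhost:9200 with the security
# plugin's basic auth enabled; credentials and TLS verification are placeholders.
import requests

resp = requests.get(
    "https://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,heap.max"},
    auth=("admin", "admin"),
    verify=False,  # only for a test cluster with self-signed certs
)
print(resp.text)
```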

Here is the Kibana & Elasticsearch cluster structure we have:

  • 2 coordinator nodes (installed on the Kibana VMs to ensure load balancing on the cluster): 4 virtual cores, 15 GB RAM, 8 GB heap size (coordinating)
  • 3 master nodes: 4 virtual cores, 15 GB RAM, 8 GB heap size (master); not used in this scenario, from our understanding
  • 15 data nodes: 8 virtual cores, 30 GB RAM, 16 GB heap size (ingest & data)
  • Anomaly detectors: 26 running detectors with around 500 active entities, each using around 600 MiB in total
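
One way to see how much memory each detector's models use, and which node hosts them, is the detector profile API. This is only a sketch assuming the Open Distro `_profile/models` endpoint; the detector id, host, and credentials are placeholders:

```python
# Sketch: inspect model sizes and host nodes for one high-cardinality detector.
# Assumes the Open Distro Anomaly Detection profile API; the detector id,
# host, and credentials below are placeholders.
import requests

detector_id = "YOUR_DETECTOR_ID"  # placeholder
resp = requests.get(
    f"https://localhost:9200/_opendistro/_anomaly_detection/detectors/{detector_id}/_profile/models",
    auth=("admin", "admin"),
    verify=False,  # test cluster with self-signed certs only
)
profile = resp.json()
# Each entry should report a model_id, its size in bytes, and the node hosting it.
for model in profile.get("models", []):
    print(model.get("model_id"), model.get("model_size_in_bytes"), model.get("node_id"))
```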

Trying to understand the reasons behind these failures, we came up with several questions:

  1. How is load balancing done between several coordinator nodes? Is heap usage supposed to be distributed roughly equally between all the coordinator nodes?
  2. How are the coordinator nodes involved with the Anomaly Detection plugin? Is the coordinator node responsible for distributing the trees across different shards? Where are the Random Cut Forest trees stored? Is the coordinator node responsible for collecting and aggregating the final result, such as the anomaly grade?

    When many detectors are launched, the heap usage of the coordinator nodes increases greatly (here: 75% and 80%). How should we scale the coordinator nodes to keep up with the demand? Adding more coordinator nodes? Increasing the heap size? Adding more CPU cores? …
  3. Reducing the batched reduce size (`batched_reduce_size`) can be used as a protection mechanism to reduce the memory overhead per search request. Would this be helpful in the case of detectors as well? (A sample request is sketched after this list.)
  4. Since the detectors are configured on indexes that use rollover, we tried to use only the current write index (to reduce heap space errors). But detectors using the write alias take about 10x more time to initialize than detectors using the indexes directly. Do you have any recommendations on this?
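
Regarding question 3, this is the kind of plain search request where the batched reduce size can be lowered; whether the detector's internal queries honour something similar is exactly what we are asking. The index name, aggregation, host, and credentials below are placeholders:

```python
# Sketch: lower batched_reduce_size on an ordinary search request so fewer
# shard results are reduced at once on the coordinating node.
# Index name, aggregation, host, and credentials are placeholders.
import requests

resp = requests.post(
    "https://localhost:9200/logs-*/_search",
    params={"batched_reduce_size": 64},  # default is 512
    json={
        "size": 0,
        "aggs": {"per_host": {"terms": {"field": "host.ip", "size": 100}}},
    },
    auth=("admin", "admin"),
    verify=False,  # test cluster with self-signed certs only
)
print(resp.json()["took"], "ms")
```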

My team and I would be very grateful for the time you take to answer all these questions.


Hi @Kerudal, from your description “26 running detectors with around 500 active entities”, I guess you are using high-cardinality detectors. I will ask @kaituo to help.

Hi @ylwu, yes, you’re right, we are using high-cardinality detectors (the category field is usually host.ip, host.id, …). Thank you for the contact. A rough sketch of the kind of detector definition we mean is below.
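
This is a minimal sketch of the kind of high-cardinality detector definition we are talking about, pointed at the rollover write alias. It assumes the Open Distro create-detector API, and every name, field, and interval is an illustrative placeholder rather than our real configuration:

```python
# Sketch: create a high-cardinality detector on the rollover write alias.
# Assumes the Open Distro Anomaly Detection REST API; every name, field,
# and interval below is an illustrative placeholder.
import requests

detector = {
    "name": "hc-demo-detector",            # placeholder name
    "description": "HC detector on the rollover write alias",
    "time_field": "@timestamp",
    "indices": ["logs-write-alias"],       # write alias; an index pattern like "logs-*" also works
    "category_field": ["host.ip"],         # high-cardinality category field
    "feature_attributes": [
        {
            "feature_name": "event_count",
            "feature_enabled": True,
            "aggregation_query": {
                "event_count": {"value_count": {"field": "host.ip"}}
            },
        }
    ],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
    "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
}

resp = requests.post(
    "https://localhost:9200/_opendistro/_anomaly_detection/detectors",
    json=detector,
    auth=("admin", "admin"),
    verify=False,  # test cluster with self-signed certs only
)
print(resp.status_code, resp.json().get("_id"))
```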