Node crashes, leader check

Greetings all! First of all, sorry for my English.
I'm having trouble with my ES cluster. Sometimes data nodes crash (one, or more rarely, several at once). The crash can be preceded by different events: a force merge, relocating shards, real-time data upload to indices, creating replicas, but the logs are always the same (I attach a link to DEBUG-level logs below).
About real-time bulk-upload data:

  1. every 10 minutes, 3 indices (15 shards each, no replicas, kNN)
  2. data size ~100k documents (during rush hour) = ~200-400 MB (depending on the index)
  3. upload takes ~60 seconds (upload + flush), translog = 512 MB
  4. in total, 10-15 million documents are collected per day (1 day = 1 index), 270 indices with replicas in total (3 months, 3 different sources)
  5. example index template:

"settings": {
  "index": {
    "knn": "true",
    "number_of_shards": 15,
    "number_of_replicas": 0,
    "knn.space_type": "cosinesimil",
    "knn.algo_param.m": 48,
    "knn.algo_param.ef_construction": 8192,
    "knn.algo_param.ef_search": 8192,
    "max_result_window": 1000000
  }
},
"mappings": {
  "_source": {
    "excludes": [ … ]
  },
  "properties": {
    "detect_id": {
      "type": "long"
    },
    "cam_id": {
      "type": "integer"
    },
    "time_check": {
      "format": "yyyy-MM-dd HH:mm:ss",
      "store": true,
      "type": "date"
    },
    "my_vector": {
      "type": "knn_vector",
      "dimension": 320
    }
  }
}
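To make the real-time upload concrete, here is a minimal sketch of how a `_bulk` payload for documents matching the mapping above could be built. The index name, document values, and the `build_bulk_body` helper are all made up for illustration; only the NDJSON shape (one action line plus one source line per document, with a trailing newline) follows the Bulk API format.

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON _bulk request body: one action line plus one
    source line per document (hypothetical helper, not from the post)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["detect_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

# One example document shaped like the mapping above (values invented)
docs = [
    {"detect_id": 1, "cam_id": 7, "time_check": "2021-06-21 12:00:00",
     "my_vector": [0.1] * 320},
]
body = build_bulk_body("faces-2021-06-21", docs)
# POST body to a coordinator node's /_bulk endpoint with
# Content-Type: application/x-ndjson
```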

The same indices are then force-merged (at night, when the load is low).
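The nightly force merge mentioned above can be sketched as building the `_forcemerge` request URL for yesterday's daily index. The host and index-name pattern are assumptions; the endpoint and `max_num_segments` parameter are from the Elasticsearch REST API.

```python
import datetime

def forcemerge_url(host, index, max_num_segments=1):
    """Build the _forcemerge URL for an index (host/index names are
    placeholders; merging down to 1 segment is a common nightly choice)."""
    return (f"http://{host}:9200/{index}/_forcemerge"
            f"?max_num_segments={max_num_segments}")

day = datetime.date(2021, 6, 21)
url = forcemerge_url("coordinator001", f"faces-{day:%Y-%m-%d}")
# POST to this URL; note force merge is I/O- and CPU-heavy, hence the
# off-peak scheduling described in the post
```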

Cluster formation:

  1. 3 master nodes, 4 cores, RAM = 4 GB, heap = 2 GB, configuration file: open-distro-cluster es-masternode001

node.master: true
node.data: false
node.ingest: false
discovery.seed_hosts: ["", "", ""]
cluster.initial_master_nodes: ["", "", ""]

  2. 3 coordinator nodes, 14 cores, RAM = 30 GB, heap = 15 GB, same configuration file (node.master: false)

  3. 70 data nodes, 18 cores, RAM = 368 GB, heap = 30 GB, same configuration file (node.master: false, node.ingest: true)

  4. changed cluster parameters:

"persistent": {
  "cluster": {
    "routing": {
      "rebalance": {
        "enable": "none"
      },
      "allocation": {
        "allow_rebalance": "indices_all_active",
        "cluster_concurrent_rebalance": "15",
        "node_concurrent_recoveries": "2",
        "disk": {
          "threshold_enabled": "true",
          "watermark": {
            "low": "80%",
            "high": "85%"
          }
        },
        "enable": "all",
        "node_concurrent_outgoing_recoveries": "2"
      }
    },
    "metadata": {
      "perf_analyzer": {
        "state": "0"
      }
    }
  },
  "knn": {
    "algo_param": {
      "index_thread_qty": "4"
    },
    "memory": {
      "circuit_breaker": {
        "limit": "80%",
        "enabled": "true"
      }
    }
  }
},
"transient": {}
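For reference, settings like these are applied with a `PUT _cluster/settings` request. A minimal sketch building such a request body from a subset of the values above (flat dotted keys are an equivalent way to write the nested form):

```python
import json

# A subset of the persistent settings from the post, in flat key form
# (values copied from above; this only builds the request body)
settings_body = {
    "persistent": {
        "cluster.routing.rebalance.enable": "none",
        "cluster.routing.allocation.allow_rebalance": "indices_all_active",
        "cluster.routing.allocation.cluster_concurrent_rebalance": "15",
        "cluster.routing.allocation.node_concurrent_recoveries": "2",
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
        "knn.algo_param.index_thread_qty": "4",
        "knn.memory.circuit_breaker.limit": "80%",
        "knn.memory.circuit_breaker.enabled": "true",
    }
}
payload = json.dumps(settings_body)
# PUT payload to /_cluster/settings with Content-Type: application/json
```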

  5. master log and data-node log:
    ES logs - Google Drive

The first crash was at "2021-06-21T12:12:59,356Z"; I manually restarted the ES service at "2021-06-21T12:47:46,314Z" and it worked normally again at "2021-06-21T12:53:54,577Z". At the same time two other nodes broke; around "2021-06-21T14:14:16,213Z" all nodes in the cluster started working normally again.

The second crash was at "2021-06-21T22:35:17,319Z"; the node started working normally again at "2021-06-22T01:22:43,459Z" (without my doing anything).

My thoughts: it is not a network problem, because every crash coincides with some action beyond the real-time data upload, and the cluster can run normally for a week or more. Monitoring in Zabbix shows no anomalies: CPU utilization is ~25-30%, and RAM utilization on each node depends on the size of its indices. I can provide logs of other crashes, but the picture is the same everywhere: many repeated "leader check" failures, then the node starts working normally again. I don't know what the problem may be and am grateful for any advice! Thanks!
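For context on the "leader check" messages: in Elasticsearch 7.x (which Open Distro 1.12 is based on), a node leaves the cluster after its leader checks fail repeatedly, and that behaviour is governed by a few node settings. The names below are from the Elasticsearch 7.x fault-detection documentation and the values shown are the defaults; raising them can mask an overloaded node rather than fix the root cause, so this is reference information, not a recommendation.

```yaml
# elasticsearch.yml (Elasticsearch 7.x defaults, shown for reference)
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3
```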

hey @doc113 - what version are you using?

I completely forgot to specify - 1.12.0


Hi @searchymcsearchface, @doc113, I just wanted to add that I'm seeing the same issue when using 1.12.0 in the AWS OpenSearch service.

So far, nodes are deterministically disconnecting every time I do (1) a force merge or (2) an index warm-up. For both actions I see the issue arise on indices that use a FAISS model with

  • dimension 512
  • method = ivf
  • encoding = pq
  • m = 256 (I've seen the same at 512)
  • nlist = 2048
  • nprobes = 25
    Trained on 1M example vectors.
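The method configuration listed above corresponds roughly to a k-NN Train API request. A sketch of building that body, assuming the field names from the OpenSearch k-NN plugin's Train API; the training index, field, model id, and space type are invented, since the post doesn't state them:

```python
import json

# Hypothetical Train API body matching the listed parameters
# (index/model names and space_type are assumptions, not from the post)
train_body = {
    "training_index": "train-vectors",   # assumed index of the 1M example vectors
    "training_field": "my_vector",
    "dimension": 512,
    "method": {
        "name": "ivf",
        "engine": "faiss",
        "space_type": "l2",              # assumption: not stated in the post
        "parameters": {
            "nlist": 2048,
            "nprobes": 25,
            "encoder": {"name": "pq", "parameters": {"m": 256}},
        },
    },
}
payload = json.dumps(train_body)
# POST payload to /_plugins/_knn/models/<model_id>/_train
```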

The warm-up actions are on indices with ~3-10M vectors across 8 primary shards, and they very consistently cause a short disconnect for one or more nodes. The cluster recovers afterwards, and a repeated attempt is generally successful.
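For anyone reproducing this, a minimal sketch of the warm-up call being described; the host and index names are placeholders, and the path is the OpenSearch k-NN plugin's warmup endpoint (Open Distro clusters expose it as `/_opendistro/_knn/warmup` instead):

```python
def warmup_url(host, indices):
    """Build the k-NN warmup API URL for one or more indices
    (host and index names here are made up)."""
    return f"https://{host}/_plugins/_knn/warmup/{','.join(indices)}"

url = warmup_url("my-domain.example.com", ["vectors-a", "vectors-b"])
# GET this URL to load the indices' graphs into native memory
```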

My cluster is currently 10 data nodes, each with 32 GB RAM, but we were seeing the same issue when running 40 nodes in our cluster as well.

I'll add that I don't think this happened with an index built using nmslib, only FAISS. nmslib was also much quicker to warm up.

Any thoughts since the last time this topic was visited?