Greetings all! Apologies in advance for my English.
I'm having trouble with an ES cluster. Sometimes data nodes crash (usually one, rarely more). The crash can be preceded by different events: force-merge, relocating shards, real-time data uploads to indices, creating replicas, but the logs are always the same (a link to DEBUG-level logs is attached below).
About the real-time bulk uploads (a minimal sketch of the request is shown after the template below):
- every 10 minutes, into 3 indices (15 shards each, no replicas, kNN)
- data size: ~100k documents (during rush hour) = ~200-400 MB (depending on the index)
- upload takes ~60 seconds (upload + flush), translog = 512 MB
- in total, 10-15 million documents are collected per day (1 day = 1 index), about 270 indices with replicas overall (3 months, 3 different sources)
- example index template:
{
  "settings": {
    "index": {
      "knn": "true",
      "number_of_shards": 15,
      "number_of_replicas": 0,
      "knn.space_type": "cosinesimil",
      "knn.algo_param.m": 48,
      "knn.algo_param.ef_construction": 8192,
      "knn.algo_param.ef_search": 8192,
      "max_result_window": 1000000
    }
  },
  "mappings": {
    "_source": {
      "excludes": [
        "my_vector"
      ]
    },
    "properties": {
      "detect_id": {
        "type": "long"
      },
      "cam_id": {
        "type": "integer"
      },
      "time_check": {
        "format": "yyyy-MM-dd HH:mm:ss",
        "store": true,
        "type": "date"
      },
      "my_vector": {
        "type": "knn_vector",
        "dimension": 320
      }
    }
  }
}
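For reference, each 10-minute upload is a plain _bulk request. A minimal sketch of its shape (index name, IDs and timestamps are made up, and my_vector is shown with only 4 values to keep the lines readable; real documents carry 320-dimensional vectors, as in the template):

POST /detects-2021-06-21/_bulk
{ "index": {} }
{ "detect_id": 1001, "cam_id": 42, "time_check": "2021-06-21 12:00:00", "my_vector": [0.11, 0.07, 0.93, 0.25] }
{ "index": {} }
{ "detect_id": 1002, "cam_id": 7, "time_check": "2021-06-21 12:00:03", "my_vector": [0.54, 0.12, 0.08, 0.66] }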
The same indices are then force-merged (at night, when the load is low).
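The force-merge is the standard _forcemerge API, roughly like this (the index name and max_num_segments here are just an example):

POST /detects-2021-06-21/_forcemerge?max_num_segments=1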
Cluster composition:
- 3 master nodes: 4 cores, RAM=4 GB, heap=2 GB, configuration file:
cluster.name: open-distro-cluster
node.name: es-masternode001
node.master: true
node.data: false
node.ingest: false
network.host: 10.250.9.100
discovery.seed_hosts: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]
cluster.initial_master_nodes: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]
- 3 coordinator nodes: 14 cores, RAM=30 GB, heap=15 GB, same configuration file (node.master: false)
- 70 data nodes: 18 cores, RAM=368 GB, heap=30 GB, same configuration file (node.master: false, node.data: true, node.ingest: true); a sketch of a data-node config is below
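For clarity, a sketch of a data-node elasticsearch.yml, assuming it differs from the master config above only in node name, host, and the role flags (the node name and IP here are made up):

cluster.name: open-distro-cluster
node.name: es-datanode001
node.master: false
node.data: true
node.ingest: true
network.host: 10.250.9.110
discovery.seed_hosts: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]
cluster.initial_master_nodes: ["10.250.9.100", "10.250.9.102", "10.250.9.104"]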
- changed cluster settings:
{
  "persistent": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "none"
        },
        "allocation": {
          "allow_rebalance": "indices_all_active",
          "cluster_concurrent_rebalance": "15",
          "node_concurrent_recoveries": "2",
          "disk": {
            "threshold_enabled": "true",
            "watermark": {
              "low": "80%",
              "high": "85%"
            }
          },
          "enable": "all",
          "node_concurrent_outgoing_recoveries": "2"
        }
      },
      "metadata": {
        "perf_analyzer": {
          "state": "0"
        }
      }
    },
    "knn": {
      "algo_param": {
        "index_thread_qty": "4"
      },
      "memory": {
        "circuit_breaker": {
          "limit": "80%",
          "enabled": "true"
        }
      }
    }
  },
  "transient": {}
}
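For completeness, these were applied as persistent settings through the standard cluster settings API; the equivalent request in flat form looks roughly like this:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none",
    "cluster.routing.allocation.allow_rebalance": "indices_all_active",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "15",
    "cluster.routing.allocation.node_concurrent_recoveries": "2",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": "2",
    "cluster.routing.allocation.disk.threshold_enabled": "true",
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.enable": "all",
    "cluster.metadata.perf_analyzer.state": "0",
    "knn.algo_param.index_thread_qty": "4",
    "knn.memory.circuit_breaker.limit": "80%",
    "knn.memory.circuit_breaker.enabled": "true"
  }
}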
- master-node and data-node logs:
ES logs - Google Drive
First crash at "2021-06-21T12:12:59,356Z"; I manually restarted the ES service at "2021-06-21T12:47:46,314Z", and the node was working normally again at "2021-06-21T12:53:54,577Z". At the same time, two other nodes failed; by around "2021-06-21T14:14:16,213Z" all nodes in the cluster were working normally.
Second crash at "2021-06-21T22:35:17,319Z"; the node recovered on its own at "2021-06-22T01:22:43,459Z" (without my doing anything).
My thoughts: it does not look like a network problem (each crash coincides with some action beyond the routine real-time uploads, and the cluster can otherwise run normally for a week or more). Zabbix monitoring shows no anomalies: CPU utilization is ~25-30%, and RAM utilization on each node depends on the size of its indices. I can provide logs of other crashes, but the picture is the same everywhere: many "leader check" messages, then the node starts working normally again… I do not know what the problem might be, and I would be grateful for any advice! Thanks!