Node crashes, leader check

Greetings all! First of all, sorry for my English.
I'm having trouble with my ES cluster. Sometimes data nodes crash (one, or more rarely, several at once). The crash can be preceded by different events: a force merge, relocating shards, real-time data upload to indices, creating replicas, but the logs are always the same (I attach a link to DEBUG-level logs below).
About real-time bulk-upload data:

  1. every 10 minutes, 3 indices (15 shards each, no replicas, kNN)
  2. data size ~100k documents (during rush hour) = ~200-400 MB (depending on the index)
  3. upload takes ~60 seconds (upload + flush), translog = 512 MB
  4. in total, 10-15 million documents are collected per day (1 day = 1 index), 270 indices with replicas in total (3 months, 3 different sources)
  5. example index template:

"settings": {
  "index": {
    "knn": "true",
    "number_of_shards": 15,
    "number_of_replicas": 0,
    "knn.space_type": "cosinesimil",
    "knn.algo_param.m": 48,
    "knn.algo_param.ef_construction": 8192,
    "knn.algo_param.ef_search": 8192,
    "max_result_window": 1000000
  }
},
"mappings": {
  "_source": {
    "excludes": [ … ]
  },
  "properties": {
    "detect_id": {
      "type": "long"
    },
    "cam_id": {
      "type": "integer"
    },
    "time_check": {
      "format": "yyyy-MM-dd HH:mm:ss",
      "store": true,
      "type": "date"
    },
    "my_vector": {
      "type": "knn_vector",
      "dimension": 320
    }
  }
}
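To make the real-time upload concrete, here is a minimal sketch of how a `_bulk` payload for documents matching the mapping above could be built. The index name, document values, and the `build_bulk_body` helper are all made up for illustration; only the NDJSON shape (one action line plus one source line per document, with a trailing newline) follows the Bulk API format.

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON _bulk request body: one action line plus one
    source line per document (hypothetical helper, not from the post)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["detect_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

# One example document shaped like the mapping above (values invented)
docs = [
    {"detect_id": 1, "cam_id": 7, "time_check": "2021-06-21 12:00:00",
     "my_vector": [0.1] * 320},
]
body = build_bulk_body("faces-2021-06-21", docs)
# POST body to a coordinator node's /_bulk endpoint with
# Content-Type: application/x-ndjson
```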

The same indices are then force-merged (at night, when the load is low).
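The nightly force merge mentioned above can be sketched as building the `_forcemerge` request URL for yesterday's daily index. The host and index-name pattern are assumptions; the endpoint and `max_num_segments` parameter are from the Elasticsearch REST API.

```python
import datetime

def forcemerge_url(host, index, max_num_segments=1):
    """Build the _forcemerge URL for an index (host/index names are
    placeholders; merging down to 1 segment is a common nightly choice)."""
    return (f"http://{host}:9200/{index}/_forcemerge"
            f"?max_num_segments={max_num_segments}")

day = datetime.date(2021, 6, 21)
url = forcemerge_url("coordinator001", f"faces-{day:%Y-%m-%d}")
# POST to this URL; note force merge is I/O- and CPU-heavy, hence the
# off-peak scheduling described in the post
```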

Cluster formation:

  1. 3 master nodes, 4 cores, RAM = 4 GB, heap = 2 GB, configuration file: open-distro-cluster es-masternode001

node.master: true
node.data: false
node.ingest: false
discovery.seed_hosts: ["", "", ""]
cluster.initial_master_nodes: ["", "", ""]

  2. 3 coordinator nodes, 14 cores, RAM = 30 GB, heap = 15 GB, same configuration file (node.master: false)

  3. 70 data nodes, 18 cores, RAM = 368 GB, heap = 30 GB, same configuration file (node.master: false, node.ingest: true)

  4. changed cluster parameters:

"persistent": {
  "cluster": {
    "routing": {
      "rebalance": {
        "enable": "none"
      },
      "allocation": {
        "allow_rebalance": "indices_all_active",
        "cluster_concurrent_rebalance": "15",
        "node_concurrent_recoveries": "2",
        "disk": {
          "threshold_enabled": "true",
          "watermark": {
            "low": "80%",
            "high": "85%"
          }
        },
        "enable": "all",
        "node_concurrent_outgoing_recoveries": "2"
      }
    },
    "metadata": {
      "perf_analyzer": {
        "state": "0"
      }
    }
  },
  "knn": {
    "algo_param": {
      "index_thread_qty": "4"
    },
    "memory": {
      "circuit_breaker": {
        "limit": "80%",
        "enabled": "true"
      }
    }
  }
},
"transient": {}
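For reference, settings like these are applied with a `PUT _cluster/settings` request. A minimal sketch building such a request body from a subset of the values above (flat dotted keys are an equivalent way to write the nested form):

```python
import json

# A subset of the persistent settings from the post, in flat key form
# (values copied from above; this only builds the request body)
settings_body = {
    "persistent": {
        "cluster.routing.rebalance.enable": "none",
        "cluster.routing.allocation.allow_rebalance": "indices_all_active",
        "cluster.routing.allocation.cluster_concurrent_rebalance": "15",
        "cluster.routing.allocation.node_concurrent_recoveries": "2",
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "85%",
        "knn.algo_param.index_thread_qty": "4",
        "knn.memory.circuit_breaker.limit": "80%",
        "knn.memory.circuit_breaker.enabled": "true",
    }
}
payload = json.dumps(settings_body)
# PUT payload to /_cluster/settings with Content-Type: application/json
```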

  5. master log and data-node log:
    ES logs - Google Drive

The first crash was at "2021-06-21T12:12:59,356Z"; I manually restarted the ES service at "2021-06-21T12:47:46,314Z" and it worked normally again at "2021-06-21T12:53:54,577Z". At the same time two other nodes broke; around "2021-06-21T14:14:16,213Z" all nodes in the cluster started working normally again.

The second crash was at "2021-06-21T22:35:17,319Z"; the node started working normally again at "2021-06-22T01:22:43,459Z" (without my doing anything).

My thoughts: it is not a network problem, because every crash coincides with some action beyond the real-time data upload, and the cluster can run normally for a week or more. Monitoring in Zabbix shows no anomalies: CPU utilization is ~25-30%, and RAM utilization on each node depends on the size of its indices. I can provide logs of other crashes, but the picture is the same everywhere: many repeated "leader check" failures, then the node starts working normally again. I don't know what the problem may be and am grateful for any advice! Thanks!
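For context on the "leader check" messages: in Elasticsearch 7.x (which Open Distro 1.12 is based on), a node leaves the cluster after its leader checks fail repeatedly, and that behaviour is governed by a few node settings. The names below are from the Elasticsearch 7.x fault-detection documentation and the values shown are the defaults; raising them can mask an overloaded node rather than fix the root cause, so this is reference information, not a recommendation.

```yaml
# elasticsearch.yml (Elasticsearch 7.x defaults, shown for reference)
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3
```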

hey @doc113 - what version are you using?

I completely forgot to specify - 1.12.0


Hi @searchymcsearchface, @doc113, I just wanted to add that I'm seeing the same issue when using 1.12.0 in the AWS OpenSearch service.

So far, nodes are deterministically disconnecting every time I do (1) a force merge or (2) an index warm-up. For both actions I see the issue arise on indices that use a FAISS model with

  • dimension 512
  • method = ivf
  • encoding = pq
  • m = 256 (I've seen the same at 512)
  • nlist = 2048
  • nprobes = 25
    Trained on 1M example vectors.
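The method configuration listed above corresponds roughly to a k-NN Train API request. A sketch of building that body, assuming the field names from the OpenSearch k-NN plugin's Train API; the training index, field, model id, and space type are invented, since the post doesn't state them:

```python
import json

# Hypothetical Train API body matching the listed parameters
# (index/model names and space_type are assumptions, not from the post)
train_body = {
    "training_index": "train-vectors",   # assumed index of the 1M example vectors
    "training_field": "my_vector",
    "dimension": 512,
    "method": {
        "name": "ivf",
        "engine": "faiss",
        "space_type": "l2",              # assumption: not stated in the post
        "parameters": {
            "nlist": 2048,
            "nprobes": 25,
            "encoder": {"name": "pq", "parameters": {"m": 256}},
        },
    },
}
payload = json.dumps(train_body)
# POST payload to /_plugins/_knn/models/<model_id>/_train
```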

The warm-up actions are on indices with ~3-10M vectors across 8 primary shards, and they very consistently cause a short disconnect for one or more nodes. The cluster recovers afterwards, and a repeated attempt is generally successful.
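For anyone reproducing this, a minimal sketch of the warm-up call being described; the host and index names are placeholders, and the path is the OpenSearch k-NN plugin's warmup endpoint (Open Distro clusters expose it as `/_opendistro/_knn/warmup` instead):

```python
def warmup_url(host, indices):
    """Build the k-NN warmup API URL for one or more indices
    (host and index names here are made up)."""
    return f"https://{host}/_plugins/_knn/warmup/{','.join(indices)}"

url = warmup_url("my-domain.example.com", ["vectors-a", "vectors-b"])
# GET this URL to load the indices' graphs into native memory
```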

My cluster is currently 10 data nodes, each with 32 GB RAM, but we were seeing the same issue when running 40 nodes in our cluster as well.

I'll add that I don't think this happened with an index built using nmslib, only FAISS. nmslib was also much quicker to warm up.

Any thoughts since the last time this topic was visited?