So I did a bit of digging in the OpenSearch logs. It’s a bit hard to keep track of with 9 nodes, but here’s what I found.
First of all, I sometimes see huge waves (10-100) of these messages on all nodes:
[2024-01-05T13:38:22,884][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-master01.example.com] Detected cluster change event for destination migration
These waves sometimes, but not always, coincide with a node crash.
It seems that some nodes randomly fail a health check by a wide margin. I posted a bigger log below, but I sometimes see messages like this:
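To check whether the waves line up with node departures, a rough way to count them per node and per minute is sketched here. This is only an illustrative Python helper of my own: the name count_waves and the regex are assumptions, not anything OpenSearch ships, and it assumes the log line layout shown above.

```python
import re
from collections import Counter

# Assumed layout: [timestamp][LEVEL][logger] [node] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>[A-Z]+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s*\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)

NEEDLE = "Detected cluster change event for destination migration"


def count_waves(lines, needle=NEEDLE):
    """Count log lines containing `needle`, grouped by (node, minute)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group("msg"):
            minute = m.group("ts")[:16]  # e.g. 2024-01-05T13:38
            counts[(m.group("node"), minute)] += 1
    return counts
```

Feeding it the lines of each node's log file and sorting the result by minute should show whether a burst on one node matches a node-left event on the cluster manager.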
[2024-01-05T13:29:52,440][WARN ][o.o.m.f.FsHealthService ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] took [122902ms] which is above the warn threshold of [5s]
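Around the time this message appears, the node leaves the cluster. In the full logs below it is actually removed at 13:28:50, about a minute before the warning is written.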
I don’t know what this health check does, but taking about 123 s against a 5 s warn threshold (roughly 24 times over) seems bad.
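For reference, that factor can be reproduced from the log lines themselves. The sketch below is a hypothetical helper (slow_fs_checks and its regex are my own and assume the FsHealthService message format quoted above). It pulls out the measured duration and the threshold and reports how far over the check ran.

```python
import re

# Assumed message format, taken from the FsHealthService lines in this post.
FS_RE = re.compile(
    r"health check of \[(?P<path>[^\]]+)\] (?:failed, )?took \[(?P<ms>\d+)ms\] "
    r"which is above the (?P<kind>warn|healthy) threshold of "
    r"\[(?P<thr>\d+)(?P<unit>ms|s|m)\]"
)

# Convert the threshold unit printed in the log to milliseconds.
UNIT_MS = {"ms": 1, "s": 1000, "m": 60000}


def slow_fs_checks(lines):
    """Return (path, threshold kind, took_ms, took/threshold) per match."""
    results = []
    for line in lines:
        # finditer also copes with several log entries fused onto one line
        for m in FS_RE.finditer(line):
            took_ms = int(m.group("ms"))
            threshold_ms = int(m.group("thr")) * UNIT_MS[m.group("unit")]
            results.append(
                (m.group("path"), m.group("kind"), took_ms, took_ms / threshold_ms)
            )
    return results
```

For the message quoted above this gives a ratio of about 24.6 against the 5s warn threshold. The same 122902ms is also about twice the 1m healthy threshold that appears in the full log.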
What could be causing this?
Full logs for these two nodes, clm-ent-os-master01 first and then clm-ab-os-warm02 (I had to cut some of the spam from the first example, marked with ...):
[2024-01-05T13:23:51,851][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:52,189][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:53,136][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:24:01,456][INFO ][o.o.j.s.JobSweeper ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:28:30,408][WARN ][o.o.c.InternalClusterInfoService] [clm-ent-os-master01.example.com] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2024-01-05T13:28:50,324][INFO ][o.o.c.c.FollowersChecker ] [clm-ent-os-master01.example.com] FollowerChecker{discoveryNode={clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} health check failed
org.opensearch.transport.RemoteTransportException: [clm-ab-os-warm02.example.com][10.186.24.81:9300][internal:coordination/fault_detection/follower_check]
Caused by: org.opensearch.cluster.coordination.NodeHealthCheckFailureException: handleFollowerCheck: node is unhealthy [healthy threshold breached], rejecting healthy threshold breached
at org.opensearch.cluster.coordination.FollowersChecker.handleFollowerCheck(FollowersChecker.java:209) ~[opensearch-2.11.1.jar:2.11.1]
...
[2024-01-05T13:28:50,326][INFO ][o.o.c.c.FollowersChecker ] [clm-ent-os-master01.example.com] FollowerChecker{discoveryNode={clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2024-01-05T13:28:50,328][INFO ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] updating number_of_replicas to [5] for indices [.opendistro_security, .opensearch-sap-log-types-config]
[2024-01-05T13:28:50,335][INFO ][o.o.c.s.MasterService ] [clm-ent-os-master01.example.com] node-left[{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true} reason: health check failed], term: 11, version: 52182, delta: removed {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:28:50,405][INFO ][o.o.c.s.ClusterApplierService] [clm-ent-os-master01.example.com] removed {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}, term: 11, version: 52182, reason: Publication{term=11, version=52182}
[2024-01-05T13:28:50,406][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Cluster node changed, node removed: true, node added: false
[2024-01-05T13:28:50,406][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Node removed: [0eBeInmKT_GpyI2Pyf7hzw]
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Remove data node from AD version hash ring: 0eBeInmKT_GpyI2Pyf7hzw
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Hash ring build result: true
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Build AD version hash ring successfully
[2024-01-05T13:28:50,407][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:28:50,408][INFO ][o.o.c.r.DelayedAllocationService] [clm-ent-os-master01.example.com] scheduling reroute for delayed shards in [59.9s] (36 delayed shards)
[2024-01-05T13:28:50,410][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000003][2] marking unavailable shards as stale: [YyxspbjDStCe8YshE06lUw]
[2024-01-05T13:28:50,410][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000003][3] marking unavailable shards as stale: [SIiUbLgfTN-IxfYAzcYP8g]
[2024-01-05T13:28:50,433][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:28:50,433][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-job-scheduler-lock][0] marking unavailable shards as stale: [O4wt1sjFR0m_TF6xP976_A]
[2024-01-05T13:28:50,458][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:01,457][INFO ][o.o.j.s.JobSweeper ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:29:02,355][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro_security][0] marking unavailable shards as stale: [BVS4-1ujQuydagP7Ry_OJg]
[2024-01-05T13:29:02,385][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:06,136][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-sap-log-types-config][0] marking unavailable shards as stale: [o6zKaCvqRBW2jDToHLzG7w]
[2024-01-05T13:29:06,164][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,329][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,371][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.23-000012][0] marking unavailable shards as stale: [AJy5zLVHSGGaYJMnLgCRFw]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000002][3] marking unavailable shards as stale: [NuqWQEdKQMSr1GZ16JXWXQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000003][1] marking unavailable shards as stale: [0MRy1AnORVilVo0kKF4A5g]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000003][0] marking unavailable shards as stale: [k3yL9qOcTiWd_2dXCkwLWQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000001][1] marking unavailable shards as stale: [QfjZtkreSn2C44gmft5dXQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000009][0] marking unavailable shards as stale: [NsDc36TcRbiLi7oEweKoXA]
[2024-01-05T13:29:50,374][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,401][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.26-000015][0] marking unavailable shards as stale: [ZOl_x67RTTWOv-TgKoTc7Q]
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000008][0] marking unavailable shards as stale: [jbev-RWyQPKU5kM9UOQmQw]
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-notifications-config][0] marking unavailable shards as stale: [NEGso0NRSbeqLBdEPb9rmg]
[2024-01-05T13:29:50,404][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,424][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,426][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,461][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,483][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,557][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [scip-platmgmt-000001][5] marking unavailable shards as stale: [mNTJL41KSwSVaEkqPdc9Ew]
[2024-01-05T13:29:50,557][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.13-000002][0] marking unavailable shards as stale: [etoOPJUcSGCn6HCIlr07Lg]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000001][3] marking unavailable shards as stale: [9xpuGeddSH69En9_zGIkrA]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-lab-evnfm-000001][0] marking unavailable shards as stale: [O3VBV9LTQaOXXl45xlnn1w]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-prod-evnfm-000001][1] marking unavailable shards as stale: [roN79T7HQ6GcofOGZAsHIg]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000002][0] marking unavailable shards as stale: [A9SYNDlZTQioZ_aL2dlYag]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [rpaserverlogs-scripts-000001][0] marking unavailable shards as stale: [7I4NEnqNRdqImOBT5IDV1Q]
[2024-01-05T13:29:50,590][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,592][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,616][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,617][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2024.01.03-000023][0] marking unavailable shards as stale: [WnlvYuApR6CpCgdOU5G9NA]
[2024-01-05T13:29:50,618][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,652][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,679][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,680][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [clm-logstash-000001][0] marking unavailable shards as stale: [JgJ2uLCsQoKOo_tfgIlEvA]
[2024-01-05T13:29:50,717][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,741][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,741][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.plugins-ml-config][0] marking unavailable shards as stale: [oaW_oAyrSbCZk_5-c-H70g]
[2024-01-05T13:29:50,742][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-observability][0] marking unavailable shards as stale: [wRFWTpsgTjq-AYPGJ2hr-w]
[2024-01-05T13:29:50,775][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,797][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000007][1] marking unavailable shards as stale: [1mgw9gZ2S9iEgASYjgnRUQ]
[2024-01-05T13:29:50,832][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,928][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,941][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000004][1] marking unavailable shards as stale: [MAf53uBNRlOagkrSXTmAtw]
[2024-01-05T13:29:50,965][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,055][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,056][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000010][0] marking unavailable shards as stale: [GQO3QLa0QM2mtv0E4Ojrsg]
[2024-01-05T13:29:51,083][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,083][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-prod-enm-000001][1] marking unavailable shards as stale: [MyDShoQhQ9Gyq7-36ws8-g]
[2024-01-05T13:29:51,111][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,147][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,184][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000002][2] marking unavailable shards as stale: [BY32TIcvTNClupKDxvr5bA]
[2024-01-05T13:29:51,215][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,241][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.18-000007][0] marking unavailable shards as stale: [qHCFhkJDQPiINP_fS8dADg]
[2024-01-05T13:29:51,282][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,351][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.12-1][0] marking unavailable shards as stale: [NoPww9X1TySyHXyxQqexRg]
[2024-01-05T13:29:51,377][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,470][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [scip-exp-000001][2] marking unavailable shards as stale: [zT9dPwyOQKKS6e4R_f0SQg]
[2024-01-05T13:29:51,495][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,564][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:59,875][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000001][0] marking unavailable shards as stale: [wNClhvoGQYiNDLrTaaRv-Q]
[2024-01-05T13:29:59,906][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:59,980][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,449][INFO ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] updating number_of_replicas to [6] for indices [.opendistro_security, .opensearch-sap-log-types-config]
[2024-01-05T13:30:53,449][INFO ][o.o.c.s.MasterService ] [clm-ent-os-master01.example.com] node-join[{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true} join existing leader], term: 11, version: 52227, delta: added {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:30:53,568][INFO ][o.o.c.s.ClusterApplierService] [clm-ent-os-master01.example.com] added {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}, term: 11, version: 52227, reason: Publication{term=11, version=52227}
[2024-01-05T13:30:53,569][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Cluster node changed, node removed: false, node added: true
[2024-01-05T13:30:53,569][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Node added: [0eBeInmKT_GpyI2Pyf7hzw]
[2024-01-05T13:30:53,570][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,570][INFO ][o.o.m.a.MLModelAutoReDeployer] [clm-ent-os-master01.example.com] Model auto reload configuration is false, not performing auto reloading!
[2024-01-05T13:30:53,571][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] Add data node to AD version hash ring: 0eBeInmKT_GpyI2Pyf7hzw
[2024-01-05T13:30:53,571][INFO ][o.o.a.c.HashRing ] [clm-ent-os-master01.example.com] All nodes with known AD version: {dKXWquQqS4eSIlocaQs8xA=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, 0eBeInmKT_GpyI2Pyf7hzw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, PXe_3Kx1TDql7tnyfqj2iw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, 8tcfPVTtQS-YjL9Rz3RrVg=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, XE6MBVc_QPihulr7v8nNkg=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, wEbSa1IgSWy7zFjnsNyvKw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, K4F99P39SxunYOZZOkFMOA=ADNodeInfo{version=2.11.1, isEligibleDataNode=false}, _OvXnh2-QG6G-oUTjTjjqg=ADNodeInfo{version=2.11.1, isEligibleDataNode=false}, Qw0wCSWnREiDlX-2hG66dQ=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}}
[2024-01-05T13:30:53,572][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Hash ring build result: true
[2024-01-05T13:30:53,669][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:34:01,457][INFO ][o.o.j.s.JobSweeper ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:35:53,571][INFO ][o.o.i.i.PluginVersionSweepCoordinator] [clm-ent-os-master01.example.com] Canceling sweep ism plugin version job
[2024-01-05T13:38:21,305][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:49,834][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:49,856][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[epc-platform-lab-000002/S2DhZx81ReK1MUWNZd5xMQ]
[2024-01-05T13:23:49,865][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,060][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[clm-logstash-000001/Nl5sb5O3QuSHCxodHkTkTw]
[2024-01-05T13:23:50,069][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,164][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[epc-platform-lab-000009/OOopx63NTCSv7v8oqD-RZA]
[2024-01-05T13:23:50,170][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,431][INFO ][o.o.i.r.RecoverySourceHandler] [clm-ab-os-warm02.example.com] [epc-platform-lab-000004][1][recover to clm-ab-os-warm04.example.com] finalizing recovery took [6ms]
[2024-01-05T13:23:50,455][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,497][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.plugins-ml-config/WdgvpSKiQTqvzVi8V7izWA]
[2024-01-05T13:23:50,506][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,728][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[bcpe-prod-vnf-000001/tyssx8KVSjm-umZ8DdFrFQ]
[2024-01-05T13:23:50,740][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,841][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,856][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[scip-servmgmt-000001/OKuk_C5bTeaHS5NXuUem3g]
[2024-01-05T13:23:50,865][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:51,482][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,503][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[rpaserverlogs-scripts-000001/zy2xTpmnS0GRj8v8usZTwg]
[2024-01-05T13:23:51,512][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,602][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,617][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro-ism-managed-index-history-2023.12.12-1/K9N6gqgOTm2SWJhUl5rC2w]
[2024-01-05T13:23:51,627][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,720][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:52,072][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:52,094][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro-ism-managed-index-history-2023.12.13-000002/db8gbGbeSDqvmjaLc-Em1g]
[2024-01-05T13:23:52,104][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:52,188][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:53,135][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:24:33,515][INFO ][o.o.j.s.JobSweeper ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:28:53,332][INFO ][o.o.c.c.Coordinator ] [clm-ab-os-warm02.example.com] cluster-manager node [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
at org.opensearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:320) ~[opensearch-2.11.1.jar:2.11.1]
.....
Caused by: org.opensearch.transport.RemoteTransportException: [clm-ent-os-master01.example.com][10.186.24.66:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}] has been removed from the cluster
at org.opensearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:220) ~[opensearch-2.11.1.jar:2.11.1]
[2024-01-05T13:28:53,335][INFO ][o.o.c.s.ClusterApplierService] [clm-ab-os-warm02.example.com] cluster-manager node changed {previous [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}], current []}, term: 11, version: 52181, reason: becoming candidate: onLeaderFailure
[2024-01-05T13:28:53,336][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:03,336][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:13,336][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:23,337][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:33,338][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:33,516][INFO ][o.o.j.s.JobSweeper ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:29:43,339][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:52,440][WARN ][o.o.m.f.FsHealthService ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] took [122902ms] which is above the warn threshold of [5s]
[2024-01-05T13:29:52,441][ERROR][o.o.m.f.FsHealthService ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] failed, took [122902ms] which is above the healthy threshold of [1m]
[2024-01-05T13:29:52,444][WARN ][o.o.t.TransportService ] [clm-ab-os-warm02.example.com] Received response for a request that has timed out, sent [59845ms] ago, timed out [30023ms] ago, action [cluster:monitor/nodes/info[n]], node [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}], id [42278325]
[2024-01-05T13:29:53,339][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:03,340][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:13,341][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:23,342][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:33,343][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:43,343][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:53,344][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] cluster-manager not discovered yet: have discovered [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}, {clm-ab-os-warm04.example.com}{XE6MBVc_QPihulr7v8nNkg}{lSUxtDfKQr6K8EcfWReTHw}{10.186.24.83}{10.186.24.83:9300}{dimmls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, {clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}, {clm-ab-os-master01.example.com}{_OvXnh2-QG6G-oUTjTjjqg}{xiThbGH2QSeWmsobL9t6zQ}{10.186.24.76}{10.186.24.76:9300}{m}{zone=ab, temp=hot, shard_indexing_pressure_enabled=true}]; discovery will continue using [10.186.24.66:9300, 10.186.24.76:9300, 10.186.24.77:9300, 10.186.24.78:9300, 10.186.24.79:9300, 10.186.24.80:9300, 10.186.24.82:9300, 10.186.24.83:9300] from hosts providers and [{clm-ab-os-warm04.example.com}{XE6MBVc_QPihulr7v8nNkg}{lSUxtDfKQr6K8EcfWReTHw}{10.186.24.83}{10.186.24.83:9300}{dimmls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, {clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}, {clm-ab-os-master01.example.com}{_OvXnh2-QG6G-oUTjTjjqg}{xiThbGH2QSeWmsobL9t6zQ}{10.186.24.76}{10.186.24.76:9300}{m}{zone=ab, temp=hot, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 11, last-accepted version 52181 in term 11
[2024-01-05T13:30:53,475][INFO ][o.o.c.s.ClusterApplierService] [clm-ab-os-warm02.example.com] cluster-manager node changed {previous [], current [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}]}, term: 11, version: 52227, reason: ApplyCommitRequest{term=11, version=52227, sourceNode={clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:30:53,564][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,571][INFO ][o.o.d.PeerFinder ] [clm-ab-os-warm02.example.com] setting findPeersInterval to [1s] as node commission status = [true] for local node [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}]
[2024-01-05T13:30:53,654][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opensearch-sap-log-types-config/pj2A9EJkRMayGS0xCbsS-w]
[2024-01-05T13:30:53,665][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,682][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[aaa-000002/CJ3BZmDXT7yy9dCKzL_N7w]
[2024-01-05T13:30:53,694][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-..
[2024-01-05T13:34:33,516][INFO ][o.o.j.s.JobSweeper ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:38:13,945][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [clm-ab-os-warm02.example.com] Exception during establishing a SSL connection: java.io.IOException: Connection timed out
java.io.IOException: Connection timed out
at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:?]
[2024-01-05T13:38:21,301][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,818][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[scip-servmgmt-000001/OKuk_C5bTeaHS5NXuUem3g]
[2024-01-05T13:38:22,826][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,868][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,885][INFO ][o.o.p.PluginsService ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro_security/MQ6u2yc7STy-mw90q88_Jw]