Logstash loses connection to OpenSearch periodically

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.11.1, Logstash OSS v8.9.2

Describe the issue:
I’m using Logstash with the OpenSearch output plugin.
It works fine most of the time, but every 15-30 minutes, Logstash loses connection to all my OpenSearch hosts, with the message (see logs below):

OpenSearch Unreachable: [https://user:xxxxxx@host:9200/][Manticore::SocketTimeout] Read timed out

Sometimes it can’t reconnect for 1-2 minutes, causing a huge queue buildup every time (I use RabbitMQ as buffer).

I see nothing out of the ordinary in any OpenSearch logs, just regular sweeps.
Right now I’m testing a single pipeline with about 5000 messages/sec, but this new OpenSearch cluster is supposed to handle >10x that.
I never had these issues when running a 3-node ODFE cluster with about 12k messages/s.

I tried pinging and rapidly connecting with nc to 9200 on the hosts, but I see no connection loss.

What could be causing this?
Are there some extra settings I’m missing?

Configuration:

output {
        opensearch {
                hosts => [
                        "https://log-ab-os-hot01.log.example.com:9200/",
                        "https://log-ab-os-hot02.log.example.com:9200/",
                        "https://log-ab-os-hot03.log.example.com:9200/",
                        "https://log-ab-os-warm01.log.example.com:9200/",
                        "https://log-ab-os-warm02.log.example.com:9200/",
                        "https://log-ab-os-warm03.log.example.com:9200/",
                        "https://log-ab-os-warm04.log.example.com:9200/"
                ]
                index => "aaa-alias"
                manage_template => false
                ssl => true
                user => logconsumer
                password => password
                cacert => "/data/logstash/certs/cacert.cer"
        }
}

Relevant Logs or Screenshots:

[2023-12-20T17:04:46,529][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-hot01.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,529][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm04.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-warm04.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm04.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,530][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm04.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,530][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,572][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-warm03.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,572][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,576][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,576][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,579][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-warm02.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,579][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,579][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-hot03.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,580][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-hot02.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:46,580][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot03.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:46,580][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-hot02.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:47,664][WARN ][logstash.outputs.opensearch][aaa][id] Marking url as dead. Last error: [LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError] OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out {:url=>https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/, :error_message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError"}
[2023-12-20T17:04:47,664][ERROR][logstash.outputs.opensearch][aaa][id] Attempted to send a bulk request but OpenSearch appears to be unreachable or down {:message=>"OpenSearch Unreachable: [https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/][Manticore::SocketTimeout] Read timed out", :exception=>LogStash::Outputs::OpenSearch::HttpClient::Pool::HostUnreachableError, :will_retry_in_seconds=>2}
[2023-12-20T17:04:47,959][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-hot01.log.example.com:9200/"}
[2023-12-20T17:04:47,965][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-hot02.log.example.com:9200/"}
[2023-12-20T17:04:47,971][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-hot03.log.example.com:9200/"}
[2023-12-20T17:04:47,986][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-warm01.log.example.com:9200/"}
[2023-12-20T17:04:47,990][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-warm02.log.example.com:9200/"}
[2023-12-20T17:04:47,994][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-warm03.log.example.com:9200/"}
[2023-12-20T17:04:47,999][WARN ][logstash.outputs.opensearch] Restored connection to OpenSearch instance {:url=>"https://logconsumer:xxxxxx@log-ab-os-warm04.log.example.com:9200/"}
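
To put numbers on how long each host stays marked dead, a small script could pair the "Marking url as dead" and "Restored connection" lines per host. This is only a sketch: the function name `outage_windows` is made up for illustration, and the regexes assume the exact log layout shown above.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a Logstash log line.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")
# Pulls the host out of the ':url=>' field, skipping any 'user:pass@' prefix.
URL_RE = re.compile(r':url=>"?https?://(?:[^@/\s]+@)?([^:/"\s]+)')


def outage_windows(lines):
    """Yield (host, seconds) for each dead -> restored transition."""
    dead_since = {}
    for line in lines:
        ts_match = TS_RE.match(line)
        url_match = URL_RE.search(line)
        if not (ts_match and url_match):
            continue  # e.g. the ERROR lines, which carry no ':url=>' field
        ts = datetime.strptime(ts_match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
        host = url_match.group(1)
        if "Marking url as dead" in line:
            # Keep the first failure; repeated warnings do not reset the clock.
            dead_since.setdefault(host, ts)
        elif "Restored connection" in line and host in dead_since:
            yield host, (ts - dead_since.pop(host)).total_seconds()
```

Feeding it the lines of the Logstash log file and printing each `(host, seconds)` pair should show whether the longer 1-2 minute outages hit all hosts at once or only some of them.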

RabbitMQ: (screenshot not preserved)

I’m curious what you see on the OpenSearch end. I’m thinking particularly about OpenSearch logs and metrics. The link here is for our own monitoring tool, but whatever you use should help shed some light.

If OpenSearch looks fine, then the network may be the problem, or Logstash itself may be choking. You could monitor its JVM for long GC pauses; it may simply need more heap.
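
For the Logstash side, something like the following could sample GC time from the Logstash monitoring API. It is a minimal sketch that assumes the API is enabled on its default address (localhost:9600); the function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: the Logstash monitoring API listens on its default address.
STATS_URL = "http://localhost:9600/_node/stats/jvm"


def fetch_jvm_stats(url=STATS_URL, timeout=5):
    """Fetch the 'jvm' section of the Logstash node stats API as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["jvm"]


def gc_time_delta(prev, curr):
    """Return milliseconds spent in GC per collector between two samples."""
    deltas = {}
    for name, stats in curr["gc"]["collectors"].items():
        before = prev["gc"]["collectors"][name]["collection_time_in_millis"]
        deltas[name] = stats["collection_time_in_millis"] - before
    return deltas
```

Calling `fetch_jvm_stats()` every 10 seconds or so and passing consecutive samples to `gc_time_delta()` gives the GC time per interval. If the "old" collector accounts for several seconds of an interval right when the read timeouts appear, heap pressure in Logstash becomes a likely suspect.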

So I did a bit of digging in the OpenSearch logs. It’s a bit hard to keep track of with 9 nodes, but here’s what I found.

First of all, I sometimes see huge waves (10-100) of these messages on all nodes:

[2024-01-05T13:38:22,884][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-master01.example.com] Detected cluster change event for destination migration

Not always, but sometimes it’s related to a node crash.
It seems that some nodes randomly fail a health check (by a lot). I posted a bigger log below, but I sometimes see messages like this:

[2024-01-05T13:29:52,440][WARN ][o.o.m.f.FsHealthService  ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] took [122902ms] which is above the warn threshold of [5s]

After this, the node leaves the cluster.
I don’t know what this health check does, but exceeding the 5s warn threshold by a factor of about 24 seems bad.
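
For reference, the factor comes from 122902 ms / 5000 ms ≈ 24.6. With this many nodes, a rough sketch like the one below could pull every such warning out of the logs and compute the same ratio per node. The name `slow_health_checks` is hypothetical, and the regex assumes the threshold is always logged in whole seconds, as in the line above.

```python
import re

# Assumes the threshold is logged in whole seconds, as in the line above.
FS_HEALTH_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[WARN\s*\]\[o\.o\.m\.f\.FsHealthService\s*\] "
    r"\[(?P<node>[^\]]+)\] health check of \[(?P<path>[^\]]+)\] "
    r"took \[(?P<took_ms>\d+)ms\] which is above the warn threshold of "
    r"\[(?P<threshold_s>\d+)s\]"
)


def slow_health_checks(lines):
    """Return one dict per slow FsHealthService check found in the lines."""
    found = []
    for line in lines:
        m = FS_HEALTH_RE.search(line)
        if m is None:
            continue
        took_ms = int(m.group("took_ms"))
        threshold_ms = int(m.group("threshold_s")) * 1000
        found.append({
            "timestamp": m.group("ts"),
            "node": m.group("node"),
            "took_ms": took_ms,
            # How many times over the warn threshold the check ran.
            "ratio": took_ms / threshold_ms,
        })
    return found
```

Running it over the concatenated logs of all nodes and sorting by timestamp should show whether the slow checks cluster on particular nodes or disks, and whether they line up with the Logstash timeouts.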

What could be causing this?

Full logs for these two nodes (I had to cut some of the spam from the first example):

[2024-01-05T13:23:51,851][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:52,189][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:53,136][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:24:01,456][INFO ][o.o.j.s.JobSweeper       ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:28:30,408][WARN ][o.o.c.InternalClusterInfoService] [clm-ent-os-master01.example.com] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2024-01-05T13:28:50,324][INFO ][o.o.c.c.FollowersChecker ] [clm-ent-os-master01.example.com] FollowerChecker{discoveryNode={clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} health check failed
org.opensearch.transport.RemoteTransportException: [clm-ab-os-warm02.example.com][10.186.24.81:9300][internal:coordination/fault_detection/follower_check]
Caused by: org.opensearch.cluster.coordination.NodeHealthCheckFailureException: handleFollowerCheck: node is unhealthy [healthy threshold breached], rejecting healthy threshold breached
        at org.opensearch.cluster.coordination.FollowersChecker.handleFollowerCheck(FollowersChecker.java:209) ~[opensearch-2.11.1.jar:2.11.1]
		...
		
[2024-01-05T13:28:50,326][INFO ][o.o.c.c.FollowersChecker ] [clm-ent-os-master01.example.com] FollowerChecker{discoveryNode={clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=1, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
[2024-01-05T13:28:50,328][INFO ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] updating number_of_replicas to [5] for indices [.opendistro_security, .opensearch-sap-log-types-config]
[2024-01-05T13:28:50,335][INFO ][o.o.c.s.MasterService    ] [clm-ent-os-master01.example.com] node-left[{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true} reason: health check failed], term: 11, version: 52182, delta: removed {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:28:50,405][INFO ][o.o.c.s.ClusterApplierService] [clm-ent-os-master01.example.com] removed {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}, term: 11, version: 52182, reason: Publication{term=11, version=52182}
[2024-01-05T13:28:50,406][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Cluster node changed, node removed: true, node added: false
[2024-01-05T13:28:50,406][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Node removed: [0eBeInmKT_GpyI2Pyf7hzw]
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Remove data node from AD version hash ring: 0eBeInmKT_GpyI2Pyf7hzw
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Hash ring build result: true
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Rebuild AD hash ring for realtime AD with cooldown, nodeChangeEvents size 2
[2024-01-05T13:28:50,407][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Build AD version hash ring successfully
[2024-01-05T13:28:50,407][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:28:50,408][INFO ][o.o.c.r.DelayedAllocationService] [clm-ent-os-master01.example.com] scheduling reroute for delayed shards in [59.9s] (36 delayed shards)
[2024-01-05T13:28:50,410][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000003][2] marking unavailable shards as stale: [YyxspbjDStCe8YshE06lUw]
[2024-01-05T13:28:50,410][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000003][3] marking unavailable shards as stale: [SIiUbLgfTN-IxfYAzcYP8g]
[2024-01-05T13:28:50,433][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:28:50,433][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-job-scheduler-lock][0] marking unavailable shards as stale: [O4wt1sjFR0m_TF6xP976_A]
[2024-01-05T13:28:50,458][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:01,457][INFO ][o.o.j.s.JobSweeper       ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:29:02,355][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro_security][0] marking unavailable shards as stale: [BVS4-1ujQuydagP7Ry_OJg]
[2024-01-05T13:29:02,385][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:06,136][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-sap-log-types-config][0] marking unavailable shards as stale: [o6zKaCvqRBW2jDToHLzG7w]
[2024-01-05T13:29:06,164][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,329][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,371][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.23-000012][0] marking unavailable shards as stale: [AJy5zLVHSGGaYJMnLgCRFw]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000002][3] marking unavailable shards as stale: [NuqWQEdKQMSr1GZ16JXWXQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000003][1] marking unavailable shards as stale: [0MRy1AnORVilVo0kKF4A5g]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000003][0] marking unavailable shards as stale: [k3yL9qOcTiWd_2dXCkwLWQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000001][1] marking unavailable shards as stale: [QfjZtkreSn2C44gmft5dXQ]
[2024-01-05T13:29:50,372][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000009][0] marking unavailable shards as stale: [NsDc36TcRbiLi7oEweKoXA]
[2024-01-05T13:29:50,374][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,401][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.26-000015][0] marking unavailable shards as stale: [ZOl_x67RTTWOv-TgKoTc7Q]
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000008][0] marking unavailable shards as stale: [jbev-RWyQPKU5kM9UOQmQw]
[2024-01-05T13:29:50,402][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-notifications-config][0] marking unavailable shards as stale: [NEGso0NRSbeqLBdEPb9rmg]
[2024-01-05T13:29:50,404][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,424][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,426][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,461][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,483][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,557][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [scip-platmgmt-000001][5] marking unavailable shards as stale: [mNTJL41KSwSVaEkqPdc9Ew]
[2024-01-05T13:29:50,557][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.13-000002][0] marking unavailable shards as stale: [etoOPJUcSGCn6HCIlr07Lg]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000001][3] marking unavailable shards as stale: [9xpuGeddSH69En9_zGIkrA]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-lab-evnfm-000001][0] marking unavailable shards as stale: [O3VBV9LTQaOXXl45xlnn1w]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-prod-evnfm-000001][1] marking unavailable shards as stale: [roN79T7HQ6GcofOGZAsHIg]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000002][0] marking unavailable shards as stale: [A9SYNDlZTQioZ_aL2dlYag]
[2024-01-05T13:29:50,558][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [rpaserverlogs-scripts-000001][0] marking unavailable shards as stale: [7I4NEnqNRdqImOBT5IDV1Q]
[2024-01-05T13:29:50,590][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,592][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,616][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,617][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2024.01.03-000023][0] marking unavailable shards as stale: [WnlvYuApR6CpCgdOU5G9NA]
[2024-01-05T13:29:50,618][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [clm-ent-os-master01.example.com] Cannot move any shard in the cluster as there is no node on which shards can be allocated. Skipping shard iteration
[2024-01-05T13:29:50,652][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,679][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,680][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [clm-logstash-000001][0] marking unavailable shards as stale: [JgJ2uLCsQoKOo_tfgIlEvA]
[2024-01-05T13:29:50,717][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,741][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,741][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.plugins-ml-config][0] marking unavailable shards as stale: [oaW_oAyrSbCZk_5-c-H70g]
[2024-01-05T13:29:50,742][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opensearch-observability][0] marking unavailable shards as stale: [wRFWTpsgTjq-AYPGJ2hr-w]
[2024-01-05T13:29:50,775][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,797][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000007][1] marking unavailable shards as stale: [1mgw9gZ2S9iEgASYjgnRUQ]
[2024-01-05T13:29:50,832][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:50,928][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:50,941][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000004][1] marking unavailable shards as stale: [MAf53uBNRlOagkrSXTmAtw]
[2024-01-05T13:29:50,965][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,055][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,056][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-lab-000010][0] marking unavailable shards as stale: [GQO3QLa0QM2mtv0E4Ojrsg]
[2024-01-05T13:29:51,083][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,083][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [bcpe-prod-enm-000001][1] marking unavailable shards as stale: [MyDShoQhQ9Gyq7-36ws8-g]
[2024-01-05T13:29:51,111][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,147][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,184][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [epc-platform-prod-000002][2] marking unavailable shards as stale: [BY32TIcvTNClupKDxvr5bA]
[2024-01-05T13:29:51,215][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,241][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.18-000007][0] marking unavailable shards as stale: [qHCFhkJDQPiINP_fS8dADg]
[2024-01-05T13:29:51,282][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,351][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [.opendistro-ism-managed-index-history-2023.12.12-1][0] marking unavailable shards as stale: [NoPww9X1TySyHXyxQqexRg]
[2024-01-05T13:29:51,377][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:29:51,470][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [scip-exp-000001][2] marking unavailable shards as stale: [zT9dPwyOQKKS6e4R_f0SQg]
[2024-01-05T13:29:51,495][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:51,564][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:59,875][WARN ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] [aaa-000001][0] marking unavailable shards as stale: [wNClhvoGQYiNDLrTaaRv-Q]
[2024-01-05T13:29:59,906][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:59,980][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,449][INFO ][o.o.c.r.a.AllocationService] [clm-ent-os-master01.example.com] updating number_of_replicas to [6] for indices [.opendistro_security, .opensearch-sap-log-types-config]
[2024-01-05T13:30:53,449][INFO ][o.o.c.s.MasterService    ] [clm-ent-os-master01.example.com] node-join[{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true} join existing leader], term: 11, version: 52227, delta: added {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:30:53,568][INFO ][o.o.c.s.ClusterApplierService] [clm-ent-os-master01.example.com] added {{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}}, term: 11, version: 52227, reason: Publication{term=11, version=52227}
[2024-01-05T13:30:53,569][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Cluster node changed, node removed: false, node added: true
[2024-01-05T13:30:53,569][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Node added: [0eBeInmKT_GpyI2Pyf7hzw]
[2024-01-05T13:30:53,570][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,570][INFO ][o.o.m.a.MLModelAutoReDeployer] [clm-ent-os-master01.example.com] Model auto reload configuration is false, not performing auto reloading!
[2024-01-05T13:30:53,571][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] Add data node to AD version hash ring: 0eBeInmKT_GpyI2Pyf7hzw
[2024-01-05T13:30:53,571][INFO ][o.o.a.c.HashRing         ] [clm-ent-os-master01.example.com] All nodes with known AD version: {dKXWquQqS4eSIlocaQs8xA=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, 0eBeInmKT_GpyI2Pyf7hzw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, PXe_3Kx1TDql7tnyfqj2iw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, 8tcfPVTtQS-YjL9Rz3RrVg=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, XE6MBVc_QPihulr7v8nNkg=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, wEbSa1IgSWy7zFjnsNyvKw=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}, K4F99P39SxunYOZZOkFMOA=ADNodeInfo{version=2.11.1, isEligibleDataNode=false}, _OvXnh2-QG6G-oUTjTjjqg=ADNodeInfo{version=2.11.1, isEligibleDataNode=false}, Qw0wCSWnREiDlX-2hG66dQ=ADNodeInfo{version=2.11.1, isEligibleDataNode=true}}
[2024-01-05T13:30:53,572][INFO ][o.o.a.c.ADClusterEventListener] [clm-ent-os-master01.example.com] Hash ring build result: true
[2024-01-05T13:30:53,669][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:34:01,457][INFO ][o.o.j.s.JobSweeper       ] [clm-ent-os-master01.example.com] Running full sweep
[2024-01-05T13:35:53,571][INFO ][o.o.i.i.PluginVersionSweepCoordinator] [clm-ent-os-master01.example.com] Canceling sweep ism plugin version job
[2024-01-05T13:38:21,305][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ent-os-master01.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:49,834][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:49,856][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[epc-platform-lab-000002/S2DhZx81ReK1MUWNZd5xMQ]
[2024-01-05T13:23:49,865][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,060][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[clm-logstash-000001/Nl5sb5O3QuSHCxodHkTkTw]
[2024-01-05T13:23:50,069][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,164][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[epc-platform-lab-000009/OOopx63NTCSv7v8oqD-RZA]
[2024-01-05T13:23:50,170][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,431][INFO ][o.o.i.r.RecoverySourceHandler] [clm-ab-os-warm02.example.com] [epc-platform-lab-000004][1][recover to clm-ab-os-warm04.example.com] finalizing recovery took [6ms]
[2024-01-05T13:23:50,455][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,497][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.plugins-ml-config/WdgvpSKiQTqvzVi8V7izWA]
[2024-01-05T13:23:50,506][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:50,728][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[bcpe-prod-vnf-000001/tyssx8KVSjm-umZ8DdFrFQ]
[2024-01-05T13:23:50,740][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,841][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:50,856][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[scip-servmgmt-000001/OKuk_C5bTeaHS5NXuUem3g]
[2024-01-05T13:23:50,865][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:51,482][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,503][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[rpaserverlogs-scripts-000001/zy2xTpmnS0GRj8v8usZTwg]
[2024-01-05T13:23:51,512][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,602][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,617][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro-ism-managed-index-history-2023.12.12-1/K9N6gqgOTm2SWJhUl5rC2w]
[2024-01-05T13:23:51,627][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:51,720][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
...
[2024-01-05T13:23:52,072][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:52,094][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro-ism-managed-index-history-2023.12.13-000002/db8gbGbeSDqvmjaLc-Em1g]
[2024-01-05T13:23:52,104][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:52,188][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:23:53,135][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:24:33,515][INFO ][o.o.j.s.JobSweeper       ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:28:53,332][INFO ][o.o.c.c.Coordinator      ] [clm-ab-os-warm02.example.com] cluster-manager node [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}] failed, restarting discovery
org.opensearch.OpenSearchException: node [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
        at org.opensearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:320) ~[opensearch-2.11.1.jar:2.11.1]
		.....
		
Caused by: org.opensearch.transport.RemoteTransportException: [clm-ent-os-master01.example.com][10.186.24.66:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}] has been removed from the cluster
        at org.opensearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:220) ~[opensearch-2.11.1.jar:2.11.1]

[2024-01-05T13:28:53,335][INFO ][o.o.c.s.ClusterApplierService] [clm-ab-os-warm02.example.com] cluster-manager node changed {previous [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}], current []}, term: 11, version: 52181, reason: becoming candidate: onLeaderFailure
[2024-01-05T13:28:53,336][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:29:03,336][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:13,336][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:23,337][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:33,338][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:33,516][INFO ][o.o.j.s.JobSweeper       ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:29:43,339][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: healthy threshold breached
[2024-01-05T13:29:52,440][WARN ][o.o.m.f.FsHealthService  ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] took [122902ms] which is above the warn threshold of [5s]
[2024-01-05T13:29:52,441][ERROR][o.o.m.f.FsHealthService  ] [clm-ab-os-warm02.example.com] health check of [/data/opensearch/data/nodes/0] failed, took [122902ms] which is above the healthy threshold of [1m]
[2024-01-05T13:29:52,444][WARN ][o.o.t.TransportService   ] [clm-ab-os-warm02.example.com] Received response for a request that has timed out, sent [59845ms] ago, timed out [30023ms] ago, action [cluster:monitor/nodes/info[n]], node [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}], id [42278325]
[2024-01-05T13:29:53,339][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:03,340][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:13,341][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:23,342][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:33,343][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:43,343][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] this node is unhealthy: health check failed on [/data/opensearch/data/nodes/0]
[2024-01-05T13:30:53,344][WARN ][o.o.c.c.ClusterFormationFailureHelper] [clm-ab-os-warm02.example.com] cluster-manager not discovered yet: have discovered [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}, {clm-ab-os-warm04.example.com}{XE6MBVc_QPihulr7v8nNkg}{lSUxtDfKQr6K8EcfWReTHw}{10.186.24.83}{10.186.24.83:9300}{dimmls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, {clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}, {clm-ab-os-master01.example.com}{_OvXnh2-QG6G-oUTjTjjqg}{xiThbGH2QSeWmsobL9t6zQ}{10.186.24.76}{10.186.24.76:9300}{m}{zone=ab, temp=hot, shard_indexing_pressure_enabled=true}]; discovery will continue using [10.186.24.66:9300, 10.186.24.76:9300, 10.186.24.77:9300, 10.186.24.78:9300, 10.186.24.79:9300, 10.186.24.80:9300, 10.186.24.82:9300, 10.186.24.83:9300] from hosts providers and [{clm-ab-os-warm04.example.com}{XE6MBVc_QPihulr7v8nNkg}{lSUxtDfKQr6K8EcfWReTHw}{10.186.24.83}{10.186.24.83:9300}{dimmls}{zone=ab, temp=warm, shard_indexing_pressure_enabled=true}, {clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}, {clm-ab-os-master01.example.com}{_OvXnh2-QG6G-oUTjTjjqg}{xiThbGH2QSeWmsobL9t6zQ}{10.186.24.76}{10.186.24.76:9300}{m}{zone=ab, temp=hot, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 11, last-accepted version 52181 in term 11
[2024-01-05T13:30:53,475][INFO ][o.o.c.s.ClusterApplierService] [clm-ab-os-warm02.example.com] cluster-manager node changed {previous [], current [{clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}]}, term: 11, version: 52227, reason: ApplyCommitRequest{term=11, version=52227, sourceNode={clm-ent-os-master01.example.com}{K4F99P39SxunYOZZOkFMOA}{H40NiDceSaelFesqkkKrnA}{10.186.24.66}{10.186.24.66:9300}{m}{zone=ent, temp=hot, shard_indexing_pressure_enabled=true}}
[2024-01-05T13:30:53,564][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,571][INFO ][o.o.d.PeerFinder         ] [clm-ab-os-warm02.example.com] setting findPeersInterval to [1s] as node commission status = [true] for local node [{clm-ab-os-warm02.example.com}{0eBeInmKT_GpyI2Pyf7hzw}{tCh6ZOinTJ-OnnHPFqdmPQ}{10.186.24.81}{10.186.24.81:9300}{dimls}{temp=warm, zone=ab, shard_indexing_pressure_enabled=true}]
[2024-01-05T13:30:53,654][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opensearch-sap-log-types-config/pj2A9EJkRMayGS0xCbsS-w]
[2024-01-05T13:30:53,665][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:30:53,682][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[aaa-000002/CJ3BZmDXT7yy9dCKzL_N7w]
[2024-01-05T13:30:53,694][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-..
[2024-01-05T13:34:33,516][INFO ][o.o.j.s.JobSweeper       ] [clm-ab-os-warm02.example.com] Running full sweep
[2024-01-05T13:38:13,945][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [clm-ab-os-warm02.example.com] Exception during establishing a SSL connection: java.io.IOException: Connection timed out
java.io.IOException: Connection timed out
        at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[?:?]

[2024-01-05T13:38:21,301][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,818][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[scip-servmgmt-000001/OKuk_C5bTeaHS5NXuUem3g]
[2024-01-05T13:38:22,826][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,868][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [clm-ab-os-warm02.example.com] Detected cluster change event for destination migration
[2024-01-05T13:38:22,885][INFO ][o.o.p.PluginsService     ] [clm-ab-os-warm02.example.com] PluginService:onIndexModule index:[.opendistro_security/MQ6u2yc7STy-mw90q88_Jw]

I know it’s a late follow-up, but I think we know at this point that it’s OpenSearch that’s failing for some reason. The logs above show node 0eBeInmKT_GpyI2Pyf7hzw (clm-ab-os-warm02) being removed from the cluster. Typically that happens when the cluster coordinator fails to ping the node several times in a row. It would be interesting to see the metrics and logs of that node in particular.
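One way to narrow down that node's logs is to pull out the FsHealthService entries, since the warm02 log above reports a data-path health check that took 122902ms. Below is a minimal Python sketch that scans log text for those entries. The function name `slow_fs_checks` and the 5000 ms default are my own choices for illustration (the default mirrors the 5s warn threshold printed in the log), and the regex is only fitted to the line format pasted in this thread.

```python
import re

# Fitted to the FsHealthService lines pasted above, e.g.
# [ts][WARN ][o.o.m.f.FsHealthService  ] [node] health check of [path] took [122902ms] ...
FS_HEALTH = re.compile(
    r"\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[o\.o\.m\.f\.FsHealthService\s*\] "
    r"\[(?P<node>[^\]]+)\] health check of \[(?P<path>[^\]]+)\] "
    r"(?:failed, )?took \[(?P<ms>\d+)ms\]"
)


def slow_fs_checks(log_text, min_ms=5000):
    """Return (timestamp, node, path, ms) for each health check at or above min_ms.

    Hypothetical helper; min_ms defaults to the 5s warn threshold seen in the log.
    """
    hits = []
    for m in FS_HEALTH.finditer(log_text):
        ms = int(m.group("ms"))
        if ms >= min_ms:
            hits.append((m.group("ts"), m.group("node"), m.group("path"), ms))
    return hits


# Two lines copied from the warm02 log in this thread.
sample = (
    "[2024-01-05T13:29:52,440][WARN ][o.o.m.f.FsHealthService  ] "
    "[clm-ab-os-warm02.example.com] health check of "
    "[/data/opensearch/data/nodes/0] took [122902ms] which is above the "
    "warn threshold of [5s]\n"
    "[2024-01-05T13:29:52,441][ERROR][o.o.m.f.FsHealthService  ] "
    "[clm-ab-os-warm02.example.com] health check of "
    "[/data/opensearch/data/nodes/0] failed, took [122902ms] which is above "
    "the healthy threshold of [1m]\n"
)

hits = slow_fs_checks(sample)
for ts, node, path, ms in hits:
    print(f"{ts} {node} {path} {ms / 1000:.1f}s")
```

Running it over the full log of each data node (read the file and pass its contents as `log_text`) would show whether the slow checks line up with the times Logstash reports the read timeouts. If they do, the disk behind that data path is worth looking at before any Logstash settings.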