Hello,
I have found that sometimes, if a replica node restarts while it is applying changes from the master, replication fails.
Since I am running the dev and staging environments on preemptible k8s nodes, this is really annoying.
Any suggestions on how to avoid these failures?
Thanks in advance.
[opensearch@opensearch-replica-master-2 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "FAILED",
  "reason" : "",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility"
}
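For reference, here is a minimal sketch of how the status check above could be scripted, so that a failed follower index is noticed soon after a node is preempted. The helper names (`status_url`, `is_failed`, `fetch_status`) are hypothetical, and the only status value it relies on is the `FAILED` shown in the response above. It only detects the failure; it does not avoid it.

```python
import base64
import json
import urllib.request


def status_url(base_url: str, index: str) -> str:
    """Build the replication status URL used in the curl call above."""
    return f"{base_url.rstrip('/')}/_plugins/_replication/{index}/_status"


def is_failed(status_body: str) -> bool:
    """Return True when the status response reports "FAILED"."""
    # Only the "FAILED" value seen in this post is checked; other states are not interpreted.
    return json.loads(status_body).get("status") == "FAILED"


def fetch_status(base_url: str, index: str, user: str, password: str) -> str:
    """Fetch the raw status body with HTTP basic auth (same request as the curl call)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        status_url(base_url, index),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed values, taken from the curl command in this post.
    body = fetch_status("http://localhost:9200", "cadence-visibility", "admin", "admin")
    print("replication failed" if is_failed(body) else "replication not failed")
```

What to do once the failure is detected (for example, restarting replication by hand) is left out on purpose, since that is exactly the open question here.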
Logs:
[2021-12-15T16:09:17,745][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,749][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,749][WARN ][o.o.r.m.s.ReplicationMetadataStore] [opensearch-replica-master-2] Encountered a failure while executing in org.opensearch.action.admin.cluster.health.ClusterHealthRequest@39ba4ba7. Retrying in 10 seconds.
[2021-12-15T16:09:17,784][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,784][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,785][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,786][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,789][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,790][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,791][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,791][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][4]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][4], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=0QcakcUYQUe27BUo9bZ8_Q]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,792][WARN ][o.o.r.a.r.TransportReplayChangesAction] [opensearch-replica-master-2] [[cadence-visibility][3]] failed to perform indices:data/write/plugins/replication/changes on replica [cadence-visibility][3], node[MSw0V5lyQxCFTS_RlfWPWg], [R], s[STARTED], a[id=I3EwkNNuRFKgo9Q4uhBbvQ]
at org.opensearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1298) [opensearch-1.2.0.jar:1.2.0]
at org.opensearch.action.support.replication.ReplicationOperation$3.tryAction(ReplicationOperation.java:306) [opensearch-1.2.0.jar:1.2.0]
Suppressed: org.opensearch.transport.NodeDisconnectedException: [opensearch-replica-master-1][10.196.57.16:9300][indices:data/write/plugins/replication/changes[r]] disconnected
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@64df2a3a}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@70679a75}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@509844e2}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@2444cded}
[2021-12-15T16:09:17,924][ERROR][o.o.r.t.s.TranslogSequencer] [opensearch-replica-master-2] [cadence-visibility][3] Failed replaying changes. Failure:0:org.opensearch.action.support.replication.ReplicationResponse$ShardInfo$Failure@21dda8c5}
[2021-12-15T16:09:17,924][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#10]: Unable to get changes from seqNo: 392578. kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=StandaloneCoroutine{Cancelling}@5f4d1f7c
Caused by: ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
[2021-12-15T16:09:17,925][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#2]: Unable to get changes from seqNo: 393171. kotlinx.coroutines.JobCancellationException: Parent job is Cancelling; job=StandaloneCoroutine{Cancelling}@532ee7
Caused by: ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,926][ERROR][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#4]: ShardReplicationTask: Caught downstream exception ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,926][ERROR][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#3]: ShardReplicationTask: Caught downstream exception ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,927][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#5]: Going to mark ShardReplicationTask as Failed with ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:17,927][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#9]: Going to mark ShardReplicationTask as Failed with ReplicationException[failed to replay changes]
at org.opensearch.replication.task.shard.TranslogSequencer$sequencer$1$1.invokeSuspend(TranslogSequencer.kt:85)
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
Suppressed: ReplicationException[failed to replay changes]
[2021-12-15T16:09:18,233][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#1]: Waiting 600000 millis for IndexReplicationTask to respond to failure of shard task
[2021-12-15T16:09:18,322][INFO ][o.o.r.a.p.TransportPauseIndexReplicationAction] [opensearch-replica-master-2] Pausing index replication on index:cadence-visibility
[2021-12-15T16:09:18,364][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#8]: Waiting 600000 millis for IndexReplicationTask to respond to failure of shard task
[2021-12-15T16:09:18,433][WARN ][o.o.c.r.a.AllocationService] [opensearch-replica-master-2] [.replication-metadata-store][0] marking unavailable shards as stale: [h6iKCCCiShadAlL546t8hQ]
[2021-12-15T16:09:18,737][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][clusterApplierService#updateTask][T#1]: Pause state received for index cadence-visibility. Cancelling [cadence-visibility][3] task
[2021-12-15T16:09:18,737][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][clusterApplierService#updateTask][T#1]: Pause state received for index cadence-visibility. Cancelling [cadence-visibility][4] task
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] opensearch[opensearch-replica-master-2][replication_follower][T#6]: Received cancellation of ShardReplicationTask java.util.concurrent.CancellationException: Shard replication task received pause.
at org.opensearch.replication.task.CrossClusterReplicationTask.cancelTask(CrossClusterReplicationTask.kt:87)
at org.opensearch.replication.task.shard.ShardReplicationTask.access$cancelTask(ShardReplicationTask.kt:60)
at org.opensearch.replication.task.shard.ShardReplicationTask$ClusterStateListenerForTaskInterruption.clusterChanged(ShardReplicationTask.kt:187)
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] Going to mark ShardReplicationTask:146988 task as completed
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] opensearch[opensearch-replica-master-2][replication_follower][T#7]: Received cancellation of ShardReplicationTask java.util.concurrent.CancellationException: Shard replication task received pause.
at org.opensearch.replication.task.CrossClusterReplicationTask.cancelTask(CrossClusterReplicationTask.kt:87)
at org.opensearch.replication.task.shard.ShardReplicationTask.access$cancelTask(ShardReplicationTask.kt:60)
at org.opensearch.replication.task.shard.ShardReplicationTask$ClusterStateListenerForTaskInterruption.clusterChanged(ShardReplicationTask.kt:187)
[2021-12-15T16:09:18,738][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] Going to mark ShardReplicationTask:147000 task as completed
[2021-12-15T16:09:18,806][INFO ][o.o.r.a.p.TransportPauseIndexReplicationAction] [opensearch-replica-master-2] Pausing index replication on index:cadence-visibility
[2021-12-15T16:09:19,033][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][3] Successfully persisted task status
[2021-12-15T16:09:19,033][INFO ][o.o.r.t.s.ShardReplicationTask] [opensearch-replica-master-2] [cadence-visibility][4] Successfully persisted task status
[2021-12-15T16:09:20,463][WARN ][o.o.p.PersistentTasksClusterService] [opensearch-replica-master-2] persistent task replication:index:cadence-visibility failed
at org.opensearch.replication.task.index.IndexReplicationTask$failReplication$2.invokeSuspend(IndexReplicationTask.kt:278) ~[?:?]