Unable to restart replication after stopping

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 1.3.2

Describe the issue:
My issue is similar to this issue

After a replication job has been paused for a number of hours, stopping it does not delete the task, and when I attempt to recreate it, it states that a task is already running.

I do see several paused replication tasks when querying the tasks:
"action": "indices:admin/plugins/replication/index/pause"

However, they’re not cancellable.
"cancellable": false,

Configuration:

Relevant Logs or Screenshots:
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"task with id {replication:index:index_v1} already exist"}],"type":"resource_already_exists_exception","reason":"task with id {replication:index:index_v1} already exist"},"status":400}

What can I do to restart these replication jobs? This occurs with both single-index replication and auto-follow replication.

Thanks!

Can you try killing the tasks using the tasks API?

Yes, but they’re not cancellable:

{"type":"failed_node_exception","reason":"Failed node
doesn’t support cancellation

Weird! We’ve been able to cancel the replication related tasks with POST _tasks/<task_id>/_cancel
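For anyone following along, here is a rough Python sketch of the two tasks API calls being discussed: listing tasks filtered by action, then cancelling one by ID. The endpoint URL, the helper names and the wildcard filter are assumptions for illustration, and authentication/TLS are left out. Note that a task reporting "cancellable": false will still be rejected, as in the error above.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:9200"  # assumption: follower cluster endpoint


def list_tasks_request(actions="*replication*", base_url=BASE_URL):
    """Build GET _tasks?actions=...&detailed=true to list matching tasks."""
    # Keep '*', ':' and '/' unescaped so the action pattern stays readable.
    pattern = quote(actions, safe="*:/")
    url = f"{base_url}/_tasks?actions={pattern}&detailed=true"
    return urllib.request.Request(url, method="GET")


def cancel_task_request(task_id, base_url=BASE_URL):
    """Build POST _tasks/<task_id>/_cancel for a single task."""
    url = f"{base_url}/_tasks/{quote(task_id, safe=':')}/_cancel"
    return urllib.request.Request(url, method="POST")


if __name__ == "__main__":
    # Hypothetical task ID in the usual <node_id>:<number> form.
    req = cancel_task_request("node1:123")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

The requests are only built here, not sent, so the URLs can be checked before pointing this at a real cluster.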

@ankikala
Any advice? I am trying to start replication with:

PUT /_plugins/_replication/eddie-aws-000002/_start?pretty
{   "leader_alias": "connection-to-aws",   "leader_index": "eddie-aws-000002",   "use_roles":{     "leader_cluster_role":"all_access","follower_cluster_role":"all_access"  }}

In Dashboards I see this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "request body or source parameter is required"
      }
    ],
    "type" : "parse_exception",
    "reason" : "request body or source parameter is required"
  },
  "status" : 400
}

In the log I see these errors:

[WARN ][o.o.p.PersistentTasksClusterService] [pesmaster-node2] persistent task replication:index:eddie-aws-000002 failed
Jan  5 12:28:54 pesmaster02-spc pesmaster-node2[1664]: java.lang.IllegalArgumentException: this node does not have the remote_cluster_client role
[WARN ][o.o.c.s.ClusterApplierService] [pesmaster-node2] failed to notify ClusterStateListener
Jan  5 12:28:54 pesmaster02-spc pesmaster-node2[1664]: java.lang.IllegalStateException: p must not be null
[ERROR][o.o.r.a.i.TransportReplicateIndexMasterNodeAction] [pesmaster-node2] Failed to trigger replication for eddie-aws-000002 - java.lang.IllegalStateException: Timed out when waiting for persistent task after 30s

I have the remote cluster client role on the other side and I am already replicating one index, but I can't replicate another one.

@vnovotny98 It looks like the request and the logs are not related.

Regarding the request, the 400 error suggests that it was not constructed correctly: the server did not receive a request body. If you are sending it from Dev Tools, make sure there are no extra line breaks between the request line and the body, or try executing it with curl.
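If it helps to rule out console formatting, the same start request can be built outside Dev Tools. Below is a minimal Python sketch, assuming a follower cluster at http://localhost:9200 without authentication (both are placeholders, as is the helper name); the index, alias and role values are the ones from the post above. The body is serialized explicitly, so it cannot get detached from the request line.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: follower cluster endpoint


def start_replication_request(follower_index, leader_alias, leader_index,
                              leader_role="all_access",
                              follower_role="all_access",
                              base_url=BASE_URL):
    """Build PUT /_plugins/_replication/<follower_index>/_start with a JSON body."""
    body = {
        "leader_alias": leader_alias,
        "leader_index": leader_index,
        "use_roles": {
            "leader_cluster_role": leader_role,
            "follower_cluster_role": follower_role,
        },
    }
    return urllib.request.Request(
        f"{base_url}/_plugins/_replication/{follower_index}/_start?pretty",
        data=json.dumps(body).encode("utf-8"),  # body travels with the request
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = start_replication_request(
        "eddie-aws-000002", "connection-to-aws", "eddie-aws-000002")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

If this version gets past the parse_exception, the problem was in how the console submitted the body rather than in the body itself.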

Regarding the logs:

java.lang.IllegalArgumentException: this node does not have the remote_cluster_client role

It looks like the remote cluster is not set up correctly.
If you’ve overridden node.roles in opensearch.yml on the follower cluster, make sure it also includes the remote_cluster_client role.
Also, verify the remote cluster connection in the follower's cluster settings.
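One way to check the roles is to look at the "roles" list that GET _nodes returns for each node. Here is a small Python sketch of that check; the helper name is made up, and the sample data is hypothetical (only pesmaster-node2 is a name taken from the logs above, the rest is invented to show the shape of the response).

```python
def nodes_missing_role(nodes_info, role="remote_cluster_client"):
    """Return sorted names of nodes whose 'roles' list lacks the given role.

    nodes_info is the parsed JSON of a GET _nodes response.
    """
    missing = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        if role not in node.get("roles", []):
            # Fall back to the node ID if the name is absent.
            missing.append(node.get("name", node_id))
    return sorted(missing)


# Hypothetical sample shaped like a GET _nodes response (not real cluster output).
sample = {
    "nodes": {
        "id-a": {"name": "pesmaster-node2", "roles": ["cluster_manager"]},
        "id-b": {"name": "data-node1",
                 "roles": ["data", "ingest", "remote_cluster_client"]},
    }
}

if __name__ == "__main__":
    print(nodes_missing_role(sample))
```

Any node this reports would need remote_cluster_client added to node.roles in its opensearch.yml, followed by a restart of that node.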