Error on resume replication

Hi,

I have found something weird using 1.2.0.

I start the replication and everything works fine, but if I pause it and then resume it, I get an error. When I check the status, it looks like it's running, but the follower checkpoint does not move forward.

[opensearch@opensearch-replica-master-2 ~]$ curl -XPOST -k -H 'Content-Type: application/json' 'http://admin:xxx@localhost:9200/_plugins/_replication/cadence-visibility/_pause?pretty' -d '{}'
{
  "acknowledged" : true
}

[opensearch@opensearch-replica-master-2 ~]$ curl -XGET -u admin:xxx 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "PAUSED",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility"
}

Some minutes later...

[opensearch@opensearch-replica-master-2 ~]$ curl -XPOST -k -H 'Content-Type: application/json' 'http://admin:xxx@localhost:9200/_plugins/_replication/cadence-visibility/_resume?pretty' -d '{}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "resource_already_exists_exception",
        "reason" : "task with id {replication:index:cadence-visibility} already exist"
      }
    ],
    "type" : "resource_already_exists_exception",
    "reason" : "task with id {replication:index:cadence-visibility} already exist"
  },
  "status" : 400
}

[opensearch@opensearch-replica-master-2 ~]$ curl -XGET -u admin:xxx 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility",
  "syncing_details" : {
    "leader_checkpoint" : 765518,
    "follower_checkpoint" : 764450,
    "seq_no" : 764450
  }
}

I tried with 2 different clusters running 1.2.0 and the results are the same.

After restarting the coordinator node, the follower checkpoint starts to move…

[opensearch@opensearch-replica-master-2 ~]$ curl -u admin:admin localhost:9200/_cat/nodes
10.196.38.7  43 79 3 0.51 0.59 0.66 dimr - opensearch-replica-master-1
10.196.2.144 14 85 3 0.81 1.05 1.52 dimr * opensearch-replica-master-0
10.196.43.35 39 77 3 1.54 2.06 2.08 dimr - opensearch-replica-master-2

❯ kubectl delete po opensearch-replica-master-0 -n cadence-test

[opensearch@opensearch-replica-master-2 ~]$ curl -XGET -u admin:admin 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty'
{
  "status" : "SYNCING",
  "reason" : "User initiated",
  "leader_alias" : "master",
  "leader_index" : "cadence-visibility",
  "follower_index" : "cadence-visibility",
  "syncing_details" : {
    "leader_checkpoint" : 765518,
    "follower_checkpoint" : 765518,
    "seq_no" : 765518
  }
}
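For anyone checking for the same stall, the gap between the two checkpoints can be computed from the pretty-printed status output. This is only a rough sketch: the helper name `checkpoint_lag` is made up for illustration, and it assumes the `?pretty` layout shown above, with one field per line.

```shell
# Hypothetical helper: reads a pretty-printed _status response on stdin
# and prints leader_checkpoint minus follower_checkpoint.
# Exits non-zero if either field is missing (for example while PAUSED).
checkpoint_lag() {
  awk -F'[:,]' '
    /"leader_checkpoint"/   { gsub(/[^0-9-]/, "", $2); leader = $2 }
    /"follower_checkpoint"/ { gsub(/[^0-9-]/, "", $2); follower = $2 }
    END {
      if (leader != "" && follower != "") print leader - follower
      else exit 1
    }
  '
}

# Example call against the follower cluster (same URL as above):
# curl -s -u admin:xxx 'http://localhost:9200/_plugins/_replication/cadence-visibility/_status?pretty' | checkpoint_lag
```

If the printed gap stays above zero and does not shrink across repeated calls while the status says SYNCING, that matches the behaviour described here.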

@saikaranam @searchymcsearchface Could you please take a look at it?

Let's broaden this out a bit, @ccr-devs.

Also, make sure you update to 1.2.1 soon.

@stdmje Would it be possible to share the logs from the node where the task with id replication:index:cadence-visibility is running, so we can debug further?

  • We can get the running tasks from the _cat/tasks?v API. Its output should include the node each task is running on.
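
As a rough sketch of that lookup, the tasks listing can be narrowed down to replication-related rows. The helper names below are made up for illustration, and matching on the string "replication" is an assumption about how the task shows up in the listing; adjust the pattern if your output differs. Host and credentials are the ones used earlier in this thread.

```shell
# Hypothetical helper: keep the header row plus any row mentioning
# "replication" (assumed to appear in the action column).
filter_replication_tasks() {
  awk 'NR == 1 || /replication/'
}

# Hypothetical wrapper: fetch the tasks table from the local node and filter it.
list_replication_tasks() {
  curl -s -u admin:xxx 'http://localhost:9200/_cat/tasks?v' | filter_replication_tasks
}

# Run on the follower cluster:
# list_replication_tasks
```

The node column of the remaining rows should point to the node whose logs are worth collecting.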

Now everything works as expected. Tomorrow I will set up a new staging environment, so I will let you know.

I am not able to reproduce the error again.

If I get the error again, I will let you know.

Thanks for the support!
