Cross-Cluster Replication via NAT in OpenSearch 2.17.1

Version
2.17.1
Plugins installed:
all-nodes opensearch-alerting 2.17.1.0
all-nodes opensearch-anomaly-detection 2.17.1.0
all-nodes opensearch-asynchronous-search 2.17.1.0
all-nodes opensearch-cross-cluster-replication 2.17.1.0
all-nodes opensearch-custom-codecs 2.17.1.0
all-nodes opensearch-flow-framework 2.17.1.0
all-nodes opensearch-geospatial 2.17.1.0
all-nodes opensearch-index-management 2.17.1.0
all-nodes opensearch-job-scheduler 2.17.1.0
all-nodes opensearch-knn 2.17.1.0
all-nodes opensearch-ml 2.17.1.0
all-nodes opensearch-neural-search 2.17.1.0
all-nodes opensearch-notifications 2.17.1.0
all-nodes opensearch-notifications-core 2.17.1.0
all-nodes opensearch-observability 2.17.1.0
all-nodes opensearch-performance-analyzer 2.17.1.0
all-nodes opensearch-reports-scheduler 2.17.1.0
all-nodes opensearch-security 2.17.1.0
all-nodes opensearch-security-analytics 2.17.1.0
all-nodes opensearch-skills 2.17.1.0
all-nodes opensearch-sql 2.17.1.0
all-nodes opensearch-system-templates 2.17.1.0
all-nodes prometheus-exporter 2.17.1.0
all-nodes query-insights 2.17.1.0

Dear OpenSearch Support,

We are experiencing issues with cross-cluster replication (CCR). We are currently on OpenSearch version 2.17.1.

Cluster Setup

We have two OpenSearch clusters:

  • Primary cluster: 15 nodes (3 master, 6 hot, 3 warm, 3 cold)
  • Secondary cluster: 5 nodes (1 master, 2 hot, 1 warm, 1 cold)

Each node type is on its own VLAN (Master VLAN, Hot VLAN, etc.).
The two clusters run permanently and reach each other over NAT, where:

  • Ports 9200 and 9300 are open between corresponding node types:

    • master primary ↔ master secondary
    • hot primary ↔ hot secondary
    • warm primary ↔ warm secondary
    • cold primary ↔ cold secondary
  • NAT maps external IPs for primary nodes (e.g. nat-master1-ip:9300)

  • Nodes in both clusters use the same internal IPs/hostnames, so NAT is required to resolve between them.

On each node, network.publish_host and network.host are set to the same value.

Problem Description

We want to replicate data from the primary cluster to the secondary cluster.

We created appropriate users and roles for replication.

We attempted two configurations on the secondary cluster:

1. Seed mode

PUT /_cluster/settings?pretty
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "connection-to-primar" : {
          "seeds" : [
            "nat-master1-ip:9300",
            "nat-master2-ip:9300",
            "nat-master3-ip:9300"
          ],
          "transport.compress": true
        }
      }
    }
  }
}

→ Result:

"num_nodes_connected": 0,
"max_connections_per_cluster": 3

2. Proxy mode

PUT /_cluster/settings?pretty
{
  "persistent": {
    "cluster": {
      "remote": {
        "connection-to-primar-proxy": {
          "mode": "proxy",
          "proxy_address": "nat-master1-ip:9300",
          "transport.compress": true
        }
      }
    }
  }
}

→ Result:

"num_proxy_sockets_connected": 18,
"max_proxy_socket_connections": 18

With proxy mode, the connection appears to be established.
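
For reference, the two results above can be compared with a small script. This is only a sketch: it assumes the response was fetched with GET _remote/info on the secondary cluster, the function name summarize_remote_info is made up for illustration, and SAMPLE contains only the four values quoted in this post, not a full response. It infers the mode from which counters are present rather than from any other field.

```python
def summarize_remote_info(info):
    """Summarize connection counters per remote-cluster alias.

    `info` is a dict shaped like the body of GET _remote/info.
    The mode is inferred from the counter names (an assumption of this sketch).
    """
    summary = {}
    for alias, details in info.items():
        if "num_proxy_sockets_connected" in details:
            mode = "proxy"
            used = details["num_proxy_sockets_connected"]
            limit = details.get("max_proxy_socket_connections")
        else:
            mode = "sniff"  # seed-based connection
            used = details.get("num_nodes_connected", 0)
            limit = details.get("max_connections_per_cluster")
        summary[alias] = {
            "mode": mode,
            "used": used,
            "limit": limit,
            "connected": used > 0,
        }
    return summary


# Only the values reported in this post; everything else is omitted.
SAMPLE = {
    "connection-to-primar": {
        "num_nodes_connected": 0,
        "max_connections_per_cluster": 3,
    },
    "connection-to-primar-proxy": {
        "num_proxy_sockets_connected": 18,
        "max_proxy_socket_connections": 18,
    },
}

for alias, row in summarize_remote_info(SAMPLE).items():
    print(f"{alias}: mode={row['mode']} "
          f"used={row['used']}/{row['limit']} connected={row['connected']}")
```

With the values above, the seed-based alias shows 0 of 3 connections and the proxy alias shows 18 of 18 sockets.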

Replication Attempt

When we trigger replication, authentication passes, and the logs show:

  • On primary cluster:
Replication setup - Permissions validation successful for Index
  • On secondary cluster:
Failed to trigger replication for xxx-test-000006 - ResourceAlreadyExistsException[task with id {replication:index:xxx-test-000006} already exists]

However, the index is not created, and no data is being replicated.
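
For context, the replication request has roughly the following shape. This is a sketch under assumptions: it uses the _start and _status endpoints of the cross-cluster replication plugin, the helper names build_start_request and status_path are invented for illustration, and the leader index and both role names are placeholders because the real values are not given in this post. Only the follower index name and the connection alias come from the text above.

```python
import json


def build_start_request(follower_index, leader_alias, leader_index,
                        leader_role, follower_role):
    """Return (method, path, body) for a start-replication request."""
    path = f"/_plugins/_replication/{follower_index}/_start"
    body = {
        "leader_alias": leader_alias,
        "leader_index": leader_index,
        "use_roles": {
            "leader_cluster_role": leader_role,
            "follower_cluster_role": follower_role,
        },
    }
    return "PUT", path, body


def status_path(follower_index):
    """Return the path used to query replication status for an index."""
    return f"/_plugins/_replication/{follower_index}/_status"


method, path, body = build_start_request(
    follower_index="xxx-test-000006",           # from the log line above
    leader_alias="connection-to-primar-proxy",  # proxy-mode alias from this post
    leader_index="<leader-index>",              # placeholder, not given in the post
    leader_role="<leader-role>",                # placeholder role name
    follower_role="<follower-role>",            # placeholder role name
)
print(f"{method} {path}")
print(json.dumps(body, indent=2))
print(f"GET {status_path('xxx-test-000006')}")
```

The printed requests are sent to the secondary (follower) cluster; the status request is the one that should show whether a replication task for the index is already registered.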

My Main Question

Do all nodes in the secondary cluster need to have direct access to nat-master1-ip:9300 (or other seed/proxy nodes on the leader)?
Or is it sufficient that only the master nodes are connected?
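
To gather evidence for this question, reachability can be checked from each secondary node individually. The following is a minimal sketch, not a diagnosis: can_connect and check_targets are illustrative names, and a successful TCP connect only shows that the port is reachable through NAT, not that the OpenSearch transport handshake succeeds.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each "host:port" string to whether it accepted a TCP connection."""
    results = {}
    for target in targets:
        host, _, port = target.rpartition(":")
        results[target] = can_connect(host, int(port), timeout)
    return results
```

Running check_targets(["nat-master1-ip:9300", "nat-master2-ip:9300", "nat-master3-ip:9300"]) on a master node and on a hot, warm, and cold node of the secondary cluster would show which node types can actually reach the NAT addresses.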

We appreciate your help and any clarification you can provide.

Kind regards,
Vojtech

any advice?