CCR bootstrap of large clusters

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 3.2

Describe the issue: CCR bootstrap starts successfully and begins transferring shard segment files from the leader to the follower cluster.

  1. For some large shards, file transfer fails with a CorruptIndexException (or related index corruption error).
  2. The affected follower shard transitions to a failed state.
  3. CCR replication for the shard does not automatically recover and requires manual intervention.
  4. In large clusters with hundreds or thousands of shards, even a small number of shard failures can prevent successful completion of the bootstrap process.

Configuration:

Relevant Logs or Screenshots:

@gchakkalakkal1 This looks like it could be a known CCR bug (#1465, #1482). Leader-side segment reads used a Lucene IndexInput opened with IOContext.READONCE, which on newer JDKs is backed by a thread-confined memory segment. When a large shard’s transfer spans multiple chunk requests handled by different threads, the cross-thread access throws IllegalStateException: confined, which can surface as a corrupt or incomplete file. This was fixed upstream in PR #1520, merged April 2025 and included in the official 3.2.0.0 plugin release.

Two things that would help confirm:

  1. Does your actual stack trace mention IllegalStateException: confined or MemorySegmentIndexInput?
  2. Are you on the official OpenSearch distribution, or an AWS Managed Service? Managed offerings have shipped older CCR plugin builds under a given version label even after the upstream fix landed.
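
If it helps with the first question, here is a rough Python sketch for scanning node logs for the markers mentioned above. The marker strings come from this thread. The `scan_log` helper and the `[index][shard]` pattern are my own assumptions about how your log lines look, so adjust them to your format. It only picks up a shard id when it appears on the same line as a marker, so multi-line stack traces may need a manual look.

```python
import re
import sys
from collections import Counter

# Marker strings taken from this thread; extend as needed.
MARKERS = (
    "CorruptIndexException",
    "IllegalStateException: confined",
    "MemorySegmentIndexInput",
)

# Assumption: shard ids are logged as "[index-name][shard-number]".
SHARD_RE = re.compile(r"\[([^\[\]]+)\]\[(\d+)\]")


def scan_log(lines):
    """Count marker occurrences (once per line) and collect shard ids
    that appear on the same line as a marker."""
    hits = Counter()
    shards = set()
    for line in lines:
        found = [m for m in MARKERS if m in line]
        if not found:
            continue
        hits.update(found)
        match = SHARD_RE.search(line)
        if match:
            shards.add((match.group(1), int(match.group(2))))
    return hits, shards


if __name__ == "__main__":
    # Usage: python scan_ccr_logs.py /path/to/node.log [more logs...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            hits, shards = scan_log(fh)
        print(path, dict(hits), sorted(shards))
```

If the output shows `IllegalStateException: confined` or `MemorySegmentIndexInput` next to the `CorruptIndexException` hits, that would point toward the READONCE issue rather than real on-disk corruption.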