Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): 3.2
Describe the issue: CCR bootstrap starts successfully and begins transferring shard segment files from the leader to the follower cluster.
- For some large shards, file transfer fails with a CorruptIndexException (or a related index corruption error).
- The affected follower shard transitions to a failed state.
- CCR replication for the shard does not automatically recover and requires manual intervention.
- In large clusters with hundreds or thousands of shards, even a small number of shard failures can prevent successful completion of the bootstrap process.
Configuration:
Relevant Logs or Screenshots:
@gchakkalakkal1 This looks like it could be a known CCR bug (#1465, #1482): leader-side segment reads used a Lucene IndexInput opened with IOContext.READONCE, which on newer JDKs is backed by a thread-confined memory segment. When a large shard’s transfer spans multiple chunk requests handled by different threads, reading that IndexInput from a thread other than the one that opened it throws IllegalStateException: confined, which can surface on the follower as a corrupt or incomplete file. This was fixed upstream in PR #1520, merged April 2025 and included in the official 3.2.0.0 plugin release.
Two things that would help confirm:
- Does your actual stack trace mention IllegalStateException: confined or MemorySegmentIndexInput?
- Are you on the official OpenSearch distribution, or an AWS Managed Service? Managed offerings have shipped older CCR plugin builds under a given version label even after the upstream fix landed.
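If it is easier than reading the traces by eye, the first check can be scripted. The following is only a rough sketch: the marker strings are the ones mentioned above, while the function names and the log path in the usage note are hypothetical, and your log lines may be worded differently.

```python
import re
from collections import Counter

# Marker strings taken from the reply above. The exact wording in a real
# stack trace may differ, so treat these patterns as a starting point.
MARKERS = {
    "confined": re.compile(r"IllegalStateException.*confined"),
    "memory_segment": re.compile(r"MemorySegmentIndexInput"),
    "corrupt_index": re.compile(r"CorruptIndexException"),
}


def count_markers(lines):
    """Count how many lines match each marker pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_confined_bug(counts):
    """True if either thread-confinement marker appeared at least once."""
    return counts["confined"] > 0 or counts["memory_segment"] > 0
```

To use it, pass an open log file from a leader node, for example `count_markers(open("opensearch.log"))` (the path is a placeholder). A result with CorruptIndexException hits but no confinement markers would point away from this particular bug. For the second question, `GET _cat/plugins?v` lists the plugin versions each node reports, which helps compare the installed CCR build against the upstream release.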