Migration via snapshot restore: mapping roadblocks

I’m not clear on your question.
ES → OS remote reindex is simply far too slow at our scale. It would be my preferred solution, but it isn't nearly performant enough. That said, it is our solution for indices with fewer than about 20 million docs. For our multi-billion-document indices I am hoping that snapshot and restore will give us reasonable performance; given how previous snapshot operations have performed, I'm more optimistic there.
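For concreteness, the snapshot-and-restore hop looks roughly like the request bodies below. This is a sketch, not our actual config: the repository, bucket, and index names are placeholders, and the repo type is `gcs` only because we are in GCP. Both clusters would register the same repository (read-only on the OpenSearch side is safest).

```python
import json

# PUT _snapshot/migration_repo  (registered on both clusters)
repo_body = {
    "type": "gcs",  # repository-gcs plugin; placeholder bucket/path below
    "settings": {"bucket": "my-migration-bucket", "base_path": "es-to-os"},
}

# PUT _snapshot/migration_repo/snap_1?wait_for_completion=false  (on the ES side)
snapshot_body = {
    "indices": "my_big_index",        # placeholder index name
    "include_global_state": False,    # keep cluster state out of the snapshot
}

# POST _snapshot/migration_repo/snap_1/_restore  (on the OS side)
restore_body = {
    "indices": "my_big_index",
    "include_global_state": False,
}

for name, body in [("repo", repo_body), ("snapshot", snapshot_body), ("restore", restore_body)]:
    print(name, json.dumps(body))
```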

As an example, here is the status of a remote reindex command I kicked off last week on a portion of one prod index:

{
  "completed" : false,
  "task" : {
    "node" : "CddXr9woTuuiGSqFJgihTA",
    "id" : 587490700,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 722027082,
      "updated" : 1259,
      "created" : 253647298,
      "deleted" : 0,
      "batches" : 253788,
      "version_conflicts" : 139443,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : """reindex from [scheme=https host=172.21.1.15 port=9200 query={
  "range" : {
    "XXXXXXXXX" : {
      "lt" : "2021-07-01T00:00:00.000Z",
      "gte" : "2021-01-01T00:00:00.000Z"
    }
  }
} username=XXXXXXX password=<<>>][source index] to [dest index][_doc]""",
    "start_time_in_millis" : 1638554932989,
    "running_time_in_nanos" : 534973482354432,
    "cancellable" : true,
    "headers" : { }
  }
}
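To put numbers on "too slow": pulling the figures straight out of that status (nothing else assumed), the back-of-the-envelope rate works out like this.

```python
# Figures copied from the task status JSON above.
created = 253_647_298
updated = 1_259
version_conflicts = 139_443
total = 722_027_082
running_time_s = 534_973_482_354_432 / 1e9   # running_time_in_nanos -> seconds

processed = created + updated + version_conflicts
rate = processed / running_time_s            # docs per second
remaining_s = (total - processed) / rate     # time left at the current rate

print(f"elapsed: {running_time_s / 86_400:.1f} days")
print(f"rate: {rate:.0f} docs/s")
print(f"remaining: {remaining_s / 86_400:.1f} days")
```

That is roughly 6.2 days elapsed at ~474 docs/s, with ~11.4 more days to go, for one date-range slice of one index.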

I have a dozen or so indexes about that size that need to be transitioned. We could explore doing it in parallel in much smaller portions, but I worry about increasing the scroll size on the remote reindex too far: a single batch of large documents could exceed the 100 MB heap of the reindex client, and I would then be stuck re-running an entire batch.
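Splitting into smaller parallel portions would just mean chopping the date range from the query above into windows and running one remote reindex per window. A sketch of generating those windows, with a placeholder field name and an illustrative two-week window size:

```python
from datetime import datetime, timedelta

def date_slices(start: datetime, end: datetime, days: int):
    """Yield range-query bodies covering [start, end) in `days`-sized windows."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield {
            "range": {
                "timestamp_field": {          # placeholder field name
                    "gte": cur.isoformat() + "Z",
                    "lt": nxt.isoformat() + "Z",
                }
            }
        }
        cur = nxt

# The H1 2021 range from the reindex description above, in 14-day windows.
slices = list(date_slices(datetime(2021, 1, 1), datetime(2021, 7, 1), days=14))
print(len(slices))
```

Each window keeps a failed batch cheap to re-run, at the cost of managing more concurrent reindex tasks.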

So far, my reindex operation has been going for about 9 hours and should finish in another hour or two. The snapshot will then take another 3 or 4 hours. Restore may take a little longer than the snapshot since the target cluster is currently smaller, as we are still in testing. After that, the local reindex in OS may take 12 hours. Even if those times all hold, I will have moved 2.8bn documents in around 28-30 hours, which is much faster than the rate I'm getting with remote reindex.
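The effective-throughput comparison, taking the upper end of those estimates (~30 hours end to end) against the remote-reindex rate implied by the task status earlier:

```python
docs = 2_800_000_000
snapshot_path_hours = 30                     # reindex + snapshot + restore + OS reindex
snapshot_rate = docs / (snapshot_path_hours * 3600)

# Remote reindex, from the task status: ~254M docs in ~535,000 s of running time.
remote_rate = 253_788_000 / 534_973

print(f"snapshot path:  ~{snapshot_rate:,.0f} docs/s")
print(f"remote reindex: ~{remote_rate:,.0f} docs/s")
print(f"speedup:        ~{snapshot_rate / remote_rate:.0f}x")
```

Roughly 26,000 docs/s versus under 500 docs/s, i.e. on the order of 50x, which is why the snapshot path is worth the extra moving parts.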

I’m open to other suggestions. Moving 100 TB of data is never easy.

I will say, the other thing I have considered is an Apache Beam pipeline (we are in GCP, so we can run it on Google Dataflow). I initially hit some authentication issues getting the Beam Elasticsearch connector to authenticate against my OS cluster, but I haven’t revisited that in months. That could be another scalable option.