Migration via Snapshot restore: mapping roadblocks

I have a good sized (100TB) production cluster I’m looking to transition from Elasticsearch 7.10 to OpenSearch (currently 1.1, but can easily upgrade to 1.X as necessary).
From reading the migration documents, it appears that the preferred path is to migrate from snapshots.
That seems fine for trivial indexes, but we have some indexes with complex mappings and we are running into a key problem: unavailable mapping types in OpenSearch.
A simple example is the flattened type from X-Pack, but we also have other custom mapping types that were produced from an in-house plugin. We have transitioned that mapping type plugin into a token filter plugin on the OpenSearch side. Our plan involved restoring the snapshots into a basic index on the OpenSearch cluster with no mappings, and then running a reindex operation with ingest pipelines to transform the data as appropriate for the new mapping.
However, we cannot find a way to have OpenSearch ignore the existing (unsupported) mappings and simply dump in the snapshot data. Are there any additional options that can be provided to the snapshot restore command to have it ignore index mappings and effectively just dump the raw data into an index? Is there another tool for ingesting large amounts of data from snapshots that could achieve the same effect? We have successfully done remote reindexing to test our ingest pipelines and new analyzers, but it is not fast enough to perform the entire transition.
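To be concrete, the reindex step we had planned on the OpenSearch side would look roughly like this (the index and pipeline names here are placeholders, not our real ones):

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "restored-raw-index"
  },
  "dest": {
    "index": "target-index",
    "pipeline": "transform-legacy-fields"
  }
}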

AFAIK, there is no way at the moment to have OpenSearch ignore existing mappings; I think the reasons for that are clear, and it is unlikely to change in the future. Since you already tried reindexing, the other option I could suggest is to write no-op plugins for unsupported mapping types (like flattened, for example). By no-op I mean they would do basically nothing, but at least your index would be restorable from snapshots. And since you wanted to ignore those mappings in the first place, a no-op seems fine.

Thank you for the response.
I understand your suggestion of creating noop mappers for unsupported data types, but I don’t think that is going to be a viable solution for most users (and while I technically could do that, I don’t currently have the time to write them and then redeploy my cluster with the plugins installed).
I think that for groups trying to migrate from Elasticsearch there is possibly value in having some technique to override, drop, or manipulate mappings on snapshot load (as an option, never as default behavior).
I am currently working on an experiment to reindex a prod index into empty mappings like this:

"mappings": {
      "dynamic": "false"
    }
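Spelled out, the experiment is roughly the following, with placeholder index names: create the bare index, then do a local reindex into it on the ES side.

PUT /raw-data-only-index
{
  "mappings": {
    "dynamic": "false"
  }
}

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "prod-index"
  },
  "dest": {
    "index": "raw-data-only-index"
  }
}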

I hope that this will give me an index that is basically data only. Then I am going to snapshot and restore those indexes, and then finally re-index on OpenSearch into the eventual target index.

If this process works as expected we will add a couple additional data nodes onto our prod cluster and script the reindex and snapshot process. After a successful snapshot the no-mapping indexes will be deleted.
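The snapshot and restore steps themselves are just the standard APIs, something like this (repository and snapshot names are placeholders), first on the ES cluster:

PUT _snapshot/migration-repo/raw-data-snapshot-1?wait_for_completion=false
{
  "indices": "raw-data-only-index",
  "include_global_state": false
}

and then on the OpenSearch cluster:

POST _snapshot/migration-repo/raw-data-snapshot-1/_restore
{
  "indices": "raw-data-only-index",
  "include_global_state": false
}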

From my perspective, this process would be greatly simplified if OpenSearch implemented something like a "data-only" restore, where the data from the snapshot is restored with no mappings, leaving you free to add mappings after the fact and do an update-by-query or reindex into your eventual target index. Even if this endpoint lived outside the traditional restore API, as part of some sort of data migration API, it would still have utility for orgs trying to transition off of Elasticsearch.

Understood, thank you for the additional details. I am curious why you consider ES → OS reindexing (cluster to cluster) to be expensive (or rather, unsuitable for you), but at the same time you are still doing reindexing + manipulating mappings? You have full control of what and how to reindex, which mappings to ignore, etc.

I’m not clear on your question.
ES → OS remote reindex is simply far too slow. It would be my preferred solution, but it isn’t nearly performant enough. That said, it is our solution for indices with under about 20 million docs. For our multi-billion document indexes I am hoping that snapshot and restore will give us reasonable performance. Given the performance of previous snapshot operations, I’m more optimistic there.

As an example, here is the status of a remote reindex command I kicked off last week on a portion of one prod index:

{
  "completed" : false,
  "task" : {
    "node" : "CddXr9woTuuiGSqFJgihTA",
    "id" : 587490700,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 722027082,
      "updated" : 1259,
      "created" : 253647298,
      "deleted" : 0,
      "batches" : 253788,
      "version_conflicts" : 139443,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : """reindex from [scheme=https host=172.21.1.15 port=9200 query={
  "range" : {
    "XXXXXXXXX" : {
      "lt" : "2021-07-01T00:00:00.000Z",
      "gte" : "2021-01-01T00:00:00.000Z"
    }
  }
} username=XXXXXXX password=<<>>][source index] to [dest index][_doc]""",
    "start_time_in_millis" : 1638554932989,
    "running_time_in_nanos" : 534973482354432,
    "cancellable" : true,
    "headers" : { }
  }
}
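(For reference, that output is from polling the tasks API; since the reindex was started without waiting for completion, the returned task can be checked using the node and task ID shown above:)

GET _tasks/CddXr9woTuuiGSqFJgihTA:587490700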

I have a dozen or so indexes about that size that need to be transitioned. We could explore doing it in parallel in much smaller portions. I worry about increasing the scroll batch size on the remote reindex too much, for fear of a single batch of large documents exceeding the 100 MB on-heap buffer of the remote reindex client, which would leave me stuck re-running an entire batch.
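For what it's worth, the batch size for a remote reindex can be set per request via source.size (it defaults to 1000 documents), so the smaller-portion approach would look something like this (host, credentials, field, and index names are placeholders):

POST _reindex?wait_for_completion=false
{
  "source": {
    "remote": {
      "host": "https://es-cluster:9200",
      "username": "XXXXXXX",
      "password": "XXXXXXX"
    },
    "index": "source-index",
    "size": 200,
    "query": {
      "range": {
        "timestamp_field": {
          "gte": "2021-01-01T00:00:00.000Z",
          "lt": "2021-02-01T00:00:00.000Z"
        }
      }
    }
  },
  "dest": {
    "index": "dest-index"
  }
}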

So far, my reindex operation has been going for about 9 hours and should finish in another hour or two. Then the snapshot will take another 3 or 4 hours. Restore may take a little longer than the snapshot since the target cluster is currently smaller as we are still in testing. After that, the local reindex in OS may take 12 hours. Even if those times all hold, I will have moved 2.8bn documents in around 28 - 30 hours, which appears to be much faster than the rate I’m getting with remote reindex.

I’m open to other suggestions. Moving 100tb of data is never easy.

I will say, the other thing I have considered is an Apache Beam pipeline (we are in GCP, so we can use Google Dataflow to run it). I initially had some issues getting the Beam ES plugin to authenticate against my OS cluster, but I haven't revisited that in months. That could be another possible scalable solution.

Got it now, thank you. I have not used Apache Beam; it is worth trying, I believe. But as another option, would you consider doing local reindexing on the ES cluster? Basically, the steps would be: local reindex (filtering mappings) → snapshot → restore on the OS cluster (not sure if you have enough capacity there).

Your last suggestion is what I did. That method proved to be quite fast for the scale I’m working with. We have to manage how many reindexes we do simultaneously on the prod cluster to not impact performance or disk space, but for now that is likely going to be our most effective method of transferring the data.
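One additional knob for keeping the impact on the prod cluster manageable, besides limiting how many reindexes run at once, is the reindex throttle: a request can be started with a rate cap and re-throttled later. The index names, task ID, and rates below are just examples:

POST _reindex?wait_for_completion=false&requests_per_second=500
{
  "source": {
    "index": "prod-index"
  },
  "dest": {
    "index": "raw-data-only-index"
  }
}

POST _reindex/CddXr9woTuuiGSqFJgihTA:587490700/_rethrottle?requests_per_second=-1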

I would be able to save a lot of time if OpenSearch had a way to drop the mapping on the incoming index. Then I could restore directly from our existing hourly snapshot and do the expected reindex. This would cut 12 - 16 hours off of the total time and remove all impacts to my production systems.

The X-Pack features / mappings are a recurring issue; supporting some of them (the most widely used) would probably cover 90% of migration issues. On the other hand, this is a one-off operation, though indeed a very time-consuming one in your case.

Indeed. There are some difficulties in my migration that are self-induced. I know that for us, losing the flattened type was the biggest topic of consideration in our transition. Since we had only started implementing flattened types within our production Elasticsearch environment, it was a tradeoff we could make, but the decision would have been a lot easier if some of the mappings provided in the basic X-Pack license were available in OpenSearch.

Support for flattened has been an open request for a while [1], but it seems like there has been no progress so far; I will try to figure out what has happened with the plugin [2].

[1] Add flattened field type · Issue #1018 · opensearch-project/OpenSearch (https://github.com/opensearch-project/OpenSearch/issues/1018)
[2] https://github.com/aparo/opensearch-flattened-mapper-plugin

For anyone who finds this thread: I just realized there is a way to achieve what I needed (though only after doing everything the slow / painful way).
If you are able to install X-Pack Basic on your Elasticsearch cluster, you can create a source_only repository and snapshot into it. This saves the raw data with no index mapping, which lets you get a copy of the indexes into OpenSearch without first re-indexing them on Elasticsearch to strip the mapping. You will still need to do a re-index in OpenSearch, but it saves some time during the data transfer process:
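Registering the source-only repository on the ES side looks roughly like this (the repository name and location are placeholders; the delegate type depends on where your snapshots live):

PUT _snapshot/source-only-repo
{
  "type": "source",
  "settings": {
    "delegate_type": "fs",
    "location": "/mount/backups/source-only"
  }
}

Snapshots taken into that repository contain only the raw _source and index metadata, which is why the re-index on the OpenSearch side is still required after the restore.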