I’m trying to reindex data from Elasticsearch (ES) to OpenSearch (OS) through the reindex API. When we originally indexed the data into ES, we set custom routing values to control which shards the documents were allocated to. When reindexing the data into OS, I want to keep the same routing for the destination index. How can I achieve this through the reindex API?
The reindex API keeps the routing value by default: the routing of each source document is fetched and set on the bulk requests that write to the destination index.
However, the issue is that documents keep getting deleted from the destination index while reindexing. op_type is also set to ‘index’ in the reindex request.
How did you find out that the documents keep getting deleted? Could you show the request parameters you used when calling the reindex API?
While reindexing, I’m watching the index status through GET _cat/indices?v.
The request parameters are:
POST _reindex
{
  "source": {
    "remote": {
      "host": "<source_ip>:9200"
    },
    "index": "<index_name>"
  },
  "dest": {
    "index": "<index_name>",
    "op_type": "index"
  }
}
I think the reason you see deleted documents in the target index is that the reindex API was called multiple times, so the same document IDs were written again on each later call. That by itself does not cause data loss: if a document with the same ID already exists in the target index, it is updated; if not, it is created. Internally, an update is carried out as a delete operation followed by an index operation, which is why you see many deleted documents in the target index.
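To illustrate the explanation above, here is a toy model in Python. It is not OpenSearch code: `ToyIndex` and `reindex` are hypothetical names, and the model assumes that no segment merge has cleaned up the old copies yet. It only shows why repeated writes of the same IDs leave the live count unchanged while the deleted count grows.

```python
class ToyIndex:
    """Toy model: overwriting an existing ID marks the old copy as deleted."""

    def __init__(self):
        self.live = {}    # doc_id -> current source
        self.deleted = 0  # old copies marked deleted, not yet merged away

    def index(self, doc_id, source):
        # An update is modelled as "delete the old copy, then index the new one".
        if doc_id in self.live:
            self.deleted += 1
        self.live[doc_id] = source

    def stats(self):
        return {"docs.count": len(self.live), "docs.deleted": self.deleted}


def reindex(dest, docs):
    """Write every source document into the destination, keeping its ID."""
    for doc_id, source in docs.items():
        dest.index(doc_id, source)


source_docs = {f"doc-{i}": {"n": i} for i in range(5)}
dest = ToyIndex()
for _ in range(4):  # the same reindex executed four times
    reindex(dest, source_docs)

print(dest.stats())  # {'docs.count': 5, 'docs.deleted': 15}
```

With 5 documents and 4 runs, the live count stays at 5 and the deleted count is 5 * 3 = 15, the same pattern as in the stats reported in this thread.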
Thank you for your reply. The problem is that as the deleted doc count increases, the storage used increases significantly.
Here are the stats:
In Source Index - 397222 docs in 176.5mb
In Destination - 397222 docs with 1191666 deleted docs, all together in 1.3gb
Since the reindex API was called only once, are you suggesting that the API itself does the bulk indexing?
1191666 = 397222 * 3, so I think the reindex API was called 4 times: the first run created the documents, and each of the three later runs overwrote all of them, leaving 397222 deleted copies per run. If you were using Dev Tools in OpenSearch Dashboards, please add the parameter wait_for_completion=false when calling the reindex API. By default that API waits until the reindex process completes, but if it takes more than 30 seconds, OpenSearch Dashboards retries the request, so the API is called again.
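The arithmetic behind that estimate can be checked with a short Python sketch. The function name is hypothetical, and the estimate assumes that every run wrote the same set of document IDs and that no segment merge has removed deleted copies in the meantime.

```python
def estimated_reindex_runs(live_docs: int, deleted_docs: int) -> float:
    """Estimate how many times the same reindex ran.

    The first run creates the documents; every later run overwrites each
    of them once, adding `live_docs` deleted copies.
    """
    return 1 + deleted_docs / live_docs


# Numbers taken from the destination index stats in this thread.
print(estimated_reindex_runs(397222, 1191666))  # 4.0
```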
POST _reindex?wait_for_completion=false
When wait_for_completion is false, the reindex API returns a task ID, and you can check the progress by calling the tasks API:
GET _tasks/{taskId}
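The two calls above could also be scripted so that the reindex is submitted exactly once and then polled. The following is a minimal Python sketch using only the standard library. The function names, the polling interval, and the URLs in the usage note are assumptions for illustration, and authentication and TLS settings are left out.

```python
import json
import time
import urllib.request


def build_reindex_request(base_url, remote_host, index):
    """Return (url, body) for an asynchronous reindex from a remote cluster."""
    url = f"{base_url}/_reindex?wait_for_completion=false"
    body = {
        "source": {"remote": {"host": remote_host}, "index": index},
        "dest": {"index": index, "op_type": "index"},
    }
    return url, body


def call(method, url, body=None):
    """Send a JSON request and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def reindex_and_wait(base_url, remote_host, index, poll_seconds=10):
    """Start the reindex once, then poll the tasks API until it completes."""
    url, body = build_reindex_request(base_url, remote_host, index)
    task_id = call("POST", url, body)["task"]  # response carries the task ID
    while True:
        task = call("GET", f"{base_url}/_tasks/{task_id}")
        if task.get("completed"):
            return task
        time.sleep(poll_seconds)
```

A call such as reindex_and_wait("http://localhost:9200", "http://<source_ip>:9200", "<index_name>") would then submit the request a single time, however long the reindex takes.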