Reindex job failing with search phase execution exception

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 1.0

Describe the issue:

Our reindex job is failing with search_phase_execution_exception. We tried decreasing the batch size from the default of 1000 to 100 and still see the same issue. Are there any other options to try?

Configuration:

Relevant Logs or Screenshots:

Can you share more information about your problem, such as the full log of the search_phase_execution_exception and the reindex parameters?

The API call is very simple, as shown below. We are running 10 parallel reindex jobs, and under load we hit this error.

POST /_reindex
{
   "source":{
      "index":"sourceIndex",
      "size": 100
   },
   "dest":{
      "index":"destIndex"
   }
}

Below is the response from the tasks API for that reindex operation:

{
  "completed": true,
  "task": {
    "node": "abc",
    "id": 182462425,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 1629142,
      "updated": 0,
      "created": 128300,
      "deleted": 0,
      "batches": 1283,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1.0,
      "throttled_until_millis": 0
    },
    "description": "reindex from [sourceIndex] to [destIndex1][_doc]",
    "start_time_in_millis": 1694273514552,
    "running_time_in_nanos": 1339667737279,
    "cancellable": true,
    "headers": {}
  },
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": -1,
        "index": null,
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [61314384]"
        }
      }
    ],
    "caused_by": {
      "type": "search_context_missing_exception",
      "reason": "No search context found for id [61314384]"
    }
  }
}

This likely means that the underlying scroll expired. You could increase the scroll timeout (which really applies to processing each page), but the default scroll should be 5m. So I'm assuming your OpenSearch cluster can't keep up with the load and can't process a page in 5 minutes…

Maybe you can run fewer reindexing jobs in parallel?
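
Before cutting back, it can help to see how many reindex jobs are actually running at the same time. A minimal sketch, assuming the tasks API on your cluster supports the usual actions and detailed filters:

```
GET _tasks?detailed=true&actions=*reindex
```

Each entry in the response should carry the same status block as the task output above (total, created, batches), which gives a rough idea of how fast each job is progressing under load.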

@makam.sreekanth

As @radu.gheorghe said,

when you create heavy reindexing tasks like:

POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": ["abc.prd.reindex_2020"],
    "size": 100
  },
  "dest": {
    "index": "search.abc.prd",
    "version_type": "external"
  },
  "script": {
    "source": """
      ctx._source.DELETE_YN= 'N';
    """,
    "lang": "painless"
  }
}

you should add the scroll query parameter.
Increase it from the default interval (5m) to a larger one (for example 1d), or reduce the batch size (size under source).
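
For example, applied to the simple request from the original question, it might look like the sketch below. The index names are the placeholders used earlier in this thread, and 1d is only an illustrative value; pick whatever comfortably covers the time your cluster needs to process one batch.

```
POST _reindex?scroll=1d&wait_for_completion=false
{
  "source": {
    "index": "sourceIndex",
    "size": 100
  },
  "dest": {
    "index": "destIndex"
  }
}
```

Keep in mind that a longer scroll keeps the search context open longer on the data nodes, so it treats the symptom. If the cluster is overloaded, running fewer jobs in parallel is still worth trying.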