Trying to reindex overallocated shards - getting timeouts/errors for put-mapping

Hi everyone! I am trying to reindex the indexes in my cluster that are over- or under-allocated in terms of shards.

For background: we run a large cluster (3 master nodes, 20 hot nodes, and 20 warm nodes) and currently have too many shards. We have 47,000 and are trying to get down to 30,000. This is likely contributing to the problem I’ll describe below, but I am hoping to get some workarounds and suggestions from the community.

I have a program that identifies indexes whose shards hold more than 50 GB of data OR less than 30 GB, and then reindexes each one into a new index with the proper number of primary shards. All of these indexes are closed/"warm" indexes, so no data is being written to them.

For example, if I have a 120 GB index with 2 primary shards (and 1 replica), my program would create a new index composed of 3 primary shards and then start the reindexing process:

```json
{
  "source": {
    "index": "overallocated_index"
  },
  "dest": {
    "index": "overallocated_index_updated_3"
  }
}
```
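To make the selection concrete, here is a minimal sketch of the logic. The ~40 GB per-shard target (the midpoint of the 30–50 GB band) and the function names are illustrative, not our exact code:

```python
# Hypothetical sketch of the shard-count selection described above.
# Assumes a ~40 GB per-shard target and our cap of 20 primary shards.
import math

MIN_SHARD_GB = 30
MAX_SHARD_GB = 50
TARGET_SHARD_GB = 40
MAX_SHARDS = 20


def needs_reindex(primary_size_gb: float, primary_shards: int) -> bool:
    """True if the average primary shard falls outside the 30-50 GB band."""
    per_shard = primary_size_gb / primary_shards
    return per_shard > MAX_SHARD_GB or per_shard < MIN_SHARD_GB


def target_primary_shards(primary_size_gb: float) -> int:
    """Primary shard count aiming at ~40 GB per shard, clamped to 1..20."""
    shards = math.ceil(primary_size_gb / TARGET_SHARD_GB)
    return max(1, min(MAX_SHARDS, shards))
```

For the 120 GB example above: each of the 2 primary shards averages 60 GB, which is over the 50 GB limit, and the target comes out to 3 primary shards.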

I have created templates that match on the "_updated_3" suffix of the index name, so the new index is created with the corresponding number of shards. There are individual templates for 1 to 20 shards (20 being the maximum number of shards we put in an index).
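As a sketch, the 3-shard template looks roughly like this (the template name and pattern here are approximations of what we actually use, and the full ECS mappings are omitted):

```json
PUT _template/updated_3_shards
{
  "index_patterns": ["*_updated_3"],
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  }
}
```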

After running "POST _reindex?wait_for_completion=false" I receive the task id back. Inevitably, when I check on the task, the reindexing job has errored out with the following errors. This output was gathered from "GET /_tasks/":

```json
{
  "completed" : true,
  "task" : {
    "node" : "rn2QEhQ1SLu5y4Pfu-JEfw",
    "id" : 5833509,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 67226465,
      "updated" : 0,
      "created" : 974,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : { "bulk" : 0, "search" : 0 },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : "reindex from [overallocated_index] to [overallocated_index_updated_3][_doc]",
    "start_time_in_millis" : 1611269643243,
    "running_time_in_nanos" : 121249761133,
    "cancellable" : true,
    "headers" : { }
  },
  "response" : {
    "took" : 121248,
    "timed_out" : false,
    "total" : 67226465,
    "updated" : 0,
    "created" : 974,
    "deleted" : 0,
    "batches" : 1,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : { "bulk" : 0, "search" : 0 },
    "throttled" : "0s",
    "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s",
    "throttled_until_millis" : 0,
    "failures" : [
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "JFcxHnYBaVD6sg8NSTlr", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "JVcxHnYBaVD6sg8NSTlr", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "2GIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "2WIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "2mIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "22IxHnYB7UG0O9MhS2Si", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "3GIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "3WIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "3mIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "32IxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "4GIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "4WIxHnYB7UG0O9MhS2Si", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "TYExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "ToExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "T4ExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "UIExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "UYExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "UoExHnYBJlUuzeVeUbYB", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "oYExHnYBJlUuzeVeU8P3", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "hVIxHnYBH6cQ6haxU0ve", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "hlIxHnYBH6cQ6haxU0ve", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "LXoxHnYB013kt2C3V2V1", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "CXMrHnYBJlUuzeVexpdo", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "g4UrHnYByqH8kh7lx03m", "cause" : { "type" : "process_cluster_event_timeout_exception", "reason" : "failed to process cluster event (put-mapping [overallocated_index_updated_3/qebVLPN3TseYaRVxMfScjQ]) within 30s" }, "status" : 503 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "hIUrHnYByqH8kh7lx03m", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 },
      { "index" : "overallocated_index_updated_3", "type" : "_doc", "id" : "p80rHnYBlfgQDqa5y-cG", "cause" : { "type" : "mapper_exception", "reason" : "timed out while waiting for a dynamic mapping update" }, "status" : 500 }
    ]
  }
}
```

Even though there are a bunch of 500 errors, a few hundred thousand docs still get reindexed into the new index, though that is only a small portion of the documents in the original index.
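For completeness, I check on the task roughly like this, using the node:id pair returned by the reindex call (the pair shown matches the output above):

```json
GET _tasks/rn2QEhQ1SLu5y4Pfu-JEfw:5833509
```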

I thought the errors might be related to the index not being created until I run the POST _reindex command (i.e. the template having to be applied dynamically at index creation). But I ruled this out by creating the new index well in advance of the reindexing job, meaning the template had already been applied to the new index.
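By "creating the new index well in advance" I mean something like the following, run before any reindex job, so the matching template is applied at creation time (the mapping check is just to confirm the template took effect):

```json
PUT overallocated_index_updated_3

GET overallocated_index_updated_3/_mapping
```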

The template is based on the Elastic Common Schema, so it is too large to paste here in full, but here is the basic gist of the settings applied in our templates:

```json
{
  "settings" : {
    "index" : {
      "mapping" : { "total_fields" : { "limit" : 10000 } },
      "routing.allocation.require.box_type" : "warm",
      "codec" : "best_compression",
      "refresh_interval" : "5s",
      "number_of_shards" : "3",
      "number_of_replicas" : "1",
      "unassigned" : { "node_left" : { "delayed_timeout" : "15m" } }
    }
  }
}
```

Can anyone clarify whether what is happening here is that documents are timing out while being added to the new index (given the errors about dynamic mapping updates)? It is confusing to me, since the documents being reindexed were already in an index with the same template applied (with the exception of the number of primary shards).

If anyone has another solution, I am all ears. We could potentially also restore from a snapshot into an index with a different name.