Recovering from false positive corrupted shard

Hi everyone,

My team is experiencing a situation similar to the one described here.

During bulk indexing activity, the index goes red due to an unassigned shard. In the shard's directory on disk, I find a corrupted_* marker file that prevents the shard from being assigned. However, running Lucene's CheckIndex against the shard's index reports no problems detected. After removing the corrupted_* file and restarting the cluster, the shard gets assigned and the index returns to green.
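For reference, the steps above look roughly like this. This is only a sketch: the shard path, node number, index UUID, and Lucene jar name are placeholders for my environment, and you should verify them against your own install before touching anything on disk.

```shell
# Path to the shard's Lucene index directory (placeholder values).
SHARD_DIR=/var/lib/elasticsearch/nodes/0/indices/<index-uuid>/0/index

# Verify the segments with Lucene's CheckIndex (read-only unless -exorcise
# is passed). The jar version must match the one Elasticsearch ships with.
java -cp lucene-core-<version>.jar org.apache.lucene.index.CheckIndex "$SHARD_DIR"

# If CheckIndex reports no problems, remove the corruption marker file
# that blocks allocation, then restart so the shard can be assigned.
rm "$SHARD_DIR"/corrupted_*
```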

I believe this is being caused by disk latency (the data directory is on sshfs). If anyone has experience with heavy indexing on sshfs, I'd be interested in any tips for improving performance and preventing this altogether.

I'm also wondering if there is a way to recover the index without restarting the cluster. What happens at cluster initialization that triggers the shard assignment, and can I trigger that same step without taking the cluster down?