Recovering from a false-positive corrupted shard

Hi everyone,

My team is experiencing a situation similar to the one described here.

During bulk indexing, the index goes red because of an unassigned shard. At the shard's location on disk, I find a corrupted_* file that prevents the shard from being assigned. However, running CheckIndex on the Lucene index reports no problems. After I remove the corrupted_* file and restart the cluster, the shard is assigned and the index returns to green.
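In case it helps anyone reproduce this, here is a minimal Python sketch of how the marker files can be located before deciding whether to remove them. The function name find_corruption_markers and the idea of scanning the whole data directory recursively are my own assumptions, not part of any official tooling. It only lists files and deletes nothing.

```python
from pathlib import Path


def find_corruption_markers(data_root):
    """Return every regular file named corrupted_* below data_root.

    data_root is assumed to be the node's data directory; the exact
    layout underneath it varies by version, so we simply search
    recursively instead of hard-coding a shard path.
    """
    root = Path(data_root)
    # rglob walks all subdirectories; is_file() skips directories
    # that happen to match the pattern.
    return sorted(p for p in root.rglob("corrupted_*") if p.is_file())
```

Calling find_corruption_markers with the node's data path returns a sorted list of Path objects, one per marker file, so you can see which shard directories are affected before touching anything.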

I believe this is caused by disk latency (sshfs). If anyone has experience with heavy indexing on sshfs, I’d be interested to hear any tips for improving performance and preventing this altogether.

Is there a way to recover this index without restarting the cluster? What happens at cluster initialization that triggers the assignment? Can I trigger it myself without taking the cluster down?

Thanks,
Derek

Hi,

I am encountering the same issue, but without using sshfs. I have a single-node OpenSearch cluster (for Graylog) that runs in a Docker container. Checking the red index with Lucene's CheckIndex reports a clean index, and removing the corrupted_* file and restarting the cluster works for me too. I have now hit this issue three times in the last two weeks, although I never saw it during the previous six months of running Graylog. Could this be a bug introduced in the last update?

Cheers,
Chris

If anyone knows how to downgrade an OpenSearch cluster, I could test this. I made a quick attempt by switching back to the old version, but it did not work.