We tried migrating older log indices from servers with SSDs to servers with HDDs, but the HDD nodes kept falling out of the cluster.
[o.o.c.c.C.CoordinatorPublication] [osmanager-0000] after [10s] publication of cluster state version [933855] is still waiting for {oslts-0000}
Would moving the state directory off the HDD array and onto the local NVMe boot drive help avoid this?
I could just try it myself, but I’d like to get a second opinion first. I don’t have a test cluster with enough activity to properly test the outcome and I’d like to avoid another extended period of instability.
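@reshippie The cluster state publication timeout you’re seeing (still waiting for {oslts-0000}) is a classic symptom of high write latency on the HDDs. Each node has to write the new cluster state to disk and fsync it before acknowledging the publication, so slow fsyncs delay the acknowledgement, and a node that lags for too long gets removed from the cluster. Moving nodes/0/_state to NVMe should resolve this.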
I tested locally and was able to get it up and running using the following:
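# stop OpenSearch on this node before moving anything
mv /data/nodes/0/_state /mnt/nvme/opensearch-state # relocate the state directory to NVMe
mkdir /data/nodes/0/_state # recreate the original path as the mount point
mount --bind /mnt/nvme/opensearch-state /data/nodes/0/_state # expose the NVMe copy at the original path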
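For anyone repeating the steps above, here is a minimal sketch of a check that the bind mount is really in place before restarting the node. The function name check_state_mount is hypothetical, and the sketch assumes the mountpoint utility (from util-linux) is installed. It also flags a symlink at that path, since the symlink approach is reported in this thread as not working.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of OpenSearch: report whether DIR is a real
# mount point. Prints one of: symlink, mounted, not-mounted.
check_state_mount() {
  local dir="$1"
  # A symlink at this path is flagged separately (reported as not working).
  if [ -L "$dir" ]; then
    echo "symlink"
    return 1
  fi
  # mountpoint -q exits 0 only if DIR is a mount point (assumes util-linux).
  if mountpoint -q "$dir"; then
    echo "mounted"
    return 0
  fi
  echo "not-mounted"
  return 1
}

# Example, using the path from the commands above:
# check_state_mount /data/nodes/0/_state
```

Once the bind mount is active, the example call should print mounted. Note that a bind mount created with the mount command does not survive a reboot, so a matching /etc/fstab entry (source path, target path, type none, option bind) would be needed to make it permanent.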
If this doesn’t work for you, can you confirm which OS version you are using and how the cluster is deployed?
Thanks for the confirmation.
I was also able to verify that bind-mounting the directory works, and that using a symlink doesn’t.
I don’t have time this week to do any real performance testing to validate that it will actually keep the servers in the cluster, but I’m cautiously optimistic.
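Short of a real performance test, one rough sanity check is to time a burst of small synchronous writes on each disk and compare the results. This is only a sketch: probe_sync_writes is a hypothetical helper, it assumes GNU dd (for oflag=dsync), and it is a crude proxy for fsync-heavy state writes rather than a reproduction of what OpenSearch actually does.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of OpenSearch: perform COUNT small synchronous
# writes in DIR and print dd's summary line (bytes, elapsed time, throughput).
# Assumes GNU dd, which supports oflag=dsync.
probe_sync_writes() {
  local dir="$1" count="${2:-200}" tmp out rc
  tmp="$(mktemp "$dir/.syncprobe.XXXXXX")" || return 1
  # oflag=dsync makes each 4 KiB write wait until the data is on stable storage.
  out="$(LC_ALL=C dd if=/dev/zero of="$tmp" bs=4k count="$count" oflag=dsync 2>&1)"
  rc=$?
  rm -f "$tmp"                        # always clean up the probe file
  printf '%s\n' "$out" | tail -n 1    # keep only dd's summary line
  return "$rc"
}

# Example: compare a directory on the HDD array with one on the NVMe drive.
# probe_sync_writes /data
# probe_sync_writes /mnt/nvme
```

Running it once against a directory on the HDD array and once against the NVMe drive, then comparing the elapsed times, gives a rough idea of how large the gap in synchronous write latency is.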