Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch v2.13.
Describe the issue:
We are currently running two OpenSearch clusters on the same Kubernetes (AKS) cluster. The first cluster, with three master nodes and two data nodes, works perfectly. The other cluster, which has three master nodes and one data node, works fine at first, but after some hours of a reindex operation its nodes crash with this error:
[2024-05-10T16:05:09,493][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-ops-master-1] fatal error in thread [opensearch[opensearch-cluster-ops-master-1][warmer][T#69]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
at org.apache.lucene.codecs.lucene90.IndexedDISI.advanceBlock(IndexedDISI.java:486) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.codecs.lucene90.IndexedDISI.advance(IndexedDISI.java:443) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.codecs.lucene90.IndexedDISI.nextDoc(IndexedDISI.java:531) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$SparseNumericDocValues.nextDoc(Lucene90DocValuesProducer.java:458) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.util.BitSet.or(BitSet.java:110) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.util.FixedBitSet.or(FixedBitSet.java:326) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.apache.lucene.util.BitSet.of(BitSet.java:42) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
at org.opensearch.index.cache.bitset.BitsetFilterCache.bitsetFromQuery(BitsetFilterCache.java:127) ~[opensearch-2.13.0.jar:2.13.0]
at org.opensearch.index.cache.bitset.BitsetFilterCache.lambda$getAndLoadIfNotPresent$1(BitsetFilterCache.java:173) ~[opensearch-2.13.0.jar:2.13.0]
fatal error in thread [opensearch[opensearch-cluster-ops-master-1][warmer][T#69]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
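Since only some of the nodes crash, it may help to tally which node and thread pool each fatal error comes from. This is a minimal sketch, assuming the pod logs are available as plain text lines in the same format as the log line above; the function name `count_fatal` and the regular expression are illustrative, not part of OpenSearch.

```python
import re
from collections import Counter

# Matches the "fatal error in thread" line format seen in the log above.
# Assumption: other nodes log the same layout (timestamp, level, logger, node).
FATAL_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[ERROR\]\[[^\]]+\] \[(?P<node>[^\]]+)\] "
    r"fatal error in thread \[opensearch\[[^\]]+\]\[(?P<pool>[^\]]+)\]\[T#\d+\]\]"
)


def count_fatal(lines):
    """Count fatal errors per (node name, thread pool) pair."""
    counts = Counter()
    for line in lines:
        match = FATAL_RE.match(line)
        if match:
            counts[(match.group("node"), match.group("pool"))] += 1
    return counts


sample = (
    "[2024-05-10T16:05:09,493][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] "
    "[opensearch-cluster-ops-master-1] fatal error in thread "
    "[opensearch[opensearch-cluster-ops-master-1][warmer][T#69]], exiting"
)
for (node, pool), n in count_fatal([sample]).items():
    print(f"{node} pool={pool} fatal_errors={n}")
```

Feeding it the log lines of every pod shows whether the crashes are confined to one node or to the warmer thread pool.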
Configuration:
Both clusters use the same Helm chart, so I don’t know why the nodes are crashing after X hours. The difference is that the bad cluster has half the resources of the good one. The bad cluster has:
resources:
  requests:
    cpu: 4
    memory: 10Gi
  limits:
    cpu: 4
    memory: 10Gi
We are reindexing 20M in 6 days, and some of the nodes crash during that time. There is plenty of free space left on the volumes.
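For a sense of the load, the figures above can be turned into an average rate. This is only a rough sketch: it assumes "20M" means 20 million documents (the unit is not stated above) and that the reindex runs evenly over the 6 days.

```python
# Assumption: "20M" is 20 million documents; the unit is not stated in the post.
TOTAL_DOCS = 20_000_000
DAYS = 6


def docs_per_second(total_docs: int, days: float) -> float:
    """Average documents per second for a job spread evenly over `days`."""
    seconds = days * 24 * 60 * 60
    return total_docs / seconds


rate = docs_per_second(TOTAL_DOCS, DAYS)
print(f"{rate:.1f} docs/s")  # 38.6 docs/s
```

Under those assumptions the average rate is modest, so the crash is probably not explained by raw indexing throughput alone.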
Relevant Logs or Screenshots:
Same log as in the issue description above (fatal error in the [warmer] thread on opensearch-cluster-ops-master-1).