java.lang.InternalError: a fault occurred in an unsafe memory access operation

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch v2.13.

Describe the issue:
We are currently running two OpenSearch clusters on the same Kubernetes (AKS) cluster. The first cluster, with three master nodes and two data nodes, works perfectly. The second cluster, with three master nodes and one data node, works fine at first, but after a few hours of a reindex operation the nodes crash with the error:

[2024-05-10T16:05:09,493][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-cluster-ops-master-1] fatal error in thread [opensearch[opensearch-cluster-ops-master-1][warmer][T#69]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
	at org.apache.lucene.codecs.lucene90.IndexedDISI.advanceBlock(IndexedDISI.java:486) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.codecs.lucene90.IndexedDISI.advance(IndexedDISI.java:443) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.codecs.lucene90.IndexedDISI.nextDoc(IndexedDISI.java:531) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$SparseNumericDocValues.nextDoc(Lucene90DocValuesProducer.java:458) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.util.BitSet.or(BitSet.java:110) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.util.FixedBitSet.or(FixedBitSet.java:326) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.apache.lucene.util.BitSet.of(BitSet.java:42) ~[lucene-core-9.10.0.jar:9.10.0 695c0ac84508438302cd346a812cfa2fdc5a10df - 2024-02-14 16:48:06]
	at org.opensearch.index.cache.bitset.BitsetFilterCache.bitsetFromQuery(BitsetFilterCache.java:127) ~[opensearch-2.13.0.jar:2.13.0]
	at org.opensearch.index.cache.bitset.BitsetFilterCache.lambda$getAndLoadIfNotPresent$1(BitsetFilterCache.java:173) ~[opensearch-2.13.0.jar:2.13.0]
fatal error in thread [opensearch[opensearch-cluster-ops-master-1][warmer][T#69]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation

Configuration:

Both clusters use the same Helm chart, so I don’t know why the nodes crash after some hours. The only difference is that the failing cluster has half the resources of the healthy one:

resources:
  requests:
    cpu: 4
    memory: 10Gi
  limits:
    cpu: 4
    memory: 10Gi

We are reindexing about 20M documents over 6 days, and some of the nodes crash during the process. There is plenty of free space left on the volumes.
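For reference, this is roughly the shape of the kind of _reindex call involved, with slicing and throttling that can be tuned if this turns out to be a resource problem (index names, batch size and throttle values below are placeholders, not our real ones):

# sketch only: adjust names and numbers; returns a task id because wait_for_completion=false
curl -s -X POST "localhost:9200/_reindex?slices=auto&requests_per_second=500&wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "source": { "index": "source-index-v1", "size": 1000 },
  "dest":   { "index": "dest-index-v2" }
}'
# progress can then be followed with: curl -s "localhost:9200/_tasks/<task_id>"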

Relevant Logs or Screenshots:

(Same fatal error and stack trace as shown in the description above.)

Upgrading to v2.14 didn’t fix this issue…

@Ivan.A apologies for the delayed reply. Could you please share which JDK version is being used by your OpenSearch installation? (It is printed at startup time.) Also, if possible, could you share the process memory usage (from the top command), specifically the RSS? Thank you.

Thanks for your reply.

We are using the OpenJDK VM, 21.0.3 (21.0.3+9-LTS). Regarding the “top” command: we are using the out-of-the-box Docker image from the Helm chart on Artifact Hub, which does not come with “top”, “dmesg”, “htop”… It’s really bare.
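Even without “top” in the image, the resident set size can be read from /proc inside the container, e.g. (the pod name is from the logs above; the container name “opensearch” and PID 1 being the JVM are assumptions about the chart and image):

# RSS of the main process; grep runs locally, so the image only needs cat
kubectl exec opensearch-cluster-ops-master-1 -c opensearch -- cat /proc/1/status | grep VmRSS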

Any ideas?

Got it. How do you monitor the memory consumption of the container? The OOM killer does not kick in (which seems like a good sign), but I have a hard time understanding what conditions may lead to “a fault occurred in an unsafe memory access operation”; one hypothesis is running low on free memory.
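If metrics-server is available on the AKS cluster (an assumption on my side), something like this would already give a rough picture and also rule out silent OOM kills:

# per-container CPU / memory usage
kubectl top pod opensearch-cluster-ops-master-1 --containers
# shows "OOMKilled" if the previous container instance was killed by the OOM killer
kubectl get pod opensearch-cluster-ops-master-1 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'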

At the moment we don’t have any monitoring system in place; we are working on it… Yeah, it doesn’t look like an OOM. By the way, a few days ago we increased the resources to:

opensearchJavaOpts: "-Xmx10g -Xms10g"

resources:
  requests:
    cpu: 5
    memory: 24Gi
  limits:
    cpu: 5
    memory: 24Gi

We still see some pod restarts, but the indices no longer get corrupted since we increased the number of primary shards and replicas.
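Until proper monitoring is in place, the nodes stats API at least gives the heap and OS memory figures on demand, e.g.:

# heap and OS memory per node
curl -s "localhost:9200/_nodes/stats/jvm,os?filter_path=nodes.*.name,nodes.*.jvm.mem,nodes.*.os.mem"
# or the compact _cat view
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,ram.current,ram.max"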

Oh, thank you for the insights. It seems like there is a correlation with available memory; maybe once you are able to scrape the process / container metrics, we can find out the circumstances under which it happens.

Hi, we changed the StorageClass after noticing we were using an Azure Disk class backed by HDDs… We switched to the default one, which is SSD-backed, and all the pods have stayed up without any restart :slight_smile: I will update this thread if anything changes, but it looks like the volumes were the problem.
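For anyone hitting the same issue, the class actually bound to the data volumes can be checked quickly (replace the namespace placeholder with your own):

# which StorageClass each PVC uses, and what classes exist on the cluster
kubectl get pvc -n <namespace> -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName
kubectl get storageclass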
