Facing a fatal error in the Java Runtime Environment after injecting I/O latency

chirumanem · February 29, 2024, 1:57am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch version 2.9.0
Java 11 (OpenJDK)

Describe the issue:
We have an OpenSearch cluster with 2 data pods, 3 master pods, and 1 ingest pod.

As part of our internal testing, we used Chaos Mesh to inject storage I/O latency. The experiment delays file system calls by 60s (delay) for 60s (duration) on the directory opt/opsearch/data, which is mounted in the OpenSearch containers (data and master). The delay is applied with a 100% probability per operation (percent) and to all pods (mode).
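For readers who want to reproduce the setup, the parameters above map roughly onto a Chaos Mesh IOChaos object. The following is only a sketch, not the manifest we actually applied: the object name, namespace, label selector, and the "master" container name are hypothetical, and the directory is copied as written in this post (Chaos Mesh expects `volumePath` to be the absolute mount point of the volume).

```python
import json

# Directory exactly as written in the post; the real value must be the
# absolute mount point of the data volume inside the container.
DATA_DIR = "opt/opsearch/data"

# Sketch of an IOChaos object for the experiment described above.
# Hypothetical values: metadata name, namespace, label selector,
# and the "master" container name.
io_chaos = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "IOChaos",
    "metadata": {"name": "opensearch-io-latency", "namespace": "my-namespace"},
    "spec": {
        "action": "latency",          # delay file system calls
        "mode": "all",                # every pod matched by the selector
        "selector": {
            "namespaces": ["my-namespace"],
            "labelSelectors": {"app": "opensearch"},
        },
        "containerNames": ["data", "master"],
        "volumePath": DATA_DIR,
        "delay": "60s",               # latency added to each call
        "percent": 100,               # probability of injection per operation
        "duration": "60s",            # how long the experiment runs
    },
}

# kubectl accepts JSON as well as YAML, so the output can be applied directly.
print(json.dumps(io_chaos, indent=2))
```

With a 60s delay at 100%, every read and write on the volume blocks for a full minute, which is far longer than typical probe timeouts.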

The goal of this test is to observe how containers/pods behave under storage latency and to verify that data persists and is not corrupted.

We discovered that injecting I/O latency chaos leads to a fatal error in the Java Runtime Environment in the data containers (we haven’t encountered such failures in the master containers). After the fatal error, the liveness probe fails and the container is restarted.

We tried raising the failureThreshold of the liveness probe as a workaround, but it didn’t help.

Could someone help me identify the cause of this behavior and suggest how to fix the fatal error that occurs in the Java Runtime Environment after introducing I/O latency chaos?

chirumanem · March 12, 2024, 4:43am

Hi,
Any update on this request?

reta · March 12, 2024, 2:42pm

@chirumanem could you be more specific regarding the errors you observe? If those are really related to the JVM / JRE, please ask the OpenJDK community [1] instead. Thank you.

[1] mail.openjdk.org Mailing Lists

chirumanem · March 15, 2024, 5:04am

Hi @reta,
Below are the errors we observe after applying I/O latency of 60s (delay) for 60s (duration) to the file system in directory opt/opsearch/data.

{"message":"# A fatal error has been detected by the Java Runtime Environment:","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}
{"message":"#","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}
{"message":"# SIGSEGV (0xb) at pc=0x00007f03f856a5fc, pid=279, tid=5144","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}

The containers restart after the above error logs because the liveness probe fails.

Regards,
Chiranjeevi

reta · March 15, 2024, 3:26pm

This generally indicates a JVM crash (SIGSEGV); please submit the report to the OpenJDK project, thank you.
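When preparing such a report, it can help to pull the crash header out of the JSON-wrapped container logs. The sketch below is only an illustration: the function names are hypothetical, and the sample lines are trimmed copies of the log records quoted earlier in the thread. Note that the full crash report is normally written by the JVM to an hs_err_pid<pid>.log file (or the location set with -XX:ErrorFile), which is lost on restart unless it lands on persistent storage.

```python
import json
import re

# Trimmed copies of the log records quoted in this thread (metadata omitted).
LOG_LINES = [
    '{"message":"# A fatal error has been detected by the Java Runtime Environment:","severity":"info"}',
    '{"message":"#","severity":"info"}',
    '{"message":"# SIGSEGV (0xb) at pc=0x00007f03f856a5fc, pid=279, tid=5144","severity":"info"}',
]

# Matches the signal line of a HotSpot crash header.
SIGNAL_RE = re.compile(
    r"^#\s+(SIG[A-Z]+) \((0x[0-9a-fA-F]+)\) at pc=(0x[0-9a-fA-F]+), pid=(\d+), tid=(\d+)"
)


def extract_crash_header(lines):
    """Return the 'message' fields that belong to the JVM crash header ('#' lines)."""
    messages = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON records
        if not isinstance(record, dict):
            continue
        message = record.get("message", "")
        if isinstance(message, str) and message.startswith("#"):
            messages.append(message)
    return messages


def parse_signal(messages):
    """Return signal details from the first matching header line, or None."""
    for message in messages:
        match = SIGNAL_RE.match(message)
        if match:
            signal, code, pc, pid, tid = match.groups()
            return {"signal": signal, "code": code, "pc": pc,
                    "pid": int(pid), "tid": int(tid)}
    return None


header = extract_crash_header(LOG_LINES)
details = parse_signal(header)
print("\n".join(header))
print(details)
```

The pid in the header (279 here) tells you which hs_err_pid file to look for in the container's working directory.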
