Facing a fatal error in the Java Runtime Environment after injecting I/O latency

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch version 2.9.0
OpenJDK 11

Describe the issue:
We have an OpenSearch cluster with 2 data pods, 3 master pods, and 1 ingest pod.

As part of our internal testing, we tested storage I/O latency with Chaos Mesh. We injected a latency of 60s (delay) into file system calls for 60s (duration) on the directory opt/opsearch/data mounted in the OpenSearch containers (data and master), with a 100% probability per operation (percent), for all pods (mode).
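
For reference, the experiment described above roughly corresponds to a Chaos Mesh IOChaos resource. The following is only a sketch of how such a manifest could be generated, not our exact configuration: the resource name and the label selector are hypothetical, the namespace is the one that appears in our logs, and the directory is copied as written above (Chaos Mesh expects the volume mount point inside the target container, so the real value may differ).

```python
import json


def build_io_latency_chaos(name, namespace, volume_path, label_selectors,
                           delay="60s", duration="60s", percent=100):
    """Build an IOChaos manifest (as a dict) that delays file system calls."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "IOChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "latency",        # delay file system calls
            "mode": "all",              # target every pod matched by the selector
            "selector": {"labelSelectors": label_selectors},
            "volumePath": volume_path,  # mount point of the volume in the container
            "delay": delay,             # latency added to each affected call
            "percent": percent,         # share of operations that are affected
            "duration": duration,       # how long the experiment runs
        },
    }


manifest = build_io_latency_chaos(
    name="io-latency-data",                       # hypothetical resource name
    namespace="zmanrao",                          # namespace seen in the logs
    volume_path="opt/opsearch/data",              # as written in this post
    label_selectors={"app": "opensearch-data"},   # hypothetical pod label
)

# kubectl also accepts JSON, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```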

The scope of this study is to observe how the containers/pods behave when faced with storage latency and to verify that data is persisted and not corrupted.

We discovered that injecting I/O latency chaos leads to a fatal error in the Java Runtime Environment in the data containers (we haven’t encountered such failures in the master containers). Following the fatal error, we also observed liveness probe failures, and the container was restarted.

We tried raising the failureThreshold of the liveness probe as a workaround, but it didn’t help.

Could someone help me identify the reason for this behavior and suggest a course of action to fix the fatal error that occurs in the Java Runtime Environment after introducing I/O latency chaos?

Hi,
Any update on this request?

@chirumanem could you be more specific regarding the errors you observe? If those are really related to JVM / JRE, please ask OpenJDK community [1] instead. Thank you.

[1] mail.openjdk.org Mailing Lists

Hi @reta ,
Below are the errors we observe after applying an I/O latency of 60s (delay) for 60s (duration) to the file system in directory opt/opsearch/data.

{"message":"# A fatal error has been detected by the Java Runtime Environment:","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}
{"message":"#","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}
{"message":"# SIGSEGV (0xb) at pc=0x00007f03f856a5fc, pid=279, tid=5144","metadata":{"container_name":"data","namespace":"zmanrao","pod_name":"eric-data-search-engine-data-0"},"service_id":"eric-data-search-engine","severity":"info","timestamp":"2024-02-27T12:41:22.203+00:00","version":"1.2.0"}

After the above error logs, the containers are restarted due to liveness probe failure.
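
The JVM crash banner is easier to read without the JSON wrapper that our logging adds. As a rough sketch (assuming one JSON object per line with a "message" field, as in the excerpt above), the following pulls out the raw messages and the signal line. The function names are illustrative only, and the sample lines are shortened versions of the log lines above. Typographic quotes are converted to plain ones in case the logs were copied from a rendered page.

```python
import json

# Map typographic double quotes (e.g. from forum rendering) back to ASCII quotes.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def extract_messages(lines):
    """Return the "message" field of every line that parses as a JSON object."""
    messages = []
    for line in lines:
        line = line.strip().translate(_QUOTES)
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if isinstance(record, dict) and "message" in record:
            messages.append(record["message"])
    return messages


def find_signal_line(messages):
    """Return the first banner line that names a signal (e.g. SIGSEGV), or None."""
    for message in messages:
        if message.startswith("# SIG"):
            return message
    return None


sample = [
    '{"message":"# A fatal error has been detected by the Java Runtime Environment:","severity":"info"}',
    '{"message":"#","severity":"info"}',
    '{"message":"# SIGSEGV (0xb) at pc=0x00007f03f856a5fc, pid=279, tid=5144","severity":"info"}',
]

banner = extract_messages(sample)
print("\n".join(banner))
print("signal line:", find_signal_line(banner))
```

The same JVM crash normally also produces an hs_err_pid<pid>.log file, which contains the full report.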

Regards,
Chiranjeevi

This generally indicates a JVM crash (SIGSEGV). Please submit the report to the OpenJDK project, thank you.