OpenSearch Insufficient Memory Regression after 2.19 Upgrade

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

2.19; regression introduced between 2.16 and 2.19.

Describe the issue:

After upgrading from 2.16 to 2.19, our OpenSearch cluster has been restarting roughly once a day with the following error:

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2097152 bytes. Error detail: AllocateHeap
# An error report file with more information is saved as:
# /usr/share/opensearch/hs_err_pid10.log

JVM memory is within normal bounds, so we suspect this failure is due to the max map count being reached. We have been setting vm.max_map_count to 262144 via sysctl, which has not caused issues before. Have there been any changes between 2.16 and 2.19 that could have introduced a regression here?
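For reference, a quick way to compare the live mapping count against the configured limit (assuming, as in the crash report above, that the OpenSearch JVM runs as PID 10 inside the container):

sysctl vm.max_map_count          # configured limit (262144 in our case)
wc -l /proc/10/maps              # current number of mappings held by the process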

Configuration:

Cluster running 4 x {64 CPU, 512 GB RAM} nodes backed by GCP persistent disks. 102 GB heap.

Relevant Logs or Screenshots:

Kubernetes memory usage is well below the limit. We have been unable to retrieve the hs_err_pid log file that is dumped at crash time.
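As a possible workaround for capturing the crash report (untested on our side), the JVM error file could be pointed at the persisted data volume via config/jvm.options; the path below just assumes the default data mount of the official image:

# config/jvm.options: write the crash report to a persisted location
-XX:ErrorFile=/usr/share/opensearch/data/hs_err_pid%p.log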

Let me know if there’s any other information that could help diagnose the issue.

Digging further: according to cat /proc/10/maps, roughly 246K of the 252K memory mappings point at deleted .dvd (Lucene doc values) files. Is this indicative of a Lucene issue?
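For reference, this is roughly how the counts above were obtained (again assuming PID 10):

wc -l /proc/10/maps                          # total memory mappings (~252K)
grep -c '\.dvd (deleted)' /proc/10/maps      # mappings of deleted .dvd files (~246K)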

I am running into the same issue after updating from 2.17.1 to 2.19.1. I was running Java 17 and upgraded to Java 21; this stabilized my cluster somewhat, but the problem still recurs after a few hours.

I also found that most of the open maps are deleted files, so I set vm.max_map_count far higher than the default to "solve" the issue for now.
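For completeness, this is the kind of change I applied (the value is just an example with plenty of headroom, not a recommendation):

sysctl -w vm.max_map_count=1048576                                    # apply immediately
echo 'vm.max_map_count=1048576' > /etc/sysctl.d/99-opensearch.conf    # persist across reboots

Note that on Kubernetes this is a host-level kernel setting, so it has to be applied on the worker nodes (or via a privileged init container), not inside the OpenSearch pod.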

I asked on the Lucene mailing list here: https://lists.apache.org/thread/4lqh5w9mxm4ffr5kxlxhh06d9gdv3gto. I will try what was suggested there and report back to this thread.
