We are trying to index a large number of files from a file share. Documents are indexed in batches of 100, each batch running through an ingest pipeline that uses the ingest-attachment plugin and then generates vector embeddings. This works well, with one issue: memory keeps increasing until the OpenSearch process gets killed by the system.
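For context, a pipeline along these lines can be sketched as follows. This is a minimal sketch, not our exact configuration: the pipeline description, model id, token limit, and field names are hypothetical placeholders; the processor types (`attachment`, `text_chunking`, `text_embedding`) are the ones provided by the ingest-attachment and neural-search plugins.

```python
# Sketch of an ingest pipeline that extracts text from base64 attachments,
# chunks it, and embeds the chunks. Model id and field names are placeholders.
pipeline_body = {
    "description": "Extract text, chunk it, embed the chunks",
    "processors": [
        {
            "attachment": {
                "field": "data",  # base64-encoded file content
                "target_field": "attachment",
            }
        },
        {
            "text_chunking": {
                "algorithm": {"fixed_token_length": {"token_limit": 384}},
                "field_map": {"attachment.content": "chunks"},
            }
        },
        {
            "text_embedding": {
                "model_id": "my-embedding-model-id",  # placeholder
                "field_map": {"chunks": "chunk_embeddings"},
            }
        },
    ],
}

# With opensearch-py, the pipeline would be registered roughly like this:
# client.ingest.put_pipeline(id="attachment-embedding-pipeline", body=pipeline_body)
processor_types = [next(iter(p)) for p in pipeline_body["processors"]]
print(processor_types)  # ['attachment', 'text_chunking', 'text_embedding']
```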
I already tried increasing RAM and heap space, but this does not solve the issue: after 19,400 indexed documents the process gets killed for using too much memory, regardless of the total available memory or the configured heap size.
Tested with:

- 16 GB total memory, 4 GB heap
- 32 GB total memory, 16 GB heap

Both configurations break after exactly the same number of indexed batches: 194/451.
Sadly, deactivating the refresh interval did not help; the process still runs OOM. But I added additional logging, and it looks like the non-heap memory keeps growing until it runs out.
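One way to track this between batches is to poll `GET _nodes/stats/jvm` and watch `jvm.mem.non_heap_used_in_bytes`. A small sketch of the bookkeeping; the `stats` dict below is a hand-made example trimmed to the fields used, not real cluster output:

```python
def non_heap_used_bytes(node_stats: dict) -> int:
    """Sum jvm.mem.non_heap_used_in_bytes across all nodes in a
    GET _nodes/stats/jvm response body."""
    return sum(
        node["jvm"]["mem"]["non_heap_used_in_bytes"]
        for node in node_stats["nodes"].values()
    )

# Hand-made example response, reduced to the fields read above:
stats = {
    "nodes": {
        "node-1": {"jvm": {"mem": {"non_heap_used_in_bytes": 512 * 1024 * 1024}}},
    }
}
print(non_heap_used_bytes(stats))  # 536870912
```

Logging this number after every batch should show whether non-heap usage really grows monotonically with the number of indexed documents.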
When I tested with a small batch during the day, memory usage also increased and didn't go back down until I restarted OpenSearch. We could add monitoring and reboot OpenSearch every time memory gets low, but that can only be a temporary workaround. I'll investigate further whether the embedding pipeline causes this, but any help is highly appreciated.
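In case anyone wants the workaround anyway, the watchdog decision can be kept trivially simple. A sketch, assuming a hypothetical 10% free-memory threshold (the actual restart mechanism, e.g. `systemctl restart opensearch`, would hang off this check):

```python
def should_restart(available_bytes: int, total_bytes: int,
                   min_free_ratio: float = 0.1) -> bool:
    """Restart once available memory drops below min_free_ratio of total.
    The 10% default is an arbitrary example threshold."""
    return available_bytes < total_bytes * min_free_ratio

GIB = 1024 ** 3
print(should_restart(available_bytes=1 * GIB, total_bytes=32 * GIB))  # True
print(should_restart(available_bytes=8 * GIB, total_bytes=32 * GIB))  # False
```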
So I have now tried with a small subset of files, with and without the vector-embedding pipeline. To me it seems the vector-embedding pipeline is what causes the increasing memory usage: if I disable it, heap memory usage is similar, but overall memory usage does not increase.
The pipeline in question takes base64-encoded data and first runs it through the ingest-attachment plugin, then chunks the text and runs the chunks through the text-embedding processor. For the test above I only excluded the text-embedding step, and memory stayed fine.
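For anyone reproducing this A/B test, the embedding-free variant can be derived from the full pipeline body by dropping the `text_embedding` processor rather than maintaining a second pipeline definition. A sketch; the pipeline body here is a stripped-down stand-in, not the real configuration:

```python
def without_processor(pipeline_body: dict, processor_type: str) -> dict:
    """Return a copy of an ingest pipeline body with every processor of the
    given type (e.g. 'text_embedding') removed."""
    return {
        **pipeline_body,
        "processors": [
            p for p in pipeline_body["processors"] if processor_type not in p
        ],
    }

full = {"processors": [{"attachment": {}}, {"text_chunking": {}}, {"text_embedding": {}}]}
stripped = without_processor(full, "text_embedding")
print([next(iter(p)) for p in stripped["processors"]])  # ['attachment', 'text_chunking']
```

Registering `stripped` under a second pipeline id keeps the two test runs identical except for the embedding step.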