Memory Leak/Garbage Collector issues

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

"version" : {
  "distribution" : "opensearch",
  "number" : "2.15.0",
  "build_type" : "rpm",
  "build_hash" : "61dbcd0795c9bfe9b81e5762175414bc38bbcadf",
  "build_date" : "2024-06-20T03:27:31.591886152Z",
  "build_snapshot" : false,
  "lucene_version" : "9.10.0",
  "minimum_wire_compatibility_version" : "7.10.0",
  "minimum_index_compatibility_version" : "7.0.0"
},

Server OS: SLES15

Describe the issue:

We are trying to index a large number of files from a file share. For this, 100 documents are indexed per batch, running through an ingest-attachment processor as well as an embedding pipeline for vector embeddings. This works well, with one issue: memory constantly increases until the OpenSearch process gets killed by the system.
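
For context, a batch is essentially a bulk request with the ingest pipeline attached; roughly like this (index name, IDs, and the filename field are placeholders, data carries the base64-encoded file content):

POST /files/_bulk?pipeline=attachment
{ "index": { "_id": "file-0001" } }
{ "data": "<base64-encoded file content>", "filename": "report.pdf" }
{ "index": { "_id": "file-0002" } }
{ "data": "<base64-encoded file content>", "filename": "notes.docx" }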

I already tried increasing RAM and heap space, but this does not solve the issue:
after 19,400 indexed documents, the process gets killed because it uses too much memory, regardless of the total available memory or the configured heap size.

Tested with:
16GB total memory, 4GB heap
32GB total memory, 16GB heap
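
Heap was set the usual way for an RPM install via jvm.options, i.e. for the second configuration roughly:

# /etc/opensearch/jvm.options (second configuration)
-Xms16g
-Xmx16g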

Both configurations break after exactly the same number of indexed batches: 194/451.


How can we prevent OpenSearch from using more and more memory during continuous indexing?

Hi @devmoreng,

have you tried adjusting the refresh interval?
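
For a bulk import you can disable automatic refresh on the target index and re-enable it once the import is done, roughly like this (index name is a placeholder):

PUT /files/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

Afterwards set it back to a normal value, e.g. "1s".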

here is more info:

best,
mj

No, I was unaware of this. Thank you! Will try that out.


Sadly, deactivating the refresh interval did not help; the process still runs OOM. But I added additional logging, and it looks like the non-heap memory keeps growing until it runs out.
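
For anyone who wants to watch this themselves: heap vs. non-heap usage can be read from the node stats, e.g. (the filter_path just trims the response):

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem

The relevant fields there are heap_used_in_bytes and non_heap_used_in_bytes.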

When I tested with a small batch during the day, memory usage also increased and didn't go back down until I restarted OpenSearch. We could add monitoring and restart OpenSearch every time memory gets low, but that can only be a temporary workaround. I'll investigate further whether the embedding pipeline causes this, but any help is highly appreciated.

It might be related to [BUG] High memory consumption · Issue #15934 · opensearch-project/OpenSearch · GitHub. @devmoreng, are you on managed AWS or self-hosted?

I had already found that issue; unfortunately, it has no real solution. And no, we are not on AWS but self-hosted.

So I have now tried with a small subset of files, both with and without the vector-embedding pipeline. It seems to me that the vector-embedding pipeline might be what causes the increasing memory usage. If I disable it, heap memory usage is similar, but overall memory usage does not increase.

The pipeline in question takes base64-encoded data and first runs it through the ingest-attachment plugin, then chunks the extracted text and runs the text chunks through the text_embedding processor. For the test above I excluded only the text_embedding part, and memory stayed fine.

{
  "attachment": {
    "description": "Extract attachment information, map to fulltext, and generate embeddings",
    "processors": [
      {
        "attachment": {
          "field": "data",
          "target_field": "attachment"
        }
      },
      {
        "set": {
          "field": "fulltext",
          "value": "{{attachment.content}}"
        }
      },
      {
        "script": {
          "source": """
                                if (ctx.language == null && ctx.attachment.language != null) {
                                    ctx.language = ctx.attachment.language;
                                }
                                if (ctx.file_type == null && ctx.attachment.content_type != null) {
                                    ctx.file_type = ctx.attachment.content_type;
                                }
                            """
        }
      },
      {
        "text_chunking": {
          "field_map": {
            "fulltext": "fulltext_chunks"
          },
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10000,
              "overlap_rate": 0.1
            }
          }
        }
      },
      {
        "text_chunking": {
          "field_map": {
            "fulltext_chunks": "tmp_chunks"
          },
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 384,
              "overlap_rate": 0.2
            }
          }
        }
      },
      {
        "text_embedding": {
          "model_id": "2ggnPJQBN3iZDHu2iDI2",
          "field_map": {
            "tmp_chunks": "tmp_knn"
          }
        }
      },
      {
        "script": {
          "source": """
                    if (ctx.tmp_chunks != null && ctx.tmp_knn != null) {
                        ctx.vector_embeddings = [];
                        for (int i = 0; i < ctx.tmp_chunks.size(); i++) {
                            ctx.vector_embeddings.add(['chunk_text': ctx.tmp_chunks[i], 'knn': ctx.tmp_knn[i].knn]);
                        }
                    }
                """
        }
      },
      {
        "remove": {
          "ignore_failure": true,
          "field": [
            "tmp_chunks",
            "tmp_knn"
          ]
        }
      },
      {
        "remove": {
          "ignore_failure": true,
          "field": [
            "data",
            "attachment"
          ]
        }
      },
      {
        "remove": {
          "ignore_failure": true,
          "field": "fulltext"
        }
      }
    ]
  }
}
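
A trimmed-down variant without the text_embedding step can also be tried against a single document via the simulate API before re-running the whole import; a minimal sketch (the base64 payload just decodes to a short test string):

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "attachment": { "field": "data", "target_field": "attachment" } },
      { "set": { "field": "fulltext", "value": "{{attachment.content}}" } },
      {
        "text_chunking": {
          "field_map": { "fulltext": "fulltext_chunks" },
          "algorithm": { "fixed_token_length": { "token_limit": 384, "overlap_rate": 0.2 } }
        }
      }
    ]
  },
  "docs": [
    { "_source": { "data": "UXVpY2sgdGVzdCBkb2N1bWVudA==" } }
  ]
}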

Thank you @devmoreng for digging into it. Are you using the ml-commons/k-NN plugin for embedding, or something else?

Yes, the pipeline is based on this documentation about semantic search:

and this tutorial about neural search:

We are using the recommended model huggingface/sentence-transformers/msmarco-distilbert-base-tas-b
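
For completeness, registering and deploying that model through ml-commons looks roughly like this (the exact version string may differ from the one we used):

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}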

And yes, we use the ml-commons/knn plugin, which comes bundled, I believe? At least I only installed the ingest-attachment plugin separately.

I am currently checking with colleagues whether we can update to OpenSearch 2.18 to see if that fixes the memory problem.
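
In the meantime, what ml-commons itself reports about the deployed model and its memory can be checked with (model ID taken from the pipeline above):

GET /_plugins/_ml/models/2ggnPJQBN3iZDHu2iDI2
GET /_plugins/_ml/stats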

Got it, maybe opening an issue on the plugin itself would help [1], thank you.

[1] GitHub · Where software is built