Best way to index and update 20TB of files on a shared drive?

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18 / Windows 10

Describe the issue:
Hi,

I’m looking for recommendations on tools or methods to efficiently scan, index, and continuously update a shared drive containing about 20 TB of various file types (documents, PDFs, images, etc.).

What tools or workflows would you recommend? Are there any specific best practices for managing large-scale shared drive indexing with OpenSearch?

Thanks in advance for your insights!

Configuration:

Relevant Logs or Screenshots:

Hi @syo, here are some good tips on best practices:

Best,
mj

Since I’m working on something similar at the moment, I can give you some general suggestions:

Think about what kind of data structure you want to have:

  • Do you need to restrict search results based on group permissions for different folders, or does everyone have full access?
  • Field suggestions: title, change_date, file_type, fulltext, owner, (permission_groups)
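
For illustration, a minimal mapping for those fields could look roughly like this sketch with the opensearch-py client (the index name `fileshare`, host details, and field types are just assumptions, adjust to your setup):

```python
from opensearchpy import OpenSearch

# Assumed connection details -- adjust host/auth for your cluster.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Minimal mapping sketch for the suggested fields; "fileshare" is an assumed index name.
client.indices.create(
    index="fileshare",
    body={
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "change_date": {"type": "date"},
                "file_type": {"type": "keyword"},
                "fulltext": {"type": "text"},
                "owner": {"type": "keyword"},
                # Only needed if you restrict results by group permissions.
                "permission_groups": {"type": "keyword"},
            }
        }
    },
)
```

If you do need the permission restriction, one common approach is to filter at query time with a `terms` filter on `permission_groups` containing the groups of the searching user.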

Use the ingest-attachment pipeline to parse documents into fulltext
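
Creating such a pipeline could look roughly like this (the pipeline id `attachment_pipeline` and the field names are my own assumptions; the ingest-attachment plugin has to be installed, and its attachment processor expects the file content base64-encoded in the source field):

```python
import base64

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Pipeline sketch: extract text with the attachment processor, copy it into
# the "fulltext" field from the mapping, then drop the raw/intermediate fields.
client.ingest.put_pipeline(
    id="attachment_pipeline",
    body={
        "description": "Extract fulltext from base64-encoded file content",
        "processors": [
            {
                "attachment": {
                    "field": "data",          # base64-encoded file bytes
                    "target_field": "attachment",
                    "indexed_chars": -1,      # do not truncate extracted text
                }
            },
            {"set": {"field": "fulltext", "value": "{{attachment.content}}"}},
            {"remove": {"field": ["data", "attachment"]}},
        ],
    },
)

# Indexing a single file through the pipeline:
with open("/path/to/file.pdf", "rb") as f:
    client.index(
        index="fileshare",
        body={"title": "file.pdf", "data": base64.b64encode(f.read()).decode()},
        pipeline="attachment_pipeline",
    )
```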

Filter which documents you index based on mimetypes
You might want to index only Word/PowerPoint/Excel/text/etc. and exclude zip archives and binary files.
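
On the crawler side, a simple allow-list check could look like this sketch (the mimetype set is just an example, extend it to whatever your users actually need):

```python
import mimetypes
from pathlib import Path

# Example allow-list of mimetypes worth indexing.
ALLOWED_MIMETYPES = {
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/plain",
}

def should_index(path: Path) -> bool:
    """Guess the mimetype from the filename and skip anything not on the allow-list."""
    mime, _ = mimetypes.guess_type(path.name)
    return mime in ALLOWED_MIMETYPES
```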

Start testing everything with a small subfolder, then increase the amount you index
We currently index 100 documents at a time in a bulk request; our file share has ~45,000 documents, which is only a few TB at most. Even so, our indexing runs into memory issues, which I currently suspect is due to a memory leak in OpenSearch 2.15.
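
A batched run with the bulk helper could look roughly like this sketch (it reuses `should_index` and the pipeline id from the snippets above; the chunk size, share path, and field values are illustrative, not a recommendation):

```python
import base64
from pathlib import Path

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def generate_actions(root: Path):
    """Yield one bulk action per file under the share; fields mirror the mapping above."""
    for path in root.rglob("*"):
        # should_index() is the mimetype filter from the earlier snippet.
        if not path.is_file() or not should_index(path):
            continue
        yield {
            "_index": "fileshare",
            "_id": str(path),  # using the path as id makes re-runs idempotent
            "title": path.name,
            "change_date": int(path.stat().st_mtime * 1000),  # epoch millis
            "file_type": path.suffix.lstrip("."),
            "data": base64.b64encode(path.read_bytes()).decode(),
        }

# chunk_size=100 mirrors the batch size mentioned above; tune it to your heap.
# The pipeline kwarg is passed through to the _bulk request as the default pipeline.
bulk(
    client,
    generate_actions(Path(r"\\server\share")),
    chunk_size=100,
    pipeline="attachment_pipeline",
)
```

For continuous updates you would run something like this on a schedule and skip files whose modification time is older than the last run, but that part depends heavily on your share and tooling.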

Good luck, it's a lot of work to get right.