Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18 / Windows 10
Describe the issue:
Hi,
I’m looking for recommendations on tools or methods to efficiently scan, index, and continuously keep up to date an index of a shared drive containing about 20 TB of files of various types (documents, PDFs, images, etc.).
What tools or workflows would you recommend? Are there any specific best practices for managing large-scale shared drive indexing with OpenSearch?
Since I’m working on something similar at the moment, I can give you some general suggestions:
Think about which kind of data structure you want to have:
Do you need to restrict the search based on group permissions for different folders, or does everyone have full access?
Field suggestions: title, change_date, file_type, fulltext, owner, (permission_groups)
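As a rough sketch of such a mapping with the opensearch-py client (connection details, the index name "fileshare" and the field types are placeholders, not a recommendation):

```python
from opensearchpy import OpenSearch

# Placeholder connection details -- adjust host, auth and TLS to your cluster.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="fileshare",  # placeholder index name
    body={
        "mappings": {
            "properties": {
                "title":             {"type": "text"},
                "change_date":       {"type": "date"},
                "file_type":         {"type": "keyword"},
                "fulltext":          {"type": "text"},
                "owner":             {"type": "keyword"},
                # Only needed if results must be filtered per user/group.
                "permission_groups": {"type": "keyword"},
            }
        }
    },
)
```

With permission_groups stored as a keyword field, you can later restrict searches with a simple terms filter on the groups the current user belongs to.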
Use an ingest pipeline with the ingest-attachment plugin to parse documents into fulltext
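Something along these lines should work, assuming the ingest-attachment plugin is installed on your cluster (the pipeline id "fileshare-attachment" and the field names are again just placeholders; `client` is the one from the mapping sketch above):

```python
# Create an ingest pipeline that runs the attachment processor on the
# base64-encoded file content and copies the extracted text to "fulltext".
client.ingest.put_pipeline(
    id="fileshare-attachment",  # placeholder pipeline id
    body={
        "description": "Extract fulltext from base64-encoded files",
        "processors": [
            {"attachment": {"field": "data", "target_field": "attachment"}},
            {"set": {"field": "fulltext", "value": "{{attachment.content}}"}},
            # Drop the (potentially large) base64 payload after extraction.
            {"remove": {"field": "data"}},
        ],
    },
)
```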
Filter which documents you index based on mimetypes
You might want to only index Word/Powerpoint/Excel/text/etc. and exclude zip and binary files
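That filtering is easiest to do in the crawler itself, before anything is sent to OpenSearch. A minimal sketch using Python's standard mimetypes module (it guesses by file extension; the allow-list below is only an example):

```python
import mimetypes
from pathlib import Path

# Example allow-list: Office documents, PDFs and plain text.
# Everything else (zip archives, binaries, ...) is skipped.
ALLOWED_MIMETYPES = {
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "text/plain",
}

def should_index(path: Path) -> bool:
    """Return True if the file's guessed mimetype is on the allow-list."""
    mimetype, _ = mimetypes.guess_type(path.name)
    return mimetype in ALLOWED_MIMETYPES
```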
Start testing everything with a small subfolder, then increase the amount you index
We currently index 100 documents at a time in a bulk request; our fileshare has ~45,000 documents, which adds up to a few TB at most. Even so, our indexing runs into memory issues, which I currently believe are caused by a memory leak in OpenSearch 2.15.
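For reference, batched indexing of 100 documents at a time could look roughly like the sketch below (it reuses `client`, `should_index` and the placeholder pipeline id from the earlier sketches; "/mnt/share" is a placeholder path). streaming_bulk keeps only one chunk in client memory at a time, although that of course does not help if the leak is on the server side:

```python
import base64
from datetime import datetime, timezone
from pathlib import Path

from opensearchpy.helpers import streaming_bulk

def generate_actions(paths):
    """Yield one bulk action per file, reading files lazily one at a time."""
    for path in paths:
        yield {
            "_index": "fileshare",
            "_id": str(path),  # the path doubles as a stable document id
            "_source": {
                "title": path.name,
                "file_type": path.suffix.lstrip(".").lower(),
                "change_date": datetime.fromtimestamp(
                    path.stat().st_mtime, tz=timezone.utc
                ).isoformat(),
                # Base64-encoded content for the attachment processor.
                "data": base64.b64encode(path.read_bytes()).decode("ascii"),
            },
        }

files = (
    p for p in Path("/mnt/share").rglob("*")
    if p.is_file() and should_index(p)
)

# chunk_size=100 mirrors the batch size mentioned above.
for ok, result in streaming_bulk(
    client,
    generate_actions(files),
    chunk_size=100,
    pipeline="fileshare-attachment",
):
    if not ok:
        print("failed to index:", result)
```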