I’ve recently moved from Elastic to Open Distro. However, if I understood correctly, OpenSearch is the way forward instead.
I’ve moved almost all the functionality we currently use to OpenSearch, but I’m left with one gap:
To index SMB/NFS shares in our organisation I’ve been using FSCrawler (see the FSCrawler 2.10-SNAPSHOT documentation) and its Docker image (on Docker Hub).
Is there an alternative for indexing files on an SMB/NFS share that is compatible with OpenSearch?
My google-fu has not turned up anything.
Tika has explicit support for OpenSearch. Recent versions of FS Crawler, however, use a version of the Elasticsearch Java client that contains code blocking OpenSearch. There is an OpenSearch Java client in the works, but in the meantime an older version of FS Crawler should work (one that uses Elasticsearch REST Client 7.13.4 or lower).
Once the OpenSearch Java client is GA, I think we could easily help FS Crawler support OpenSearch - it’s a fairly simple conversion.
@searchymcsearchface I have actually tried, with the same Docker config I used for Elastic.
It just starts and exits with an error code that says basically nothing.
I’ll try an older version and see what that gives. Thanks for the suggestion!
I managed to get FSCrawler working with OpenSearch, but I had to build it myself with a few tweaks:
As @searchymcsearchface said, it needs version 7.13.4, or you can check out the last known commit that was using 7.13.4 from the git repo… c3d120ea33c3d53fb2182ae72d5634cd15f50593
Before you build it, comment out the checkVersion() call, because it will halt FSCrawler when it detects that its expected version “7” does not match OpenSearch’s version number “1”.
After building, you can try to run it. It will complain that no default settings were found for version “1”, so just copy the folder ~/.fscrawler/_default/7/ to ~/.fscrawler/_default/1/
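For anyone who would rather script the last step, here is a minimal Python sketch of it. The function names are my own and are not part of FSCrawler. It assumes that the cluster's root response carries the version under `version.number`, and that the settings live under `_default/<major version>` as described above.

```python
import shutil
from pathlib import Path


def major_version(root_response: dict) -> str:
    """Return the major version from a cluster root ("GET /") response body."""
    return root_response["version"]["number"].split(".")[0]


def clone_default_settings(config_dir: Path, source: str, target: str) -> Path:
    """Copy <config_dir>/_default/<source> to <config_dir>/_default/<target>.

    Does nothing if the target folder already exists, so it is safe to rerun.
    """
    src = config_dir / "_default" / source
    dst = config_dir / "_default" / target
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst
```

Calling `clone_default_settings(Path.home() / ".fscrawler", "7", "1")` should be equivalent to the manual copy. `major_version` is only there to show where the “1” comes from.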
You have just been promoted to my life saver ;). Many thanks for investigating this.
I’m not well versed in git or in building from a specific version, so I’ll have to investigate. I don’t suppose you have your own repo containing the version you’ve built?
I suppose you run it on the local machine where you did your build, whereas I need a Docker image. If I remember correctly there are Docker build instructions for FSCrawler somewhere as well, so (again) I guess I’ll have to investigate.
After the current world-ending work crisis (what’s in a name) has been averted, I’ll report back here with my findings and experiences!
If you can get FSCrawler to work, definitely go that route. David Pilato has done some great work there, and it is battle hardened.
Over on Apache Tika in 2.x, we’ve added fetchers and emitters that might be of interest. The notion is you configure a fetcher to get the bytes of files (we currently have local fileshare, s3 and gcs) and an emitter (we support local fileshare, Solr and OpenSearch)…tika takes care of most of the rest.
To scale it out, you can spin up a bunch of tika-servers in a pod and farm out requests to tika-servers. You send the file keys, tika fetches the bytes, runs the parse and emits to OpenSearch.
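To make the fan-out idea concrete, here is a rough Python sketch that spreads file keys over a set of tika-servers round-robin and builds one request body per file. Everything specific in it is an assumption for illustration: the fetcher and emitter names are whatever you configured in your own tika-config, and the JSON field names should be checked against the Tika pipes documentation for your version.

```python
import json
from itertools import cycle


def plan_requests(file_keys, servers,
                  fetcher="my-fetcher", emitter="my-opensearch-emitter"):
    """Assign each file key to a server round-robin and build its request body.

    Returns a list of (server_url, json_body) pairs. Nothing is sent here.
    The fetcher/emitter names and the JSON field names are assumptions.
    """
    plan = []
    for key, server in zip(file_keys, cycle(servers)):
        body = {
            "fetcher": fetcher,   # assumed: name of a configured fetcher
            "fetchKey": key,      # assumed: key the fetcher resolves to bytes
            "emitter": emitter,   # assumed: name of a configured emitter
            "emitKey": key,       # assumed: document key used when emitting
        }
        plan.append((server, json.dumps(body)))
    return plan
```

Each pair would then be POSTed to the matching server by whatever HTTP client you prefer. Round-robin is just the simplest choice here; a work queue would cope better with files of very different sizes.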
Building from the git version before 7.13.4 does not work (for me at least), since the Docker image contains a bug that was resolved later.
I’ve looked over the link you gave to the presentation on the Tika fetchers; SMB/NFS support would definitely be needed for it to fit our needs.
To “fix” FSCrawler so it works with OpenSearch again, I would assume that just adding OpenSearch’s Java REST client as another option in FSCrawler would be “enough”, as right now the following modules are present:
Disclaimer: I’m also an Elastic employee, so I might be biased.
Thanks @tallison for pinging me in the Tika issue so I discovered this discussion.
I have some plans for the future, and one of my ideas is to make FSCrawler even more pluggable.
I’d like to support a plugin system so people would be able to write their own plugins easily.
Another thing I’m thinking of is to entirely remove the dependency on any client library and build what I need myself.
That would mean that the same internal client would support any version of Elasticsearch, which would probably mean OpenSearch as well, although it won’t be tested.
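As a rough illustration of why a client-free approach could cover both products, here is a small Python sketch that builds a plain HTTP indexing request using only the standard library. It is not FSCrawler code and not a design proposal; the function name is made up. The only thing it relies on is the `PUT /<index>/_doc/<id>` document API, which Elasticsearch 7 and OpenSearch both accept.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(base_url: str, index: str, doc_id: str,
                        document: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_doc/<id> request."""
    # Percent-encode the id so that file paths can be used as document ids.
    quoted_id = urllib.parse.quote(doc_id, safe="")
    url = f"{base_url.rstrip('/')}/{index}/_doc/{quoted_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request)`, with authentication and TLS settings added as your cluster requires.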
Another idea would be to support a Beats protocol output. That would allow people to connect FSCrawler to Logstash, for example, and let them use whatever output they want.
Sadly, for all this, there’s a lot of refactoring to do, and I have no idea of when I’ll be able to do this.
One short-term thing people could do is fork FSCrawler and change the Elasticsearch client to the OpenSearch one.
Of course, I recommend using Elastic and its Workplace Search product, as it provides a full search solution for free, including a powerful UI and connectors to many other systems (Dropbox, Gmail, GitHub…). FSCrawler has supported it since 2.7.
Also adding here, for reference, the discussion I already had with @Scarecrow