example: “Hello I am a sentence about OpenSearch.”, [0.23455, 0.644, 0.0, 0.3446, … , 0.1395], “/path/to/file.pdf”
The workflow I have in mind looks like this:
1. Search for a string (can be anything, might not be in the database)
2. Receive a list of the most similar sentences (based on feature vector distance)
3. Open the associated file of the closest result
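To make the three steps concrete, here is a minimal brute-force sketch in plain Python. It assumes the (sentence, feature vector, file path) record layout from the example above; the sample records, the file paths, and the function names are made up for illustration, and the query vector is assumed to come from whatever embedding step the calling application already uses.

```python
import math

# Hypothetical records: (sentence, feature vector, file path).
RECORDS = [
    ("Hello I am a sentence about OpenSearch.", [0.9, 0.1, 0.0], "/docs/a.pdf"),
    ("A sentence about something else.", [0.0, 1.0, 0.0], "/docs/b.pdf"),
    ("A sentence that is half about OpenSearch.", [0.5, 0.5, 0.0], "/docs/c.pdf"),
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, records, k=3):
    """Return the k records closest to query_vector as (score, sentence, path)."""
    scored = [
        (cosine_similarity(query_vector, vector), sentence, path)
        for sentence, vector, path in records
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Steps 1 and 2: the query string has already been turned into a vector.
results = most_similar([1.0, 0.0, 0.0], RECORDS, k=2)
# Step 3: the file to open is the path of the closest result.
closest_path = results[0][2]
```

This compares the query against every record, which is fine for small collections; an approximate nearest-neighbour index exists precisely to avoid that full scan.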
Do you have a recommendation for how I could go about this? Is OpenSearch the right tool for this? I am very early in my research into how to make use of OpenSearch, so please forgive me if I am missing something obvious.
The link below describes three methods that do, in a way, what you want using the k-NN feature of OpenSearch. However, you must create your index so that the file content is stored as a k-NN vector. The calling application should then convert the search query into a k-NN vector and pass it to the k-NN query. In other words, OpenSearch provides vector similarity search, but how you use it is up to you.
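As a rough sketch of what that could look like, the helpers below build an index definition with a `knn_vector` field and an approximate k-NN query body as plain Python dictionaries. The field names (`sentence`, `file_path`, `sentence_vector`) and the helper names are assumptions for illustration; the vector dimension has to match whatever your embedding step produces.

```python
def build_index_body(dimension, vector_field="sentence_vector"):
    """Index definition with k-NN enabled and one knn_vector field.

    Field names here are illustrative assumptions, not required names.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "sentence": {"type": "text"},
                "file_path": {"type": "keyword"},
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def build_knn_query(query_vector, k=5, vector_field="sentence_vector"):
    """Approximate k-NN search body: the k nearest neighbours of query_vector."""
    return {
        "size": k,
        "query": {
            "knn": {vector_field: {"vector": list(query_vector), "k": k}}
        },
    }


index_body = build_index_body(dimension=4)
query_body = build_knn_query([0.23, 0.64, 0.0, 0.34], k=3)
```

The two bodies would be sent with whichever client you use, for example as the `body` of an index-creation call and of a search call. Each hit's `_source` would then carry the `file_path` to open.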
The k-NN plugin seems to be exactly what I am looking for. I can’t seem to find info on how the plugin turns a piece of text into a vector. You mention it could replace my pre-processing via numpy, @searchymcsearchface.
My original approach was for the calling application to pass the feature vector and let the k-NN plugin’s approximate search do the heavy lifting. That way I can tweak the feature vector extraction with Python/spaCy.
What I was secretly hoping for is an example app that does k-NN-based text recommendation using OpenSearch. It must be a basic use case, or am I mistaken?
Thank you @asfoorial. Haystack seems to be exactly what I am looking for. How have I missed this!!!
Re your initiative to extend OpenSearch with Haystack: I also think it’d be a milestone, but I’d have to run some benchmarks on pure-Python Haystack first. Will keep you updated here. Let me know how I can help; I do have some experience with embedding Python environments in C++, but not in Java.
I think the way to approach it is to have a separate Python environment running side by side with OpenSearch. We will also need to develop APIs as follows:
This API receives text (and optionally an OpenSearch query), performs semantic search, and returns OpenSearch doc IDs and sentence highlights (the highlights in this case would be the matching sentences).
Of course, the whole Haystack environment would use OpenSearch as the backend for data indexing.
The challenge here is how to integrate OpenSearch security with Haystack so that users cannot call the Haystack APIs directly, but only through the OpenSearch authentication/SSL layer. Perhaps we can host Python inside the OpenSearch nodes and let OpenSearch communicate with it locally. Then we would need to worry about resource sharing between the two!
This can be a REST API plugin that receives two parameters: a free-text query and an OpenSearch query to perform initial filtering. These two can then be passed directly to the Haystack API from the first point to do the magic.
In fact, one of the main challenges that I will face when using Haystack with OpenSearch is related to data security (specifically document level security).
I have a document repo with an authorization layer specifying which user can access which document. Currently I use OpenSearch to index all these documents while reflecting the same authorization rules defined at the source.
Now I can use Haystack to index all those documents, but would I be able to reflect the same security as at the source? OpenSearch is already capable of doing that, but can Haystack instruct it to do so?
Also, since BERT-based models mostly work at the sentence/paragraph level, how would I pass a complete set of multi-page documents to them?
My other question would be: do you need a fully fledged question answering (QA) system on top, or just a better ‘semantic search’, or something else?
Re BERT and whole documents - you can just use the “Elastic” Retriever that we have (using BM25) if you don’t want/can’t reindex, and then it’s the job of the reader to split it further. This would also be slower than if fully reindexed first with Haystack. (That’s mostly about a QA pipeline).
If you can reindex, then you have the choice of using the preprocessor to split the documents into ‘passages’, which will speed things up for the reader. Again, if you’d like to leverage the metadata filtering, that would require reindexing too.
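To illustrate what splitting into passages means, here is a minimal stand-alone sketch that cuts a long text into overlapping word windows. It is not Haystack’s preprocessor, only an illustration of the idea; the function name and the default window sizes are assumptions.

```python
def split_into_passages(text, max_words=100, overlap=20):
    """Split text into passages of at most max_words words.

    Consecutive passages share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one passage when it is short.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return passages


sample = " ".join(f"w{i}" for i in range(10))
passages = split_into_passages(sample, max_words=4, overlap=1)
```

Each passage would then be indexed as its own document, with the original document ID and any metadata copied alongside so that filtering still works.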
I’m hoping that answers your question above. Happy to try to elaborate on it here, or you’re also welcome to chat in the Haystack community channel when it’s the right time.
I am using the OpenSearch security plugin, which lets me define security roles that apply at the index level or the document level. In simple terms, I have an index my_docs that has content, accessors_names, and other metadata as fields. I also have a role with document-level security that checks whether the calling user is listed in the accessors_names field of a particular document.
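For readers who have not set this up, a role of that kind might look roughly like the sketch below, built here as a Python dictionary. It assumes accessors_names is a keyword field holding user names and uses the security plugin’s `${user.name}` substitution in the document-level security (DLS) query; the helper name and the choice of the `read` action group are illustrative, not my exact configuration.

```python
import json


def build_dls_role(index_pattern="my_docs", accessors_field="accessors_names"):
    """Role body that limits reads to documents listing the calling user.

    Assumption: accessors_field is a keyword field containing user names.
    """
    # ${user.name} is replaced with the authenticated user's name at query time.
    dls_query = {"term": {accessors_field: "${user.name}"}}
    return {
        "index_permissions": [
            {
                "index_patterns": [index_pattern],
                # The DLS query is stored as a JSON string, not a nested object.
                "dls": json.dumps(dls_query),
                "allowed_actions": ["read"],
            }
        ]
    }


role_body = build_dls_role()
```

The body would be sent to the security plugin’s roles API, and the role then mapped to the relevant users.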
I want to extend my current setup with QA and semantic search capabilities. I am open to rebuilding the indices.
What would the answer be? Is it extractive or generative? That is, does Haystack return documents, or does it generate sentences that could come from multiple documents?
How would 1 work when it comes to the security described above? Does Haystack interact with OpenSearch security in any way? (Seems to me that is a must)
Is it possible to let OpenSearch call Haystack instead? OpenSearch would do the initial BM25 keyword search and then pass the results to Haystack to run QA on the resulting documents (I expect 10-100 at most). This way, security is already enforced by OpenSearch, because we know that all the resulting docs are accessible to the calling user.
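A rough sketch of the flow I have in mind, with the network calls left out: build a BM25 `match` query, run it with the calling user’s credentials so that document-level security filters the hits, then turn the hits into candidate texts for the QA step. The `content` field comes from my index; the helper names and the sample response are made up for illustration, and the QA step itself is not shown.

```python
def build_bm25_query(question, size=10, text_field="content"):
    """Keyword (BM25) search body; run it as the calling user so DLS applies."""
    return {"size": size, "query": {"match": {text_field: question}}}


def hits_to_candidates(search_response, text_field="content"):
    """Turn a search response into a list of candidate documents for QA."""
    candidates = []
    for hit in search_response["hits"]["hits"]:
        candidates.append(
            {
                "id": hit["_id"],
                "score": hit["_score"],
                "text": hit["_source"].get(text_field, ""),
            }
        )
    return candidates


# Hypothetical response, shaped like a standard search response.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_score": 2.5, "_source": {"content": "First text."}},
            {"_id": "doc-2", "_score": 1.1, "_source": {"content": "Second text."}},
        ]
    }
}

query_body = build_bm25_query("what is document level security?", size=50)
candidates = hits_to_candidates(sample_response)
```

Because only documents the user is allowed to see ever reach `candidates`, the QA component would not need its own authorization logic.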
That is why I mentioned earlier in this thread and other threads that Haystack can become an internal/external component that OpenSearch can leverage.
I am also not sure if Haystack has any data authorization features at this point since it does not store data by itself.
Currently we don’t use any of it, no. But that’s a great feature request, so thanks.
I do not think this is currently possible.
Re data authorization features: not really, aside from maybe leveraging the metadata and a basic separation by index. Something for us to look into more. We also try to be a bit flexible with the document stores, but I can totally see how leveraging certain advanced aspects of a particular docstore might be beneficial.
Great! Thanks for the update. I will give it a try.
I also need to find a proper way to make the two work together in an enterprise setup where all security, authentication and authorization measures are met while maintaining the expected user experience from word search and semantic search.