I’m trying to find the best way to solve a problem.
We have an index which contains documents containing domain knowledge. This domain knowledge is made up of various solutions to problems, articles covering important topics, etc.
What I’m trying to do is take a person’s description of a problem they are experiencing and suggest documents from the index that might help them solve it. Think of a self-service portal where a customer enters a description of the problem they are having, and we use that description to find articles that might help them solve it.
The descriptions we need to query against could be as simple as:
Help! I'm trying to log in and keep getting an error.
Thanks!
Or could be more complex with lots of details:
I've been trying to log into our portal and keep getting the following error:
"An unexpected problem has occurred and you cannot be logged in at this time."
I've tried going through the "Forgot password" process, but that is not working either.
I was able to log in fine the other day, so I'm not sure why I'm having issues now.
Can someone please help me?
My first thought was to implement this using a more_like_this query, but that is not giving me great results. Since it’s designed to find similarities between two pieces of text, it ends up putting too much weight on common words, so the results are poor.
Since that was not working well, I tried a different approach: using the Term Vectors API to extract important keywords from the user’s problem description and then searching on those keywords. My current strategy is to analyze the terms of the user’s problem description against the “title” field of our domain knowledge documents. Since titles tend to contain the more dominant keywords (like product names), this approach seems to give much better results. Basically my approach is:
- Send the user’s description of the problem to the “_termvectors” endpoint.
- Only search the document’s title. My DSL looks like:
{
"doc": {
"title": "Help! I'm trying to log in and keep getting an error. Thanks!"
}
, "fields": ["title"]
, "term_statistics": true
, "field_statistics": true
, "positions": false
, "offsets": false
, "filter": {
"max_num_terms": 25
, "min_term_freq": 1
, "min_doc_freq": 10
, "min_word_length": 4
}
}
- I then take the vector terms returned and build a “query_string” search:
{
"size": 10
, "query": {
"query_string": {
"query": "article:(login error)"
, "minimum_should_match": 2
}
}
}
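The two steps above can be glued together roughly as follows. This is only a minimal sketch: the function name `build_query_from_termvectors` and its parameters are hypothetical, and it assumes the `_termvectors` response has the usual `term_vectors.<field>.terms` shape with a per-term `score` when a `filter` is supplied. It does no network I/O; it only turns an already-fetched response into a search body.

```python
from typing import Any


def build_query_from_termvectors(
    response: dict[str, Any],
    vector_field: str,
    search_field: str,
    max_terms: int = 5,
    min_should_match: int = 2,
    size: int = 10,
) -> dict[str, Any]:
    """Turn a _termvectors response into a query_string search body.

    Hypothetical helper; the response shape is an assumption.
    """
    terms = response.get("term_vectors", {}).get(vector_field, {}).get("terms", {})
    # Rank terms by their "score"; terms without one sort last.
    ranked = sorted(
        terms.items(), key=lambda kv: kv[1].get("score", 0.0), reverse=True
    )
    keywords = [term for term, _ in ranked[:max_terms]]
    if not keywords:
        # Nothing survived the filter, so match nothing rather than everything.
        return {"size": size, "query": {"match_none": {}}}
    return {
        "size": size,
        "query": {
            "query_string": {
                # Terms are assumed to be plain tokens; reserved query_string
                # characters would need escaping before being joined here.
                "query": f"{search_field}:({' '.join(keywords)})",
                # Never require more matches than there are keywords.
                "minimum_should_match": min(min_should_match, len(keywords)),
            }
        },
    }
```

Called with `vector_field="title"` and `search_field="article"`, this produces a body of the same shape as the search shown above.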
While this definitely works better, the results still are not good enough in a lot of cases.
Is there a better way to approach this issue?
I could feed the text into a generative LLM and try to extract a search phrase from the content, but that’s slow. It feels like this is a problem that has already been solved with OpenSearch, without LLMs.
Is there a better way to extract critical keywords from the text?
Ideally, unique product names would be identified as critical keywords and carry more weight. Common words (which are not stop words) would be ignored or carry very little weight. So if we look at our simple example of “Help! I'm trying to log in and keep getting an error. Thanks!”, we would key in on “log in” and “error” as the critical keywords we should be trying to find. Words like “keep” and “getting” are not really going to help you find solutions.
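To make the weighting described above concrete, here is a rough sketch that derives a per-term boost from inverse document frequency. It assumes the `_termvectors` request was sent with `term_statistics` and `field_statistics` enabled, so that each term carries `doc_freq` and the field carries `doc_count`. The helper name `boosted_query_terms`, the `min_boost` cutoff, and the specific IDF formula are illustrative assumptions, not what OpenSearch uses internally.

```python
import math
from typing import Any


def boosted_query_terms(
    response: dict[str, Any], field: str, min_boost: float = 1.0
) -> str:
    """Build a query_string fragment like 'rare^4.62 common^1.10'.

    Hypothetical helper; the IDF formula and cutoff are assumptions.
    """
    section = response["term_vectors"][field]
    doc_count = section["field_statistics"]["doc_count"]
    weighted = []
    for term, stats in section["terms"].items():
        # Rarer terms (low doc_freq) get a larger weight.
        idf = math.log(1 + doc_count / (stats["doc_freq"] + 1))
        if idf < min_boost:
            # Very common words such as "keep" are dropped entirely.
            continue
        weighted.append((idf, term))
    # Highest-weighted terms first.
    weighted.sort(reverse=True)
    return " ".join(f"{term}^{idf:.2f}" for idf, term in weighted)
```

The returned string uses the `term^boost` syntax of `query_string`, so it could be placed inside the parentheses of a query such as `article:(...)`. Whether this kind of weighting actually improves relevance would need to be checked against real queries.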
Looking forward to hearing any suggestions on how to solve my problem efficiently!
Thanks!