Finding documents based on text content

I’m trying to find the best way to solve a problem.

We have an index which contains documents containing domain knowledge. This domain knowledge is made up of various solutions to problems, articles covering important topics, etc.

What I’m trying to do is take a person’s description of a problem they are experiencing and then suggest some documents from the index which might help them solve the issue. Think of a self-service portal where a customer enters a description of the problem they are having, and we take that description and try to find articles that might help them solve it.

The descriptions we need to query against could be as simple as:

Help! I'm trying to log in and keep getting an error.

Thanks!

Or could be more complex with lots of details:

I've been trying to log into our portal and keep getting the following error:

"An unexpected problem has occurred and you cannot be logged in at this time."

I've tried going through the "Forgot password" process, but that is not working either. 

I was able to log in fine the other day, so I'm not sure why I'm having issues now.

Can someone please help me?

My first thought was to implement this using a more_like_this query, but that is not getting me great results. Given that it’s designed to find similarities between two pieces of text, it ends up putting too much weight on common words, so the results are poor.
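
For reference, a more_like_this request of the kind described above might look like the sketch below. It is only a minimal illustration: the field names `title` and `article` are taken from the queries later in this post, while the `max_doc_freq` value and the other thresholds are assumed numbers that would need tuning against the real index. `max_doc_freq` is the more_like_this parameter that drops terms appearing in too many documents, which may reduce the weight given to common words.

```python
def build_mlt_query(description, fields=("title", "article"), max_doc_freq=5000):
    """Build a more_like_this search body for a free-text problem description.

    All thresholds here are illustrative assumptions, not tuned values.
    """
    return {
        "size": 10,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": description,
                # Descriptions are short, so most terms occur only once;
                # the default min_term_freq of 2 would discard them.
                "min_term_freq": 1,
                "min_doc_freq": 5,
                # Ignore terms that occur in more than this many documents.
                "max_doc_freq": max_doc_freq,
                "min_word_length": 4,
                "max_query_terms": 12,
                "minimum_should_match": "30%",
            }
        },
    }


body = build_mlt_query("Help! I'm trying to log in and keep getting an error. Thanks!")
```

The resulting dictionary can be sent as the body of a `_search` request with whatever client is already in use.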

Since that was not working well, I tried a different approach: using the Term Vectors API to extract important keywords from the user’s problem description and then searching on those keywords. My current strategy is to run the user’s problem description through the API and analyze the resulting term vectors against the “title” field of our domain knowledge documents. Since the titles tend to contain more dominant keywords (like product names), this approach seems to be giving much better results. Basically my approach is:

  1. Run the user’s description of the problem to the “_termvectors” endpoint.
  2. Only search the document’s title. My DSL looks like:
{
	"doc": {
		"title": "Help! I'm trying to log in and keep getting an error. Thanks!"
	}
	, "fields": ["title"]
	, "term_statistics": true
	, "field_statistics": true
	, "positions": false
	, "offsets": false
	, "filter": {
		  "max_num_terms": 25
		, "min_term_freq": 1
		, "min_doc_freq": 10
		, "min_word_length": 4
	}
}
  3. I then take the vector terms returned and build a “query_string” search:
{
	"size": 10
	, "query": {
		"query_string": {
			  "query": "article:(login error)"
			, "minimum_should_match": 2
		}
	}
}
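
The glue between the two requests above could be sketched roughly as follows. This assumes the `_termvectors` response has the usual shape, with terms under `term_vectors.<field>.terms` and a `score` on each term when a `filter` is supplied. The helper names `top_terms` and `build_query_string` are hypothetical, and the sample response values are made up for illustration.

```python
import re


def top_terms(termvectors_response, field="title", limit=5):
    """Return the highest-scoring terms from a _termvectors response."""
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    ranked = sorted(
        terms.items(), key=lambda kv: kv[1].get("score", 0.0), reverse=True
    )
    return [term for term, _ in ranked[:limit]]


def build_query_string(terms, field="article", minimum_should_match=2):
    """Build a query_string search body from a list of extracted terms."""
    # Keep only plain alphanumeric terms, lowercased, so that query_string
    # syntax characters and operators (AND, OR, NOT) cannot leak into the query.
    safe = [t.lower() for t in terms if re.fullmatch(r"[A-Za-z0-9]+", t)]
    if not safe:
        raise ValueError("no usable terms extracted")
    return {
        "size": 10,
        "query": {
            "query_string": {
                "query": f"{field}:({' '.join(safe)})",
                # Never require more matches than there are terms.
                "minimum_should_match": min(minimum_should_match, len(safe)),
            }
        },
    }


# Made-up response fragment, for illustration only.
sample = {
    "term_vectors": {
        "title": {
            "terms": {
                "keep": {"score": 1.3},
                "login": {"score": 7.1},
                "error": {"score": 4.2},
            }
        }
    }
}
search_body = build_query_string(top_terms(sample, limit=2))
```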

While this definitely works better, the results still are not good enough in a lot of cases.

Is there a better way to approach this issue?

I could feed the text into a generative text LLM and try to extract a search phrase from the content, but that’s slow. It feels like I’m trying to solve a problem that has already been solved with OpenSearch without LLMs.

Is there a better way to extract critical keywords from the text?

Ideally, unique product names would be identified as critical keywords and carry more weight. Common words (even ones that are not stop words) would be ignored or carry very little weight. So if we look at our simple example of “Help! I'm trying to log in and keep getting an error. Thanks!”, we would key in on “log in” and “error” as the critical keywords we should be trying to find. Words like “keep” and “getting” are not really going to help you find solutions.

Looking forward to hearing any suggestions on how to solve my problem efficiently!
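
One way to express "rare terms carry more weight, common terms are ignored" is to weight each extracted term by its inverse document frequency. The sketch below assumes the per-term `doc_freq` and the field's `doc_count` have already been read from a `_termvectors` response (with `term_statistics` and `field_statistics` enabled). The function names, the `min_weight` cutoff, and the sample counts are all hypothetical. The formula is the BM25-style IDF, used here only as a convenient weighting, not as a claim about how the index scores internally.

```python
import math


def idf(doc_freq, doc_count):
    """BM25-style inverse document frequency: rare terms score higher."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def weighted_should_query(term_doc_freqs, doc_count, field="article", min_weight=1.0):
    """Build a bool/should query that boosts each term by its IDF weight.

    Terms whose weight falls below min_weight (an assumed cutoff) are dropped.
    The terms are expected to be analyzed tokens, as returned by _termvectors,
    because a term query is not analyzed again at search time.
    """
    clauses = []
    for term, doc_freq in term_doc_freqs.items():
        weight = idf(doc_freq, doc_count)
        if weight < min_weight:
            continue  # too common to help find a solution
        clauses.append({"term": {field: {"value": term, "boost": round(weight, 2)}}})
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


# Made-up document frequencies for an index of 1000 documents.
body = weighted_should_query({"login": 40, "error": 150, "keep": 900}, doc_count=1000)
```

With these made-up counts, "keep" falls below the cutoff and is dropped, while "login" receives a larger boost than "error".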

Thanks!

Hello Dan.

  • The traditional approach for this problem: combined_field with a very high boost for the title field and a low weight for the article field.
  • You have already stepped into query processing; title matches (or specific terms) could also be made mandatory via bool/must.
  • Then you can use a stopwords filter to exclude common terms.
  • Overall, you use occurrence in the title field as a sign of significance, but any other linguistic model could do that job at offline, indexing, or query time (pick any). For example, here’s just one of the Google results: KeyBERT (GitHub: MaartenGr/KeyBERT, “Minimal keyword extraction with BERT”). I haven’t tried it; it’s just an idea.
  • What the industry is moving to: embeddings/vector search/kNN. Pick a good embeddings model/provider (I think a good one has separate models for query and document embeddings). Process the docs and store their embeddings; then turn the query into a vector, find the closest vectors, and treat them as the answer.
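
Two of the suggestions above can be sketched as request bodies. This is a rough illustration under stated assumptions: the boost value is arbitrary, the `embedding` field name is hypothetical and would have to be mapped as a `knn_vector` field in a k-NN enabled index, and the query vector would come from whichever embeddings model is chosen. The first function uses plain `match` clauses inside a bool query rather than the combined-fields approach mentioned in the first bullet.

```python
def title_weighted_query(description, title_boost=5.0, require_title_match=False):
    """Search title and article, weighting title matches much more heavily.

    With require_title_match=True the title clause moves into bool/must,
    so documents without a title match are excluded entirely.
    """
    title_clause = {"match": {"title": {"query": description, "boost": title_boost}}}
    article_clause = {"match": {"article": {"query": description}}}
    if require_title_match:
        bool_query = {"must": [title_clause], "should": [article_clause]}
    else:
        bool_query = {
            "should": [title_clause, article_clause],
            "minimum_should_match": 1,
        }
    return {"size": 10, "query": {"bool": bool_query}}


def knn_query(query_vector, vector_field="embedding", k=10):
    """Build an OpenSearch k-NN query body for a precomputed query embedding.

    vector_field is a hypothetical field name holding document embeddings.
    """
    return {
        "size": k,
        "query": {"knn": {vector_field: {"vector": list(query_vector), "k": k}}},
    }


lexical = title_weighted_query("log in error", require_title_match=True)
semantic = knn_query([0.12, -0.03, 0.88], k=5)
```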

Thanks for the feedback! It’s very much appreciated.

Do you know of any well-vetted English stop word lists that are geared towards this type of work?

Hm … I have no idea; I just had some fun with ChatGPT:

  1. the - Used for definite articles but often does not add specific information.

  2. is - A common verb for forming the present tense.

  3. of - Used to show belonging or composition, typically non-informative.

  4. and - A conjunction used to connect words or phrases.

  5. in - Used to indicate inclusion, location, or position within limits.

  6. to - Used for expressing motion in the direction of (a particular location).

  7. a - Indefinite article used before nouns.

  8. that - Can function as a demonstrative pronoun, relative pronoun, or conjunction.

  9. it - Typically refers back to a previously mentioned or understood object or subject.

  10. you - Second person pronoun, very common in any form of communication.

  11. for - Used to indicate the purpose of an action or object.

  12. on - Indicates position above something and in contact with it.

  13. with - Used to indicate having or possessing something.

  14. as - Used to describe the function or role of a person or thing.

  15. I - First person pronoun, common in conversations.

  16. be - An auxiliary verb used in various grammatical constructions.

  17. have - Indicates possession or necessity.

  18. this - Demonstrative pronoun often used to specify a noun.

  19. but - Used to introduce something contrasting with what has already been mentioned.

Nouns

  20. problem - Often used to generally describe an issue without specifying details.

  21. issue - Similar to “problem,” it refers to something that isn’t working as expected.

  22. system - A general term for the set of hardware and software being discussed.

  23. error - Used to describe a fault or problem, but often without specific details.

  24. support - Refers to help or assistance.

  25. service - Often used to describe the support or functionality provided.

  26. customer - The person receiving support.

  27. team - Refers to the group of support agents or technical staff.

  28. information - A broad term for data provided by the customer or needed to solve an issue.

  29. account - Refers to the customer’s personal or user account but is often nonspecific.

Verbs

  1. help - A common verb in support, indicating the act of providing assistance.
  2. fix - Used to describe the act of resolving a problem but without detailing how.
  3. resolve - Similar to “fix,” it indicates finding a solution to an issue.
  4. check - Instructs or describes the act of investigating or examining something.
  5. update - Refers to refreshing or making changes to software or information.
  6. restart - Often advised as a first step to clear system errors.
  7. connect - Used when discussing network issues or linking devices.
  8. install - The act of setting up software or hardware.
  9. configure - Describes setting up or arranging systems or applications in a specific way.
  10. access - Describes the ability to enter or use a system or data.
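
If a list like this were adopted, one possible way to apply it is a custom analyzer with two `stop` token filters: one using the built-in `_english_` list and one using the domain-specific words. The sketch below only builds the index settings body. The names `english_stop`, `support_stop`, and `support_text` are hypothetical, and which words belong in the custom list is a judgment call to be tested against real queries.

```python
def stopword_analysis_settings(extra_stopwords):
    """Build index settings with an analyzer that removes extra stop words."""
    # Normalize to lowercase and de-duplicate, since the filter runs
    # after the lowercase filter in the chain below.
    custom_words = sorted({word.lower() for word in extra_stopwords})
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "support_stop": {"type": "stop", "stopwords": custom_words},
                },
                "analyzer": {
                    "support_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "english_stop", "support_stop"],
                    }
                },
            }
        }
    }


settings = stopword_analysis_settings(["Help", "problem", "issue", "help"])
```

The analyzer would then be referenced from the text field mappings, or used only as a search-time analyzer so that the indexed documents keep all of their terms.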

Thanks!