[RFC] Search User Behavior Logging and Data Reuse for Relevance

markcohen · January 20, 2023, 3:27pm

We could use your input on this RFC please!

We believe there is a huge opportunity to refine search results, both manually and automatically, by collecting data from outside OpenSearch (clickstream data) and inside it (analyzers, rewrites, reranking, etc.), and we would like to hear from users about how this might benefit them. What pain points would you expect in combining internal OpenSearch data with clickstream analytics?

github.com/opensearch-project/OpenSearch

[RFC] Search User Behavior Logging and Data Reuse for Relevance

opened 04:08PM - 28 Sep 22 UTC

macohen

feature Indexing & Search

## What/Why

### What are you proposing?

Currently, there is no way for users of OpenSearch to get a full picture of how search is being used without building their own logging and metrics collection system. This is a request for comments to the community to discuss needs for a standardized logging schema & collection mechanism. We want to work with the community to understand where we can make the most impactful improvements to help the most users in understanding how search is used in their applications and how they can tune results most effectively.

We believe that application builders using OpenSearch for e-commerce, product, and document-based search have a common set of needs in how they collect and expose data for analytics and reuse. Regarding analytics, we believe builders, business users, and relevance engineers want to see metrics out of the box for any search application like top queries, top queries resulting in a high value action (HVA - like a purchase, stream, download, or whatever the builder defines), top queries with zero results, top abandoned queries, as well as more advanced analytics like similar queries in the long tail that may be helped by synonyms, query rewrites/expansion or other relevance tuning techniques. This same data can also be re-used to feed manual judgement and automated learning to improve relevance in the index.

### What users have asked for this feature?

_Highlight any research, proposals, requests or anecdotes that signal this is the right thing to build. Include links to GitHub Issues, Forums, Stack Overflow, Twitter, Etc_

### What problems are you trying to solve?

_Template: When \<a situation arises>, a \<type of user> wants to \<do something>, so they can \<expected outcome>. (Example: When **searching by postal code**, **a buyer** wants to **be required to enter a valid code** so they **don’t waste time searching for a clearly invalid postal code.**)_

* When any search results are returned, search application builders want to report on the top requested queries so that they can learn about what their users intend to find.
* When users search for content, a search relevance engineer wants to feed behavioral data back into the search system for automatic reranking.
* When users search for content, a search relevance engineer wants to feed behavioral data back into the search system for manual tuning of search results.

### What is the developer experience going to be?

_Does this have a REST API? If so, please describe the API and any impact it may have to existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI or Configuration changes that are planned as part of this feature._

* Allow the user to submit an optional field containing the original, user typed query. Track that original query through all steps of querying the index: user typed 1) query -> 2) rewritten query -> 3) results from OpenSearch -> 4) reranked results outside of OpenSearch -> 5) actions taken by the end users (query again, abandon search, some other high value action).
* Initially, we are focused on adoption so even if we started from the inside out with #2 and #3 above, it would be helpful. The API change would be providing a place in the query DSL to optionally submit the original query. We could build that in as well, but only include it in logging and analysis if it is there.

#### Are there any security considerations?

_Describe if the feature has any security considerations or impact. What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here._

* New data will be logged inside OpenSearch. Possible injection attacks could occur.

#### Are there any breaking changes to the API?

_If this feature will require breaking changes to any APIs, outline what those are and why they are needed. What is the path to minimizing impact? (example, add new API and deprecate the old one)_

### What is the user experience going to be?

_Describe the feature requirements and or user stories. You may include low-fidelity sketches, wireframes, APIs stubs, or other examples of how a user would use the feature via CLI, OpenSearch Dashboards, REST API, etc. Using a bulleted list or simple diagrams to outline features is okay. If this is net new functionality, call this out as well._

#### Are there breaking changes to the User Experience?

_Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?_

* No breaking changes

### Why should it be built? Any reason not to?

_Describe the value that this feature will bring to the OpenSearch community, as well as what impact it has if it isn't built, or new risks if it is. Highlight opportunities for additional research._

* Building this feature will standardize a set of reporting and data collection needs that are common across search applications and allow software engineers and relevance engineers to focus on higher level concerns out of the box like tuning queries, query rewriting, synonyms, and results reranking.
* If it isn't built, users will either have no insight into search results and how to tune them, or they will keep building analytics and data collection applications without understanding what is happening inside OpenSearch.
* If it is built, one technical concern is the trade-off between adding latency to OpenSearch and adding complexity to the platform. Logging every request and each step like rewrites, results returned from the index, reranking, and HVAs could have an impact on an OpenSearch cluster if we decide to do all of this in OpenSearch. On the other hand, adding a whole new set of infrastructure to deal with this level of data collection, even with a separate OpenSearch cluster, adds complexity to the architecture.

### What will it take to execute?

_Describe what it will take to build this feature. Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?_

### Any remaining open questions?

_What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation._

#### Questions for the Community

* Do you have first (homegrown) or third party analytics tools like Google Analytics, Adobe, or others? Would it make sense for us to connect the logging and metrics we propose to deliver inside OpenSearch with the clickstream/application metrics you have in those other systems?

#### Review & Validate this Proposal for tracking data through OpenSearch: https://github.com/opensearch-project/search-relevance/issues/12
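
To make the discussion more concrete, here is a minimal Python sketch of what one logged search could look like if it followed the five tracking steps described in the RFC, together with the out-of-the-box metrics it mentions (top queries, zero-result queries, abandoned queries, HVA queries). None of this is an existing OpenSearch API or an agreed schema: the `SearchEvent` class, its field names, the `action` values, and the rule that "abandoned" simply means `action == "abandon"` are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchEvent:
    """One logged search (hypothetical schema, not an OpenSearch format)."""
    original_query: str                                     # 1) query as typed by the user
    rewritten_query: Optional[str] = None                   # 2) query after rewriting, if any
    result_ids: List[str] = field(default_factory=list)     # 3) results returned by OpenSearch
    reranked_ids: List[str] = field(default_factory=list)   # 4) results after reranking outside OpenSearch
    action: str = "abandon"                                 # 5) assumed values: "requery", "abandon", "hva"


def top_queries(events, n=10):
    """Most frequent user-typed queries."""
    return Counter(e.original_query for e in events).most_common(n)


def top_zero_result_queries(events, n=10):
    """Most frequent queries for which OpenSearch returned no results."""
    return Counter(e.original_query for e in events if not e.result_ids).most_common(n)


def top_abandoned_queries(events, n=10):
    """Most frequent queries after which the user abandoned the search (assumed definition)."""
    return Counter(e.original_query for e in events if e.action == "abandon").most_common(n)


def top_hva_queries(events, n=10):
    """Most frequent queries that ended in a high value action (HVA)."""
    return Counter(e.original_query for e in events if e.action == "hva").most_common(n)


# Made-up sample data, only to exercise the functions above.
events = [
    SearchEvent("red shoes", "red shoe", ["a", "b"], ["b", "a"], "hva"),
    SearchEvent("red shoes", "red shoe", ["a", "b"], ["b", "a"], "abandon"),
    SearchEvent("blu jacket", "blu jacket", [], [], "requery"),
    SearchEvent("blue jacket", "blue jacket", ["c"], ["c"], "hva"),
]
```

Keeping the user-typed query on every record is what lets steps 2 through 5 be joined back to the same search later. Whether such records would live inside the serving cluster or in separate infrastructure is exactly the latency versus complexity trade-off the RFC raises, and this sketch does not take a position on it.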
