## Problem Statement
Currently, OpenSearch lacks a direct means of providing insights into “top queries” that have a significant impact on latency and resource usage (such as CPU consumption and memory usage). At present, users can utilize [OpenSearch's slow log](https://opensearch.org/docs/latest/monitoring-your-cluster/logs/#slow-logs) to gain a certain degree of insight into the execution time of slow queries. However, the slow log offers limited context for those queries: it provides no way to trace back to the original request and the corresponding full query body, and it does not cover any resource usage dimensions. Additionally, the current slow query log doesn't provide insights into resource usage and latency for **_individual_** phases of query execution, preventing answers to critical questions like, "Which phase of the slow search query consumes the most time?" Moreover, slow query logs lack an aggregated view of the "**Top N slow queries by a specific resource usage or latency**" over a specified time frame. These limitations make it challenging for OpenSearch users to identify, diagnose, and correlate slow queries, and to further optimize those queries to prevent negative impacts on overall system health.
## Objective
The objective of this RFC is to propose and define the architecture, components, and key features of the “Top N queries” framework, which aims to provide real-time insights into the most impactful queries based on resource usage and latency. By aggregating crucial metrics such as query latency, with planned expansion to CPU consumption and memory usage in future iterations, this feature will empower users to identify and optimize resource-intensive queries, thereby enhancing the debugging and tracing experience.
The framework's scope encompasses the following key aspects:
* Implementation of a **priority queue-based** in-memory data store, with configurable window size, on the coordinator node, designed to efficiently store the top N queries. The data model of the stored query attributes should be highly extensible for different types of resources and metrics.
* Introduction of generic and extensible metrics collection workflows capable of gathering per-request latency and resource usage metrics from diverse sources. The initial focus will be on implementing a generic collection method for query _**latency**_ metrics. Future iterations will extend this method to additional metrics such as CPU usage, leveraging the per-request resource collection capability from the [Resource Tracking Framework](https://github.com/opensearch-project/OpenSearch/issues/1179).
* Support for exporting the top N query results to different sinks on a predefined schedule (push model). Initial iterations will include a basic sink that logs to disk, while subsequent versions will introduce support for advanced sinks such as OpenTelemetry (OTel) or other databases.
* Support for related APIs to facilitate querying of top N query results (pull model). Providing an API to fetch this data offers users greater flexibility for analysis, with future potential for integration into dashboards, enhancing usability and analysis capabilities.
With the above framework components, a typical workflow for the top N query feature would be:
1. Per-request metrics (latency and resource usage) are collected and sent to the in-memory data store on the coordinator node. The data store aggregates these metrics, maintaining a record of "queries with top resource usages." Granularity levels, such as per minute, per hour, and per day, can be configured based on user preferences.
2. The "queries with top resource usages" data is published to different sinks on a schedule based on the user's configuration.
3. Users can query the "queries with top resource usages" data via an API for further analysis.
<p align="center">
<img src="https://github.com/opensearch-project/OpenSearch/assets/7891523/acd50987-a048-4a04-a264-ba5e1f89877f" width="600">
</p>
<p align="center">
<img src="https://github.com/opensearch-project/OpenSearch/assets/7891523/fe546f83-17b1-45cd-8aad-5ed6daac520a" width="600">
</p>
## Detailed Solution
### Data model to store query information
The data model designed for storing query information should contain essential data for each query request, such as _search query type, total shards, and indices involved in the search request_. Moreover, the data model should be structured to accommodate customized fields, making it straightforward to add other customized properties in the future for extensibility. Additionally, the data model should feature a comprehensive breakdown of latency and resource usage across individual phases of the query execution. The data models to store query information are outlined below.
```java
import java.util.Map;

// Base class to store request information for a request
abstract class RequestData {
    private SearchType searchType;
    private int totalShards;
    private String[] indices;
    // Property map to store future customized attributes for the
    // request; included for extensibility. For example, we could later
    // add the user account that initiated the request.
    private Map<String, String> propertyMap;
    // Getters and setters.
}

// Child class to store latency information for a request
class RequestDataLatency extends RequestData {
    private Map<String, Long> phaseLatencyMap;
    private Long totalLatency;
    // Getters and setters.
}

// Child class to store CPU resource usage information for a request
class RequestDataCPU extends RequestData {
    // Boxed Float is required here; Java generics cannot take primitive types
    private Map<String, Float> phaseResourceUsageMap;
    private float totalResourceUsage;
    // Getters and setters.
}
```
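For illustration, here is a short hypothetical snippet showing how the extensible `propertyMap` could carry a customized attribute alongside the per-phase latency breakdown. It assumes the getters and setters noted in the class outlines above; all names and values are illustrative only:
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical usage of the data model sketch above.
Map<String, Long> phaseLatencies = new HashMap<>();
phaseLatencies.put("query", 100L);
phaseLatencies.put("fetch", 200L);

RequestDataLatency data = new RequestDataLatency();
data.setPhaseLatencyMap(phaseLatencies);
data.setTotalLatency(300L);

// A customized attribute carried through the extensible property map,
// e.g. the user account that initiated the request.
data.setPropertyMap(Map.of("user_id", "value"));
```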
### In-memory data store on Coordinator node
The proposed data store, `SearchQueryAnalyzer`, is a generic, extensible, priority-queue-based object for tracking the "Top N queries" by a specific resource usage within a defined time window. As previously mentioned, this data store should support exporting data to various sinks and also offer a dedicated method for querying data via an API.
```java
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

abstract class SearchQueryAnalyzer {
    // Number of top queries to keep
    private int topNSize;
    // Window size in seconds
    private int windowSize;
    // Getters and setters.
}

public class SearchQueryLatencyAnalyzer extends SearchQueryAnalyzer {
    private PriorityQueue<RequestDataLatency> topRequests;

    // Async function to ingest latency data for a request
    public void ingestDataForQuery(SearchRequest request, Map<String, String> customizedAttributes, Long latency) {}

    public List<RequestDataLatency> getTopLatencyQueries() {
        return List.copyOf(topRequests);
    }

    public void export(Sink sink) {}
}

public class SearchQueryCPUAnalyzer extends SearchQueryAnalyzer {
    private PriorityQueue<RequestDataCPU> topRequests;

    // Async function to ingest CPU usage data for a request
    public void ingestDataForQuery(SearchRequest request, Map<String, String> customizedAttributes, float cpuUsage) {}

    public List<RequestDataCPU> getTopNCpuUsageQueries() {
        return List.copyOf(topRequests);
    }

    public void export(Sink sink) {}
}
```
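To make the priority-queue behavior concrete, below is a minimal sketch (not the actual implementation) of how an analyzer might keep its queue bounded to `topNSize` elements. The `BoundedTopNQueue` name is hypothetical, and the `getTotalLatency()` getter is assumed from the data model sketch:
```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A min-heap ordered by total latency keeps the smallest of the current
// top N at the head, so an insertion beyond capacity evicts the
// lowest-latency entry in O(log N).
class BoundedTopNQueue {
    private final int topNSize;
    private final PriorityQueue<RequestDataLatency> topRequests;

    BoundedTopNQueue(int topNSize) {
        this.topNSize = topNSize;
        this.topRequests = new PriorityQueue<>(Comparator.comparing(RequestDataLatency::getTotalLatency));
    }

    void offer(RequestDataLatency record) {
        topRequests.offer(record);
        if (topRequests.size() > topNSize) {
            topRequests.poll(); // drop the lowest-latency entry
        }
    }
}
```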
It's crucial to note that the `topN` and `windowSize` values for each analyzer store should be customizable through a configuration file. To enhance flexibility, an API will also be provided, allowing users to dynamically adjust these values without restarting the OpenSearch process. Further details regarding the configuration API are elaborated in the "APIs for Configurations" section below.
### Metrics Collection
A different metrics collection workflow will be implemented for each type of metric. The initial focus will be on the collection workflow for latency metrics (considering latency the primary resource usage dimension in the first iteration). Future iterations can incorporate additional dimensions, such as CPU usage, based on metrics available from the [resource tracking framework](https://github.com/opensearch-project/OpenSearch/issues/1179).
To capture per-request phase latency information, a listener-based approach will be employed. Building on the [newly introduced framework to track search query latency](https://github.com/opensearch-project/OpenSearch/issues/7334), a new component, `SearchRequestResourceListener`, will be defined. Its listener functions will be executed during each phase of a search query, recording latency and storing all relevant information into the `SearchQueryLatencyAnalyzer`. The high-level class design is presented below.
```java
import java.util.HashMap;
import java.util.Map;

public class SearchRequestResourceListener implements SearchRequestOperationsListener {
    private final SearchQueryLatencyAnalyzer analyzer;
    private long startTime;
    // Per-phase latency collected over the lifetime of the request
    private final Map<String, Long> phaseLatencyMap = new HashMap<>();

    public SearchRequestResourceListener(SearchQueryLatencyAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    // Query phase
    void onQueryPhaseStart(SearchPhaseContext context) {
        // record the timestamp when the query phase starts
    }
    void onQueryPhaseEnd(SearchPhaseContext context, long tookTime) {
        // calculate and store the latency for this phase
        phaseLatencyMap.put("query", tookTime);
    }
    void onQueryPhaseFailure(SearchPhaseContext context) {}
    void onRequestStart(SearchPhaseContext context) {
        // Set the start time
        startTime = System.nanoTime();
    }
    void onRequestEnd(SearchPhaseContext context) {
        // Report total and per-phase latency to the analyzer, e.g.
        // analyzer.ingestDataForQuery(request, customizedAttributes, totalLatency)
    }
}
```
### Data export - pull model
To facilitate user access to the "top queries with resource usage" information, we will implement dedicated API endpoints in OpenSearch. The structure of these API endpoints is outlined below:
#### Top Queries - general API
**Endpoint:**
`GET /insights/top_queries`
**Parameters:**
* `type`: The type of metric on which the top N queries are calculated
**Example Usage:**
`GET /insights/top_queries?type=latency`
**Response:**
Upon querying the `GET /insights/top_queries?type=latency` API, the response will be a structured set of data providing insights into the top queries based on latency within the specified parameters:
```js
{
  "timestamp": "2023-11-09T12:30:45Z",
  "top_queries": [
    {
      "query_id": "12345",
      "latency": 600,
      "timestamp": "2023-11-09T12:29:15Z",
      "search_type": "QUERY_THEN_FETCH",
      "indices": [
        "index1",
        "index2"
      ],
      "shards": 5,
      "phases_details": {
        "query": 100,
        "fetch": 200,
        "expand": 300
      },
      // Additional customized attributes specific to the query
      "attributes": {
        "user_id": "value"
      }
    }
    // Additional top query results based on latency
  ]
}
```
#### Top Queries by CPU resource usage (future iterations)
**Endpoint example:**
`GET /insights/top_queries?type=cpu`
**Parameters:**
* `type`: The type of metric on which the top N queries are calculated; in this case, CPU usage
**Response:**
The response will contain details about the top queries based on CPU usage within the specified time window and limit.
```js
{
  "timestamp": "2023-11-09T12:30:45Z",
  "top_queries": [
    {
      "query_id": "12345",
      "cpu_usage": 0.6,
      "timestamp": "2023-11-09T12:29:15Z",
      "search_type": "QUERY_THEN_FETCH",
      "indices": [
        "index1",
        "index2"
      ],
      "shards": 5,
      "phases_details": {
        "query": 0.1,
        "fetch": 0.2,
        "expand": 0.3
      },
      // Additional customized attributes specific to the query
      "attributes": {
        "user_id": "value"
      }
    }
    // Additional top query results based on CPU usage
  ]
}
```
The sequence diagram below illustrates the workflows and detailed interactions among components within the OpenSearch backend when the “Top N queries” feature is enabled. It captures how latency is calculated and stored while OpenSearch processes a search query, and how the results are returned when queried.

### Data export - push model
A Java scheduler can be used to automate the export of the top N queries at scheduled intervals, pushing the data to various sinks. In the initial iteration, we will establish support for a fundamental sink that writes the top N queries to a log file on disk.
Subsequent iterations will introduce more advanced sinks, such as OpenTelemetry (OTel) and SQL/time-series databases, enhancing the export capabilities. This iterative approach allows the feature to evolve and accommodate a broader spectrum of user needs, expanding the export functionality to diverse and advanced storage and analysis solutions.
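As a rough sketch of the scheduling piece, a `ScheduledExecutorService` could drive the periodic export. The `TopQueriesExporter` class is hypothetical, and `Sink` refers to the placeholder type from the class design above:
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wiring for the push model: periodically push the current
// top N queries to the configured sink at a fixed interval.
public class TopQueriesExporter {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(SearchQueryLatencyAnalyzer analyzer, Sink sink, long exportIntervalSeconds) {
        scheduler.scheduleAtFixedRate(
            () -> analyzer.export(sink),
            exportIntervalSeconds, // initial delay
            exportIntervalSeconds, // period, e.g. the export_interval setting
            TimeUnit.SECONDS
        );
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```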
### APIs for Configurations
As previously mentioned, the `topN` and `windowSize` values and the data export interval for each `SearchQueryAnalyzer` store should be highly configurable. This can be done through a configuration file that is read when the OpenSearch process starts. Additionally, an API will be provided to dynamically adjust these values without requiring a restart of the OpenSearch process. The proposed API endpoint to configure `windowSize`, `topN`, and `export_interval` is described as follows:
```js
PUT /_cluster/settings
{
  "persistent": {
    "search.top_n_queries.latency.enabled": true,
    "search.top_n_queries.latency.window_size": 60,
    "search.top_n_queries.latency.top_n_size": 5,
    "search.top_n_queries.latency.export_interval": 30
  }
}
```
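On the backend, these keys could be registered as dynamic cluster settings so that updates issued through the API above take effect without a restart. The sketch below is illustrative only; it uses OpenSearch's `Setting` and `ClusterSettings` classes and assumes the setters from the `SearchQueryAnalyzer` outline:
```java
import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Setting;

// Illustrative registration of the proposed settings as dynamic cluster
// settings; defaults mirror the example request above.
public final class TopNQueriesSettings {
    public static final Setting<Boolean> LATENCY_ENABLED = Setting.boolSetting(
        "search.top_n_queries.latency.enabled", false,
        Setting.Property.NodeScope, Setting.Property.Dynamic);

    public static final Setting<Integer> LATENCY_WINDOW_SIZE = Setting.intSetting(
        "search.top_n_queries.latency.window_size", 60,
        Setting.Property.NodeScope, Setting.Property.Dynamic);

    public static final Setting<Integer> LATENCY_TOP_N_SIZE = Setting.intSetting(
        "search.top_n_queries.latency.top_n_size", 5,
        Setting.Property.NodeScope, Setting.Property.Dynamic);

    // Apply dynamic updates to the analyzer without a restart
    // (assumes the setters from the SearchQueryAnalyzer sketch).
    public static void register(ClusterSettings clusterSettings, SearchQueryLatencyAnalyzer analyzer) {
        clusterSettings.addSettingsUpdateConsumer(LATENCY_WINDOW_SIZE, analyzer::setWindowSize);
        clusterSettings.addSettingsUpdateConsumer(LATENCY_TOP_N_SIZE, analyzer::setTopNSize);
    }
}
```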