Extracting metrics for KPI creation related to search performance

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch 3.1.0
OpenSearch Dashboards 3.1.0
RHEL 8

Describe the issue:

Hello,

I’m looking for some guidance on the most effective approach to extract search performance KPIs from our OpenSearch cluster and would appreciate insights from the community.

Our Environment

  • OpenSearch cluster primarily ingesting logs from multiple applications. Ingest volume per application ranges from 200-500 MB up to 2 TB per 24 hours
  • Users primarily use the platform for troubleshooting and debugging workflows
  • Users typically perform simple searches (query strings, time-based filtering) and view dashboard visualizations
  • Users are, in general, not well equipped to optimize their queries (e.g. most users simply go to Discover and search for a specific query string without filtering on a specific field). These teams use OpenSearch as a means to an end, with little interest in understanding the application itself

We want to establish baseline performance KPIs before implementing cluster optimizations, then measure the impact of changes. The challenge is determining the best methodology for consistent, meaningful performance measurements.

Our initial idea was to keep it relatively simple: pick a sample of X (5-10) index patterns that cover the most relevant time/rollover-based indices and define KPIs for each. Our plan is sketched as follows:

  1. Create a collection of different searches and aggregations for each index, based on:
    1.1 Queries that users use the most
    1.2 Queries/aggregations being used on visualizations

We would then run those queries against each index pattern, varying two parameters: a fixed number of documents to search, and different time periods. We would run each query/aggregation X times, record the resulting values (e.g. latency, query/fetch times), and move on to the next parameter.

Then we would perform the same steps for all index patterns and consolidate the data. After applying the necessary cluster/index configuration changes, we would rerun the tests and compare the results.
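To make the measurement step concrete, here is a minimal sketch of the loop we have in mind, using only the Python standard library against the `_search` REST API. The endpoint, index names, and query bodies are placeholders, and `run_query`/`summarize_latencies` are our own helper names, not anything from an OpenSearch client:

```python
import json
import statistics
import urllib.request

def summarize_latencies(samples_ms):
    """Reduce repeated measurements of one query to baseline KPI numbers (ms)."""
    ordered = sorted(samples_ms)
    return {
        "min": ordered[0],
        "p50": statistics.median(ordered),
        # Nearest-rank p95; good enough for a baseline with small sample counts.
        "p95": ordered[int(0.95 * (len(ordered) - 1))],
        "max": ordered[-1],
    }

def run_query(endpoint, index, body, runs=10):
    """Run one search `runs` times and collect the `took` field (ms) from each response."""
    samples = []
    for _ in range(runs):
        req = urllib.request.Request(
            # request_cache=false keeps the shard request cache from skewing numbers.
            f"{endpoint}/{index}/_search?request_cache=false",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            samples.append(json.load(resp)["took"])
    return summarize_latencies(samples)
```

The idea would be to store one `summarize_latencies` result per (index pattern, query, parameter set) before and after the configuration changes, and compare those.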

We have tried a few tools to understand which would best fit our use case:

  • Using the profile API when running the queries
  • Using the query-insights plugin
  • Using the opensearch-benchmark tool
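For the profile route specifically, what we tried was setting `"profile": true` in the search body and summing the per-shard timings. A sketch of how we read the response (the field names follow the profile API response shape; the match query itself is just an example):

```python
# Search body with profiling enabled; the match query is a placeholder.
body = {
    "profile": True,
    "query": {"match": {"message": "error timeout"}},
}

def total_query_nanos(profile_section):
    """Sum top-level query node times (time_in_nanos) across all shards
    in the `profile` section of a search response."""
    total = 0
    for shard in profile_section["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                total += node["time_in_nanos"]
    return total
```

This gives a per-query cost breakdown, but it is per-request and adds its own overhead, which is part of why we are unsure it fits a repeated KPI measurement workflow.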

Although these tools are somewhat relevant to what we’re trying to achieve, it seems none of them is actually tailored to our use case, or we’re completely missing something right under our noses. We would also like to remove as many skewing factors as possible from the KPI measurements, such as caching.
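On the caching point, the two knobs we know of are the `request_cache=false` query parameter on `_search` and the `POST /<index>/_cache/clear` endpoint between runs; as far as we can tell, the node query cache and the OS page cache cannot be fully bypassed per request, so some warm-up effect will remain. A small helper (endpoint and index pattern are placeholders, `cache_clear_request` is our own name):

```python
import urllib.request

def cache_clear_request(endpoint, index):
    """Build the POST request that clears the caches for an index between runs."""
    return urllib.request.Request(f"{endpoint}/{index}/_cache/clear", method="POST")

# Usage sketch (endpoint and index pattern are placeholders):
# urllib.request.urlopen(cache_clear_request("http://localhost:9200", "logs-app1-*"))
```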

So I’m creating this topic to ask for community feedback on whether we’re using the right tools for the situation. Any additional feedback or suggestions are most welcome.

Thank you.