Optimizing Composite Aggregation Performance with Calculated Metrics, Sorting, and Pagination

Versions

OpenSearch Version: 3.1
Instance Type: r8gd.2xlarge.search
Data Nodes: 6
Index Setup: 26 indices under one alias
Data Volume: ~46GB per index (~1.2TB total)
Documents: ~40,129,419 per index (~1.04B total)

Describe the issue:

I’m building a server-side rendered (SSR) analytics dashboard that requires:

Pagination (e.g., first 20 aggregated results)
Sorting by calculated metrics (e.g., gross_margin, total_sales)
Multi-dimensional aggregation (e.g., delivery_method + delivery_country_code)
Real-time calculation of derived metrics (currently done client-side)

Current Implementation
I’m using a composite aggregation with nested sub-aggregations, paginating through the buckets in pages of 5,000. Each page takes ~3 seconds, and some users require 5 pages, so total time reaches ~15 seconds.
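For context, the pagination loop on my side looks roughly like this (a simplified sketch; the `search` parameter is a placeholder for the actual OpenSearch client call that sends the query below with the given `after` key):

```typescript
type CompositeKey = Record<string, string>;

interface Bucket {
  key: CompositeKey;
  [agg: string]: unknown; // sub-aggregation results (salesAggs, returnsAggs, ...)
}

interface CompositeResult {
  buckets: Bucket[];
  after_key?: CompositeKey; // absent once the last page has been fetched
}

// Fetch every composite page sequentially by feeding after_key back
// into the next request as the "after" parameter.
async function fetchAllBuckets(
  search: (after?: CompositeKey) => Promise<CompositeResult>
): Promise<Bucket[]> {
  const all: Bucket[] = [];
  let after: CompositeKey | undefined;
  do {
    const page = await search(after); // one ~3s round trip per page
    all.push(...page.buckets);
    after = page.after_key;
  } while (after !== undefined);
  return all;
}
```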

Example Query Structure:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "user_id": "1234" } },
        { "range": { "order_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } }
      ]
    }
  },
  "aggs": {
    "buckets": {
      "composite": {
        "size": 5000,
        "sources": [
          { "delivery_method": { "terms": { "field": "delivery_method.keyword" } } },
          { "delivery_country_code": { "terms": { "field": "delivery_country_code.keyword" } } }
        ]
      },
      "aggs": {
        "salesAggs": {
          "filter": { "range": { "order_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } },
          "aggs": {
            "total_sales": { "sum": { "field": "total_sales" } },
            "total_sales_gross": { "sum": { "field": "total_sales_gross" } },
            "order_product_cost": { "sum": { "field": "order_product_cost" } },
            "order_delivery_cost": { "sum": { "field": "order_delivery_cost" } },
            "order_marketplace_fee": { "sum": { "field": "order_marketplace_fee" } },
            "orders": {
              "scripted_metric": {
                "init_script": "state.orders = new HashSet()",
                "map_script": "state.orders.add(doc['order_id'].value)",
                "combine_script": "return state.orders",
                "reduce_script": "def allOrders = new HashSet(); for (state in states) { if (state != null) { allOrders.addAll(state); } } return allOrders.size()"
              }
            },
            "pickedItem": { "top_hits": { "size": 1 } }
          }
        },
        "returnsAggs": {
          "filter": { "range": { "order_product_return_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } },
          "aggs": {
            "order_product_return_amount": { "sum": { "field": "order_product_return_amount" } },
            "returns": {
              "scripted_metric": {
                "init_script": "state.orders = new HashSet()",
                "map_script": "state.orders.add(doc['order_id'].value)",
                "combine_script": "return state.orders",
                "reduce_script": "def allOrders = new HashSet(); for (state in states) { if (state != null) { allOrders.addAll(state); } } return allOrders.size()"
              }
            }
          }
        }
      }
    }
  }
}

Document Structure (Sample)

{
  // Dimensions
  order_id: string,
  user_id: string,
  delivery_method: string,
  delivery_country_code: string,
  order_product_id: string,
  order_product_name: string,
  
  // Metrics (stored in index)
  total_sales: string,              // numeric as string
  total_sales_gross: string,
  order_product_cost: string,
  order_delivery_cost: string,
  order_marketplace_fee: string,
  order_product_marketplace_fee: string
}

Additional Calculations (Client-Side)
After receiving aggregated results from OpenSearch, I perform client-side calculations on each bucket object (potentially up to 400k objects):

gross_margin = total_sales - (order_product_cost + order_delivery_cost + order_marketplace_fee + order_product_marketplace_fee)
gross_margin_ratio = (total_sales - total_cost) / total_sales
average_order_value = total_sales / unique_orders_count (where unique_orders_count comes from scripted_metric)
After calculating these metrics, I sort the results client-side and display them in a TanStack Table. This approach works, but it becomes inefficient at scale, especially when I need to:

Sort by calculated metrics (e.g., gross_margin)
Paginate efficiently (e.g., show first 20 results sorted by gross_margin)
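The client-side step is essentially the following (a simplified sketch; the interface and function names are illustrative, and the fields correspond to the sum and scripted_metric results in the query above):

```typescript
interface BucketMetrics {
  total_sales: number;
  order_product_cost: number;
  order_delivery_cost: number;
  order_marketplace_fee: number;
  order_product_marketplace_fee: number;
  unique_orders_count: number; // from the scripted_metric "orders" result
}

interface DerivedMetrics extends BucketMetrics {
  gross_margin: number;
  gross_margin_ratio: number;
  average_order_value: number;
}

function deriveMetrics(b: BucketMetrics): DerivedMetrics {
  const totalCost =
    b.order_product_cost +
    b.order_delivery_cost +
    b.order_marketplace_fee +
    b.order_product_marketplace_fee;
  const gross_margin = b.total_sales - totalCost;
  return {
    ...b,
    gross_margin,
    gross_margin_ratio: b.total_sales !== 0 ? gross_margin / b.total_sales : 0,
    average_order_value:
      b.unique_orders_count !== 0 ? b.total_sales / b.unique_orders_count : 0,
  };
}

// Derive metrics for all (potentially ~400k) buckets, sort client-side,
// and keep only the first page for display.
function topByGrossMargin(
  buckets: BucketMetrics[],
  pageSize = 20
): DerivedMetrics[] {
  return buckets
    .map(deriveMetrics)
    .sort((a, b) => b.gross_margin - a.gross_margin)
    .slice(0, pageSize);
}
```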

Performance Observations
Fast aggregations: Small dimensions with few unique values (e.g., delivery_method ~10 values)
Slow aggregations: Large dimensions with many unique values (e.g., order_id, order_product_id with millions of values)
scripted_metric requirement: Using scripted_metric for an exact unique-order count (cardinality is approximate, and I need exact values)

Questions

  1. Can I calculate derived metrics (like gross_margin) directly in the aggregation query instead of client-side? If so, how?
  2. Can I sort composite aggregation buckets by calculated metrics (e.g., sort by gross_margin descending)?
  3. How can I paginate composite aggregation results efficiently (e.g., get only first 20 buckets sorted by a calculated metric)?
  4. Are there optimizations for scripted_metric aggregations in composite queries with large datasets?
  5. I tried the SQL plugin, but COUNT DISTINCT converts to cardinality, which isn’t accurate enough. Are there alternatives?
  6. Would pre-calculating gross_margin and storing it in the document provide significant performance benefits, or can this be handled efficiently in the query?
  7. Are there best practices for composite aggregation with nested aggregations at this scale?

Any guidance on optimizing this use case would be greatly appreciated. Thank you!