Versions
- OpenSearch version: 3.1
- Instance type: r8gd.2xlarge.search
- Data nodes: 6
- Index setup: 26 indices under one alias
- Data volume: ~46 GB per index (~1.2 TB total)
- Documents: ~40,129,419 per index (~1.04B total)
Describe the issue:
I’m building an analytics dashboard with SSR that requires:
- Pagination (e.g., the first 20 aggregated results)
- Sorting by calculated metrics (e.g., gross_margin, total_sales)
- Multi-dimensional aggregation (e.g., delivery_method + delivery_country_code)
- Real-time calculation of derived metrics (currently done client-side)
Current Implementation
I’m using a composite aggregation with nested sub-aggregations and paging through the buckets with after_key (example below the query). Each page (iteration) takes ~3 seconds, and some users’ datasets require 5 pages, so the total time reaches ~15 seconds.
Example Query Structure:
```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "user_id": "1234" } },
        { "range": { "order_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } }
      ]
    }
  },
  "aggs": {
    "buckets": {
      "composite": {
        "size": 5000,
        "sources": [
          { "delivery_method": { "terms": { "field": "delivery_method.keyword" } } },
          { "delivery_country_code": { "terms": { "field": "delivery_country_code.keyword" } } }
        ]
      },
      "aggs": {
        "salesAggs": {
          "filter": { "range": { "order_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } },
          "aggs": {
            "total_sales": { "sum": { "field": "total_sales" } },
            "total_sales_gross": { "sum": { "field": "total_sales_gross" } },
            "order_product_cost": { "sum": { "field": "order_product_cost" } },
            "order_delivery_cost": { "sum": { "field": "order_delivery_cost" } },
            "order_marketplace_fee": { "sum": { "field": "order_marketplace_fee" } },
            "orders": {
              "scripted_metric": {
                "init_script": "state.orders = new HashSet()",
                "map_script": "state.orders.add(doc['order_id'].value)",
                "combine_script": "return state.orders",
                "reduce_script": "def allOrders = new HashSet(); for (state in states) { if (state != null) { allOrders.addAll(state); } } return allOrders.size()"
              }
            },
            "pickedItem": { "top_hits": { "size": 1 } }
          }
        },
        "returnsAggs": {
          "filter": { "range": { "order_product_return_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } },
          "aggs": {
            "order_product_return_amount": { "sum": { "field": "order_product_return_amount" } },
            "returns": {
              "scripted_metric": {
                "init_script": "state.orders = new HashSet()",
                "map_script": "state.orders.add(doc['order_id'].value)",
                "combine_script": "return state.orders",
                "reduce_script": "def allOrders = new HashSet(); for (state in states) { if (state != null) { allOrders.addAll(state); } } return allOrders.size()"
              }
            }
          }
        }
      }
    }
  }
}
```
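For context, one "iteration" above means re-running the query with the `after` key returned by the previous page. The composite block of a follow-up request looks like this (the `after` values shown are just placeholders):

```json
"composite": {
  "size": 5000,
  "after": { "delivery_method": "courier", "delivery_country_code": "DE" },
  "sources": [
    { "delivery_method": { "terms": { "field": "delivery_method.keyword" } } },
    { "delivery_country_code": { "terms": { "field": "delivery_country_code.keyword" } } }
  ]
}
```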
Document Structure (Sample)
```ts
{
  // Dimensions
  order_id: string,
  user_id: string,
  delivery_method: string,
  delivery_country_code: string,
  order_product_id: string,
  order_product_name: string,
  // Metrics (stored in the index)
  total_sales: string,                  // numeric value stored as a string
  total_sales_gross: string,
  order_product_cost: string,
  order_delivery_cost: string,
  order_marketplace_fee: string,
  order_product_marketplace_fee: string
}
```
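For completeness: the sum aggregations work despite the string `_source` values because the metric fields are mapped as numerics, and the default `coerce` behavior on numeric fields parses numeric strings at index time. The relevant part of the mapping looks roughly like this (abridged):

```json
"mappings": {
  "properties": {
    "total_sales": { "type": "double" },
    "total_sales_gross": { "type": "double" },
    "order_product_cost": { "type": "double" },
    "order_delivery_cost": { "type": "double" },
    "order_marketplace_fee": { "type": "double" },
    "order_product_marketplace_fee": { "type": "double" }
  }
}
```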
Additional Calculations (Client-Side)
After receiving the aggregated results from OpenSearch, I perform client-side calculations on each bucket object (potentially up to ~400k objects):
- `gross_margin = total_sales - (order_product_cost + order_delivery_cost + order_marketplace_fee + order_product_marketplace_fee)`
- `gross_margin_ratio = (total_sales - total_cost) / total_sales`
- `average_order_value = total_sales / unique_orders_count` (where `unique_orders_count` comes from the `scripted_metric`)

After calculating these metrics, I sort the results client-side and display them in TanStack Table (sketched below). This approach works but becomes inefficient with large datasets, especially when I need to:
- Sort by calculated metrics (e.g., gross_margin)
- Paginate efficiently (e.g., show the first 20 results sorted by gross_margin)
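For reference, the client-side pass is essentially the following (simplified; the bucket shape and function name are illustrative, not my real code):

```ts
// Simplified sketch of the client-side enrichment pass.
interface BucketMetrics {
  total_sales: number;
  order_product_cost: number;
  order_delivery_cost: number;
  order_marketplace_fee: number;
  order_product_marketplace_fee: number;
  unique_orders_count: number; // result of the scripted_metric reduce
}

interface EnrichedBucket extends BucketMetrics {
  gross_margin: number;
  gross_margin_ratio: number;
  average_order_value: number;
}

function enrichSortPage(
  buckets: BucketMetrics[],
  page = 0,
  pageSize = 20
): EnrichedBucket[] {
  const enriched = buckets.map((b) => {
    const totalCost =
      b.order_product_cost +
      b.order_delivery_cost +
      b.order_marketplace_fee +
      b.order_product_marketplace_fee;
    const gross_margin = b.total_sales - totalCost;
    return {
      ...b,
      gross_margin,
      gross_margin_ratio: b.total_sales !== 0 ? gross_margin / b.total_sales : 0,
      average_order_value:
        b.unique_orders_count !== 0 ? b.total_sales / b.unique_orders_count : 0,
    };
  });
  // Sorting up to ~400k objects just to display 20 rows is the part
  // I would like to push down into OpenSearch.
  enriched.sort((a, b) => b.gross_margin - a.gross_margin);
  return enriched.slice(page * pageSize, (page + 1) * pageSize);
}
```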
Performance Observations
- Fast aggregations: small dimensions with few unique values (e.g., delivery_method, ~10 values)
- Slow aggregations: large dimensions with many unique values (e.g., order_id, order_product_id, with millions of values)
- scripted_metric requirement: scripted_metric is used for an exact unique-order count (cardinality is ruled out by our accuracy requirements)
Questions
- Can I calculate derived metrics (like gross_margin) directly in the aggregation query instead of client-side? If so, how? (A sketch of what I have in mind follows this list.)
- Can I sort composite aggregation buckets by calculated metrics (e.g., sort by gross_margin descending)?
- How can I paginate composite aggregation results efficiently (e.g., get only first 20 buckets sorted by a calculated metric)?
- Are there optimizations for scripted_metric aggregations in composite queries with large datasets?
- I tried the SQL plugin, but COUNT DISTINCT converts to cardinality, which isn’t accurate enough. Are there alternatives?
- Would pre-calculating gross_margin and storing it in the document provide significant performance benefits, or can this be handled efficiently in the query? (An ingest-pipeline sketch also follows this list.)
- Are there best practices for composite aggregation with nested aggregations at this scale?
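To make questions 1–3 concrete, this is the kind of query I’m hoping is possible. My understanding is that pipeline aggregations such as bucket_script and bucket_sort are not supported under composite, so this untested sketch uses a multi_terms aggregation instead, trimmed to the sales metrics for brevity; the aggregation names (by_dims, first_page) are made up:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "user_id": "1234" } },
        { "range": { "order_date": { "gte": "2025-01-01", "lte": "2025-01-25" } } }
      ]
    }
  },
  "aggs": {
    "by_dims": {
      "multi_terms": {
        "terms": [
          { "field": "delivery_method.keyword" },
          { "field": "delivery_country_code.keyword" }
        ],
        "size": 5000
      },
      "aggs": {
        "total_sales": { "sum": { "field": "total_sales" } },
        "order_product_cost": { "sum": { "field": "order_product_cost" } },
        "order_delivery_cost": { "sum": { "field": "order_delivery_cost" } },
        "order_marketplace_fee": { "sum": { "field": "order_marketplace_fee" } },
        "order_product_marketplace_fee": { "sum": { "field": "order_product_marketplace_fee" } },
        "gross_margin": {
          "bucket_script": {
            "buckets_path": {
              "sales": "total_sales",
              "productCost": "order_product_cost",
              "deliveryCost": "order_delivery_cost",
              "marketplaceFee": "order_marketplace_fee",
              "productMarketplaceFee": "order_product_marketplace_fee"
            },
            "script": "params.sales - (params.productCost + params.deliveryCost + params.marketplaceFee + params.productMarketplaceFee)"
          }
        },
        "first_page": {
          "bucket_sort": {
            "sort": [{ "gross_margin": { "order": "desc" } }],
            "from": 0,
            "size": 20
          }
        }
      }
    }
  }
}
```

The part I’m unsure about: bucket_sort only sorts and truncates buckets that the parent aggregation has already materialized, so multi_terms would still need a size large enough to hold every dimension combination, and I’d have to watch out for approximate per-shard doc counts. That is why I’m asking whether there is a better pattern.

For question 6, the pre-calculation I have in mind is an ingest pipeline along these lines (the pipeline name is made up; the Double.parseDouble calls are there because my metrics are stored as strings):

```json
PUT _ingest/pipeline/add-gross-margin
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.gross_margin = Double.parseDouble(ctx.total_sales) - (Double.parseDouble(ctx.order_product_cost) + Double.parseDouble(ctx.order_delivery_cost) + Double.parseDouble(ctx.order_marketplace_fee) + Double.parseDouble(ctx.order_product_marketplace_fee))"
      }
    }
  ]
}
```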
Any guidance on optimizing this use case would be greatly appreciated. Thank you!