## Background
OpenSearch currently supports four primary methods for extracting telemetry data to facilitate cluster management and internal observability: (1) Aiven’s Prometheus Exporter Plugin for OpenSearch, (2) the Prometheus Community Elasticsearch Exporter, (3) the native OpenSearch Telemetry Framework, which is based on OpenTelemetry, and (4) OpenSearch’s Performance Analyzer plugin. Each of these approaches was developed with a common goal: to simplify operations for OpenSearch clusters. Today, the open-source community around OpenSearch relies heavily on these solutions. This document compares the four approaches and defines a path forward for continued support and enhancement of telemetry extraction in OpenSearch, ultimately making operations and cluster management more seamless.
### Requirements:
1. Empower the OpenSearch community to export telemetry data and monitor their clusters effectively
1. Ensure minimal performance impact on OpenSearch cluster
1. Support flexible export mechanisms: (a) a direct scraping endpoint, (b) OpenTelemetry Collector integration, (c) local file output
1. Allow plugins to easily integrate with a telemetry framework
1. Minimize deployment and maintenance overhead
## Comparison between existing solutions
### 1. Aiven’s Prometheus Exporter Plugin for OpenSearch
This is an OpenSearch cluster plugin (written in Java) that exposes internal metrics at an HTTP endpoint for Prometheus to scrape. It runs inside each OpenSearch node.
**Architecture:** The plugin is installed on every data node; it gathers node stats and serves them in Prometheus format. Prometheus (or another scraper) polls each node’s `/_prometheus/metrics` endpoint. The metrics are stored in the Prometheus TSDB (or long-term storage) and visualized/alerted on in Grafana or OpenSearch Dashboards.
<p align="center">
<img src="https://github.com/user-attachments/assets/296101f5-e417-4538-9b18-3bb541c368a3" width="928" height="275">
</p>
Each OpenSearch node (with the Aiven plugin) exposes a Prometheus endpoint. A Prometheus server scrapes all nodes, stores the data, and serves Grafana/Alertmanager.
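For Kubernetes deployments managed by the Prometheus Operator, the per-node scraping described above can be declared as a ServiceMonitor. The following is a minimal sketch, not a verified configuration: the `app: opensearch` Service label, the `http` port name, and the `opensearch-admin` credentials Secret are all assumptions about the deployment.
```
# Hypothetical ServiceMonitor for the plugin endpoint (label, port name, and Secret are assumed)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opensearch-prometheus-plugin
spec:
  selector:
    matchLabels:
      app: opensearch              # assumed label on the OpenSearch Service
  endpoints:
    - port: http                   # assumed name of the Service port fronting 9200
      path: /_prometheus/metrics   # endpoint exposed by the Aiven plugin
      basicAuth:
        username:
          name: opensearch-admin   # assumed Secret holding basic-auth credentials
          key: username
        password:
          name: opensearch-admin
          key: password
```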
#### Pros:
- Tight integration: Runs inside OpenSearch, so has direct access to internal metrics (and can expose any Java-level stats).
- All-OS-metrics coverage: Can expose node-, index-, and cluster-level metrics exactly as seen internally.
- Prebuilt dashboards and rules: The project includes a “mixin” of Prometheus alerting rules and Grafana dashboards for OpenSearch (https://monitoring.mixins.dev/).
- No external service needed: Operates as part of the cluster; no separate exporter process.
#### Cons:
- Performance overhead: Running on each node adds CPU/memory overhead, and scraping the metrics endpoint too frequently can load master nodes.
- Integration: Works natively with Prometheus (exposes Prometheus-format metrics) and thus with Grafana and Prometheus Operator. Dashboards for Grafana/OSD can directly query these metrics. OpenTelemetry tooling is not directly used here (though metrics can be pulled via an OpenTelemetry Collector and pushed over OTLP if desired).
- Narrow scope: Exposes metrics only (no logs/traces). Alerts and dashboards must be defined separately (though the mixin helps).
- Extra Installation: The approach requires the installation of an external plugin which, depending on the OS deployment automation, is an additional effort.
### 2. Prometheus-Community Elasticsearch Exporter
This is an external Go service originally built for Elasticsearch and now also used with OpenSearch. It scrapes OpenSearch’s REST stats APIs and exposes them in Prometheus format. It runs independently (e.g. in Kubernetes or alongside clients).
**Architecture:** A single exporter service (pod/daemon) polls the OpenSearch cluster (via the standard REST endpoints) to collect cluster, node, and index stats. It then exposes a Prometheus endpoint (default port 9114). A Prometheus server scrapes the exporter, stores metrics, and feeds Grafana/Alertmanager.

An external exporter queries OpenSearch over HTTP (Stats APIs). Prometheus scrapes the exporter’s `/metrics` endpoint and stores data for Grafana/alerts.
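To illustrate the decoupled deployment model, here is a minimal Kubernetes Deployment sketch for the exporter. The `--es.uri`, `--es.all`, and `--es.indices` flags are the exporter’s documented options; the Service name `opensearch-cluster` and the image tag are assumptions.
```
# Minimal exporter Deployment sketch (cluster address and image tag are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch-exporter
  template:
    metadata:
      labels:
        app: elasticsearch-exporter
    spec:
      containers:
        - name: exporter
          image: quay.io/prometheuscommunity/elasticsearch-exporter:v1.7.0
          args:
            - "--es.uri=http://opensearch-cluster:9200"   # assumed OpenSearch Service address
            - "--es.all"                                  # collect stats from all cluster nodes
            - "--es.indices"                              # include per-index metrics
          ports:
            - containerPort: 9114                         # default port scraped by Prometheus
```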
#### Pros:
- Decoupled: No plugin installation needed; one exporter can scrape many nodes and all versions. It runs outside the cluster (e.g. in Kubernetes), so cluster restarts or upgrades do not require managing the exporter.
- Feature-rich: It already handles many metrics (more than the plugin, including cluster version, doc counts, etc.) and integrates with the Prometheus ecosystem (Helm charts, Grafana dashboards).
- Flexibility: You can filter/export subsets (e.g. indices only) and rename the metrics prefix (there is an active proposal for an “opensearch_” prefix to avoid confusion). It can be packaged as a container, making Prometheus Operator integration straightforward.
- Prometheus-native: Fits seamlessly into Prometheus setups (ServiceMonitor, alerting rules, etc.).
#### Cons:
- Less native to OpenSearch: It relies on the REST Stats APIs which may not include newer or plugin-specific metrics. For example, OpenSearch-specific features (like index state management, kNN, ML metrics) might not be scraped until exporters are updated.
- Dependency risk: If the OpenSearch and Elasticsearch APIs diverge in the future, the exporter must adapt; otherwise a fork will be needed.
- Scope: Only collects metrics (no logs/traces). Alert rules/dashboards must be provided separately (though many “awesome” Prometheus alert collections exist).
- Integration: Native Prometheus integration: it emits Prometheus metrics and can be deployed via the Prometheus Operator (ServiceMonitor) with official Helm charts. Grafana dashboards can query Prometheus or OpenSearch Dashboards (with the SQL plugin). It does not involve OpenTelemetry itself (though one could route Prometheus metrics into an OpenTelemetry Collector if desired). Alerts are handled by Prometheus Alertmanager or by sending metrics to OpenSearch for built-in observability.
### 3. OpenSearch Native Telemetry Framework
OpenSearch is developing a built-in metrics framework (powered by OpenTelemetry) to collect and export metrics. This is an experimental feature (as of OpenSearch 2.11+) that instruments core and plugins and can export metrics and traces via OTLP or store them on disk.
**Architecture:** Inside each OpenSearch node, the telemetry-otel plugin collects internal metrics (counters, histograms, etc.) and can export them. Supported exporters include a logging exporter (which writes `_otel_metrics.log`) and an OTLP gRPC exporter (sending metrics to a local OpenTelemetry Collector at `localhost:4317`). Typically one would run an OTel Collector (or Data Prepper with an OTel receiver) on each node or centrally to receive these metrics, then forward them to a backend (an OpenSearch index or an external TSDB). For example, Data Prepper can ingest OTel metrics and index them into OpenSearch’s metrics index, which can be queried via OpenSearch Dashboards’ Observability → Metrics UI. Alternatively, the Collector might expose metrics for Prometheus to scrape (via its Prometheus exporter) or forward them to cloud monitoring.

OpenSearch nodes emit metrics and traces via the telemetry-otel plugin. An OpenTelemetry Collector ingests these (via gRPC) and exports them to OpenSearch or Prometheus. Dashboards/alerts then consume from the chosen storage.
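Enabling the framework is configured through node settings. The sketch below follows the metrics framework getting-started documentation linked in the references; because the feature is experimental, the exact setting names may change between releases and should be treated as an assumption, not a stable contract.
```
# opensearch.yml sketch; setting names follow the experimental docs and may change
opensearch.experimental.feature.telemetry.enabled: true    # feature flag for the telemetry framework
telemetry.feature.metrics.enabled: true                    # turn on the metrics framework
# Ship metrics over OTLP gRPC to a local Collector at localhost:4317;
# the default logging exporter writes to _otel_metrics.log instead.
telemetry.otel.metrics.exporter.class: io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter
```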
Currently, the distributed tracing feature generates traces and spans for HTTP requests and a subset of transport requests. These traces and spans are initially kept in memory using the OpenTelemetry BatchSpanProcessor and are then sent to an exporter based on the configured settings; the span processor and the exporter are the two key components. The framework also supports head and tail sampling.
#### Pros:
- Extensible: Built by the OpenSearch core team; metrics are first-class citizens. The Metrics Framework can instrument any plugin or feature via the OTel SDK. It supports rich metric types (histograms, percentiles) that the old Stats API lacked.
- Unified telemetry path: Because it’s based on OpenTelemetry, in principle the same infrastructure can handle logs, metrics, and traces, easing integration. Future anomaly detection and observability features (e.g. integrated dashboards) could tap into this flow.
- No external exporter needed: Metrics are exported over the industry-standard OTLP. Users can send them to any OTel-compatible backend (Prometheus by running an exporter, or directly into OpenSearch).
- Core + plugin coverage: Any feature instrumented with OTel will automatically be collected, eliminating gaps. Eventually, all OpenSearch metrics could use the same framework.
#### Cons:
- Experimental state: Currently disabled by default (behind a feature flag) and not production ready. Users must enable JVM flags and install the telemetry-otel plugin. Performance and stability impact, as well as memory requirements, are also unknown, making production expectations unclear.
- Complex setup: Requires running an OTel Collector or Data Prepper to handle the OTLP stream. The default exporter writes to a log file, which needs ingestion to be useful. Users unfamiliar with OTel pipelines must configure collectors and index templates.
- Limited tooling today: No Prometheus-native endpoint. To use with Prometheus, one must route metrics through an OTel Collector and use a Prometheus exporter or pushgateway.
- Integration: This path is native OpenTelemetry, not Prometheus. Prometheus is not directly used unless you insert a bridge. Metrics can be sent via OTLP to an OTel Collector, which could use a Prometheus exporter to allow Prometheus to scrape (or use OpenSearch as the backend). Grafana can visualize OTel metrics if they end up in a supported store (e.g. Grafana Cloud’s Prometheus or an OpenSearch index via OSD). The Prometheus Operator is not relevant unless one runs a Prometheus receiver. For dashboards/alerts, one would either use the OpenSearch Dashboards Metrics UI (once metrics are indexed) or an external system that ingests OTel. In short, this approach is built for OpenTelemetry; users will likely pair it with an OTel Collector and Data Prepper/OpenSearch ingestion pipelines (a Data Prepper sketch follows below). Prometheus/Grafana can be used on the stored metrics but require extra steps (e.g. exporting metrics from Data Prepper to Prometheus).
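To make the Data Prepper ingestion path concrete, here is a minimal pipeline sketch that receives OTel metrics and indexes them into OpenSearch. The source, processor, and sink names follow current Data Prepper conventions, but the port, credentials, and index pattern are illustrative assumptions.
```
# pipelines.yaml sketch (credentials and index pattern are illustrative)
otel-metrics-pipeline:
  source:
    otel_metrics_source:          # OTLP gRPC receiver (default port 21891)
      ssl: false
  processor:
    - otel_metrics:               # flattens OTel metric records into indexable documents
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]
        username: admin
        password: admin
        insecure: true            # skip TLS verification; for local testing only
        index: metrics-otel-%{yyyy.MM.dd}
```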
### 4. Performance Analyzer plugin
The Performance Analyzer (PA) plugin runs on each OpenSearch node, collecting metrics and storing them in shared memory. It exposes collected metrics on port 9600, which can be accessed by tools like PerfTop or custom exporters.
**Architecture:** The Performance Analyzer Agent runs on each node in the cluster and collects performance metrics at regular intervals. It uses the Java Management Extensions (JMX) API to gather JVM and operating system metrics, as well as the OpenSearch REST API to collect cluster-specific metrics. The Performance Analyzer REST API provides an interface for users to query the collected performance metrics: users can request metrics for specific nodes, indices, or time ranges, and the API returns the data in JSON format (an example query is shown below).
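As an illustration, a query against the PA REST API takes the following shape (per the PA documentation linked in the references); the specific metrics, aggregations, and dimensions here are just examples.
```
GET localhost:9600/_plugins/_performanceanalyzer/metrics?metrics=Latency,CPU_Utilization&agg=avg,max&dim=ShardID&nodes=all
```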
**PerfTop CLI:** PerfTop is a command-line interface (CLI) tool that allows users to visualize the performance metrics collected by the Performance Analyzer Agent in real-time. It queries the Performance Analyzer REST API and displays the results in a user-friendly, customizable dashboard.
<p align="center">
<img src="https://github.com/user-attachments/assets/80aae212-3ac4-4aba-ac7b-dd4bba6260fa" width="500" height="300">
</p>
#### Pros:
- Deep Native Integration: As an official OpenSearch plugin, PA offers seamless access to internal metrics, including JVM, thread pools, garbage collection, disk I/O, and network usage.
- Root Cause Analysis (RCA): PA includes an RCA framework that models metrics as a distributed data-flow graph, enabling real-time identification of performance bottlenecks.
- Low Overhead: Metrics are stored in shared memory (/dev/shm), minimizing disk I/O and ensuring fast access.
- Pre-installed: PA comes bundled with OpenSearch versions 2.0 and above, simplifying deployment.
- CLI Tool - PerfTop: Provides a command-line interface for real-time visualization of cluster performance metrics.
#### Cons:
- Limited Retention: By default, metrics are retained for only 7 minutes (max 60 minutes), which may not suffice for historical analysis.
- Integration Challenges: PA doesn't natively export metrics to external systems like Prometheus or OpenTelemetry, requiring additional tooling for such integrations.
- Visualization Limitations: While PerfTop offers real-time views, there's no built-in support for long-term dashboards or integrations with tools like Grafana.
- Operational Complexity: Managing RCA graphs and ensuring consistent configurations across nodes can be intricate.
- Resource Consumption: Under heavy workloads, PA can consume up to 1 GB of shared memory, which might be a concern in resource-constrained environments.
## Next steps
### Immediate Plan
The OpenSearch Metrics Framework is currently in an experimental state and provides limited metrics and trace information. In contrast, both the Aiven plugin and the Elasticsearch Exporter deliver significantly greater value in their current implementations. These solutions offer Prometheus-compatible scraping endpoints that allow direct metric collection, and users can leverage Grafana for visualization and alerting. Both approaches provide substantial value through pre-built dashboards and alerts, offering users ready-to-deploy solutions.
However, the community faces two critical challenges: limited ongoing support for Aiven's Prometheus Exporter Plugin for OpenSearch, and the potential for API divergence between Elasticsearch and OpenSearch that could impact the Prometheus Community's Elasticsearch Exporter. This uncertainty prevents users from confidently committing to either solution. Therefore, the OpenSearch project must advocate for extended support of one or both approaches.
#### Option 1: Support Plugin approach with Aiven’s Prometheus Exporter Plugin
The Aiven plugin is great for collecting node-based OpenSearch metrics, though it has received limited support since version 2.17. We propose migrating this project under the OpenSearch-Project organization to ensure continued maintenance and development. The original plugin author has expressed support for this transition and has been instrumental in maintaining the plugin for the community.
This approach offers strategic advantages as the plugin aligns well with our long-term vision, allowing us to integrate metrics generated by this plugin into the OpenSearch Telemetry Framework. However, there are trade-offs to consider: since the plugin operates within the same OpenSearch cluster, it may introduce performance overhead. Additionally, plugin upgrades are coupled to OpenSearch version releases. While the current metrics catalog is smaller compared to the Elasticsearch Exporter, this can be enhanced through continued development.
#### Option 2: Support Sidecar approach with Prometheus Community’s Elasticsearch Exporter
The Prometheus Community's Elasticsearch Exporter provides an extensive metrics catalog that integrates seamlessly with predefined alerts and dashboards. The sidecar deployment model reduces performance overhead and eliminates version coupling, though it introduces additional operational complexity for deploying and maintaining a separate service.
A significant community concern involves metric naming conventions. Currently, metrics use Elasticsearch naming, which creates confusion for OpenSearch users. The sidecar approach also becomes more challenging to maintain as new plugins introduce their own metrics and API endpoints. While the original exporter authors have graciously supported both Elasticsearch and OpenSearch, addressing the naming confusion and potential long-term API divergence may require forking the exporter to create a dedicated OpenSearch version.
**Comparing Short term options based on requirements:**
| Requirement | Option 1: Aiven Prometheus Exporter Plugin (In-Process Plugin) | Option 2: Prometheus Elasticsearch Exporter (Sidecar) | Comments |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Export telemetry data and monitor clusters effectively** | Supports exporting node metrics | Supports extensive metrics | Both options are metrics-only; neither supports full telemetry (logs, traces, etc.) yet |
| **Minimal performance impact on OpenSearch cluster** | Runs inside OpenSearch; may introduce performance overhead | Runs externally; minimal impact on OpenSearch cluster | Sidecar model is more isolated but requires separate resources |
| **Support flexible export mechanisms** | Supports scraping endpoint only | Supports scraping endpoint only | Both rely on Prometheus scraping; no native support for OTLP or file output |
| **Plugin integration with telemetry framework** | Aligns well with OpenSearch telemetry framework; plugins can integrate metrics easily | Difficult for plugins to expose metrics via the sidecar | In-process plugin can leverage internal APIs and metrics registry |
| **Deployment and maintenance overhead** | No extra deployment if bundled with OpenSearch distribution | Requires separate deployment and monitoring of sidecar service | Plugin simplifies ops if shipped as part of OpenSearch |
| **Long-term maintainability and risk of divergence** | Maintained under OpenSearch project control (proposed) | Risk of eventual fork due to OpenSearch & Elasticsearch API differences and naming conflicts | Sidecar may require forking to better align with OpenSearch specific conventions |
#### **Proposed Approach: Short term**
We propose **Option 1: supporting the Aiven Prometheus Exporter Plugin** as the preferred approach.
This in-process plugin model aligns closely with our architectural goals, making it easier to integrate plugin-generated metrics into the broader telemetry framework. It also simplifies deployment by avoiding additional services and ensures tighter control over maintenance by bringing the plugin under the OpenSearch Project organization.
While Option 2 (sidecar approach using the Prometheus Community’s Elasticsearch Exporter) offers a rich metrics catalog and operational isolation, it introduces long-term challenges in maintaining compatibility, increases operational complexity, and complicates plugin integration. These drawbacks make it less suitable as a foundation for the unified telemetry framework we envision.
### **Long term plan**
Our vision is to extend the existing OpenSearch Telemetry Framework to serve as the single source of truth for all telemetry data generated within OpenSearch. This centralized registry should be accessible to all plugins, enabling them to emit logs, traces, and metrics through a standardized interface.
By default, the Metrics Framework should support writing telemetry data to the local file system while providing the option to forward data to an OpenTelemetry (OTel) Collector. This approach will empower the OpenSearch community to integrate with their preferred monitoring systems, promoting greater flexibility and ecosystem compatibility.
<p align="center">
<img src="https://github.com/user-attachments/assets/5248af9e-efef-46c0-95c1-f7c1bba668c4" width="400" height="400">
</p>
## Questions for the community:
1. Are you using any existing framework/stack to manage your OpenSearch cluster?
   a. If yes, which framework/stack is it and what is the best part about it?
   b. If not, do you see yourself using any of the aforementioned telemetry frameworks in the future?
1. What is your opinion on the short-term and long-term approaches for OpenSearch's telemetry framework?
1. What are the most important metrics or categories of metrics for you to manage your OpenSearch clusters?
## Appendix
### 1. Setup and sample configs
The setup below was used to compare the four approaches mentioned in the background:
* prometheus-exporter-plugin-for-opensearch:
prometheus.yml
```
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "opensearch"
    # OpenSearch username and password
    basic_auth:
      username: 'admin'
      password: 'admin'
    metrics_path: "/_prometheus/metrics"
    # scheme defaults to 'http'
    static_configs:
      - targets: ["localhost:9200"]
```
* elasticsearch-exporter:
prometheus.yml
```
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "opensearch"
    # OpenSearch username and password
    basic_auth:
      username: 'admin'
      password: 'admin'
    # default metrics path for elasticsearch_exporter
    metrics_path: "/metrics"
    # scheme defaults to 'http'
    static_configs:
      # default port for elasticsearch_exporter
      - targets: ["localhost:9114"]
```
* OpenSearch Metrics Framework:
OTel Collector config
```
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlphttp/prometheus:
    endpoint: "http://localhost:9090/api/v1/otlp"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/prometheus]
```
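Note: the `otlphttp/prometheus` exporter above pushes OTLP to Prometheus directly, which assumes Prometheus was started with its OTLP write receiver enabled (the `--enable-feature=otlp-write-receiver` feature flag in recent releases); without it, the `/api/v1/otlp` endpoint is not served.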
### 2. References
1. OpenSearch steering committee issue: https://github.com/opensearch-project/technical-steering/issues/35
2. Aiven OpenSearch exporter plugin: https://github.com/Aiven-Open/prometheus-exporter-plugin-for-opensearch
3. Elasticsearch exporter: https://github.com/prometheus-community/elasticsearch_exporter
4. Support for OpenSearch in the Elasticsearch exporter: https://github.com/prometheus-community/elasticsearch_exporter/issues/984
5. Discussion on moving the Aiven plugin to OpenSearch core: https://github.com/opensearch-project/OpenSearch/issues/8990#issuecomment-2098510772
6. OpenSearch metrics framework: https://docs.opensearch.org/docs/latest/monitoring-your-cluster/metrics/getting-started/
7. OpenSearch distributed tracing: https://docs.opensearch.org/docs/latest/observing-your-data/trace/distributed-tracing/
8. OpenSearch PA plugin: https://docs.opensearch.org/docs/latest/monitoring-your-cluster/pa/index/
### Special Thanks
This RFC was pre-reviewed by @KarstenSchnitter, @sam-herman, @spapadop, @ritvibhatt, @lezzago, @oberkem, @lukas-vlcek, and @anirudha. Special thanks to them for their help in this effort.