Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Describe the issue : Does OpenSearch support distributed tracing? The documentation says not to use it in production.
Thanks
Configuration :
Relevant Logs or Screenshots :
Leeroy
September 2, 2025, 11:33am
Hi @ak_amkum19 ,
Distributed tracing is supported, as you mentioned, and you can read about it in this doc - https://docs.opensearch.org/latest/observing-your-data/trace/distributed-tracing/ . However, it is not yet recommended for production environments.
More can be read on the GitHub issue here - https://github.com/opensearch-project/OpenSearch/issues/6750 . I’d suggest testing in a development environment for now and checking the GitHub issue to see whether any progress has been made.
Leeroy.
opened 11:00AM - 20 Mar 23 UTC
enhancement
distributed framework
Roadmap:Stability/Availability/Resiliency
lucene
**Is your feature request related to a problem? Please describe.**
Feature Request #1061
**Describe the solution you'd like**
## Problem Statement ##
OpenSearch doesn’t have the capability to trace a request end to end with a task-level breakdown. It provides limited support through X-Opaque-Id: clients can pass this ID and it is returned in the response; it is also used in the deprecation logger, etc. For deeper analysis and debugging, OpenSearch needs extensive tracing to locate resource utilization and the code paths that are hot or consume the most resources.
## Tenets ##
1. **Minimal Overhead** – Tracing shouldn’t add more than 1% of overhead on system resources like CPU and memory. We should prefer minimal tracing to reduce the overhead.
2. **No performance impact** – Tracing shouldn’t add any performance impact to cluster operations.
3. **Extensible** – Tracing framework should be easily extendable for background tasks, plugins and extensions.
4. **Safety** - Framework should ensure that there is no memory leak in the instrumentation code.
5. **Well defined Abstractions** – Tracing frameworks should have well defined abstractions so that in case of implementation framework upgrades, API contract changes, implementing classes shouldn’t be impacted.
6. **Flexible** – It should have a mechanism to enable/disable tracing at different granularities, such as levels and components (search, indexing, etc.), through dynamic settings.
7. **Meaningful/Purposeful** – Framework should have the mechanism to avoid exploitation of tracing.
8. **Clear boundaries** – Framework should define the clear boundaries of instrumentation generation and sinks.
## Tracing Framework ##
It’s a well-known fact that tracing comes with a cost, and for high-throughput systems like OpenSearch this cost may be huge. So, we need to be very cautious while designing the tracing framework for OpenSearch. The tracing framework will provide the abstractions, governance, utilities, context propagation techniques/guidance, and sampling options, so that developers/users can simply use the framework and emit traces. Let’s discuss these points in detail –
1. **Abstractions** – We can use a tracing framework such as OpenTelemetry, but we will abstract the tracing solution behind OpenSearch APIs so that in the future we need not change the core OpenSearch code where the actual instrumentation is done. We will be careful when defining the abstractions so that we don’t over-engineer them either.
2. **Sampling** – We will support both head-based and tail-based sampling; the choice will be configurable or decided based on need. The framework will provide sampling implementations that can be configured at the component level. For example, the sampling requirements for search requests could differ from those for indexing requests and background jobs.
1. Head Based - Head-based sampling collects and stores trace data randomly. Typically, head-based sampling happens within the agent responsible for collecting trace telemetry, which randomly selects the traces to sample for analysis. We can customize the selection logic to sample based on the request type, query type, etc. The downside of this technique is that we can miss slow requests.
2. Tail Based - Tail-based sampling collects all information about a trace until it is complete but stores only the sampled traces. This helps reduce the load on data stores and the pressure on in-memory data sources. It gives us the opportunity to evaluate a trace before dropping it, which suits our use cases well. For example, we could configure a minimum duration threshold and filter out traces that take less time than that.
3. **Governance**
1. Framework will provide the dynamic tracing enable/disable options at different levels like components, levels, entire framework, etc.
2. It will also define the ways instrumentation can be done such as task level, threadpool level, Listeners Support, etc.
3. Management of tracing objects.
4. Tracing Levels – Like logging levels, we will introduce tracing levels so that in case of issues we can disable/enable tracing based on level. For example, calculating the CPU/memory consumption of a method is a heavy operation, so we should be able to enable it only as and when needed. Levels will also help developers/users think about the need for and the exact level of their tracing, so that we don’t bloat the system with unnecessary traces.
4. **Code pollution** – With all this ease of instrumentation, it's highly likely that it will get exploited for unnecessary tracing with lots of boilerplate code.
1. **Avoid extensive tracing**
1. Component Level Management – At each component level we need to define tracing hooks so that we avoid polluting the code where the core logic is written. For example, to instrument search requests we can start tracing from the parent task and pass the context down to SearchOperationListener to instrument the entry/exit points of the major search phases. There will also be a provision to get the open tracing spans by some deterministic ID and then do more granular tracing wherever needed.
2. To highlight explicit tracing we can come up with an annotation like @EnableTracing so that code reviewers can make sure that it’s absolutely needed. We can add a plugin to the build so that any tracing without this annotation fails the build.
2. **Boilerplate code** - The framework will make sure that users need to write only 1 or 2 lines of recording code, as with a logger, to reduce boilerplate. We can come up with easy-to-use annotations, but runtime annotations might be a little costly.
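The governance and boilerplate points above (per-component tracing levels, dynamic enable/disable, and one-line recording) could look roughly like the following. This is only a sketch in Python rather than OpenSearch's Java: `TraceLevel`, `Tracer`, `set_level`, and `span` are hypothetical names made up for illustration and are not part of any OpenSearch or OpenTelemetry API.

```python
import time
from contextlib import contextmanager
from enum import IntEnum


class TraceLevel(IntEnum):
    """Hypothetical tracing levels, ordered from cheapest to most expensive."""
    OFF = 0
    BASIC = 1     # span name and duration only
    DETAILED = 2  # heavier data, e.g. CPU/memory accounting


class Tracer:
    """Illustrative tracer with per-component levels (not the OpenSearch API)."""

    def __init__(self, default_level=TraceLevel.OFF):
        self.default_level = default_level
        self.component_levels = {}  # e.g. {"search": TraceLevel.BASIC}
        self.finished = []          # completed spans, stored as plain dicts

    def set_level(self, level, component=None):
        """Stand-in for a dynamic setting: change levels at runtime."""
        if component is None:
            self.default_level = level
        else:
            self.component_levels[component] = level

    def level_for(self, component):
        return self.component_levels.get(component, self.default_level)

    @contextmanager
    def span(self, component, name, level=TraceLevel.BASIC, **attributes):
        """Record a span only if the component's current level allows it."""
        if self.level_for(component) < level:
            yield None  # tracing disabled here: nothing is timed or recorded
            return
        start = time.perf_counter()
        try:
            yield attributes
        finally:
            self.finished.append({
                "component": component,
                "name": name,
                "duration_s": time.perf_counter() - start,
                "attributes": attributes,
            })


tracer = Tracer()
tracer.set_level(TraceLevel.BASIC, component="search")

with tracer.span("search", "query_phase"):
    pass  # core search logic would run here

with tracer.span("search", "cpu_profile", level=TraceLevel.DETAILED):
    pass  # not recorded: "search" is only at BASIC

with tracer.span("indexing", "bulk"):
    pass  # not recorded: the default level is OFF
```

With this shape the call site stays at one `with` line, and raising the level for a single component at runtime is enough to turn on the heavier spans.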
## High Level Design ##
There are 3 major components of the tracing framework -
1. **Tracing APIs** – The tracing APIs will provide well-defined tracing interfaces. They will do most of the heavy lifting and boilerplate so that users can record a trace with a simple call. The APIs will clearly abstract the implementation, so users need not worry about the underlying framework (maybe OpenTelemetry) or API version changes. The APIs should take care of most of the attributes that need to be captured and assign them levels according to their resource footprint.
2. **Bounded In-memory Buffer** - All traces will be recorded in memory at the parent span level and flushed periodically to the sink. (OPEN – how can we limit the memory usage for the tracing solution?)
3. **Sink** - All flushed traces will be written to a configurable telemetry data store of the user's choice. The sink should also be able to sample traces before writing to the data store. The sink will drop traces when a backlog builds up, to reduce the pressure on the in-memory buffers. Users can select between the following 2 types of sinks.
1. Co-located Sink – This type of sink will run in the same JVM (core OS) and keep on writing to the telemetry data store. Users will not have to maintain any additional components but may need to provide some additional resource for co-located sinks to run.
2. Sidecar sink – This type of sink will run as a separate process on the same node. Tracing framework will push the traces to the sidecar sink through gRPC calls and the sidecar sink will write the data to the telemetry data store.
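One possible answer to the open question about limiting memory is a fixed-capacity buffer that drops spans once it is full and counts the drops. The sketch below assumes a drop-newest policy, which the issue does not specify; `BoundedSpanBuffer`, `offer`, and `flush` are hypothetical names, and the code is not thread-safe as written.

```python
from collections import deque


class BoundedSpanBuffer:
    """Fixed-capacity span buffer that drops new spans when full (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._spans = deque()
        self.dropped = 0  # number of spans lost to backpressure

    def offer(self, span):
        """Add a span; return False and count a drop if the buffer is full."""
        if len(self._spans) >= self.capacity:
            self.dropped += 1
            return False
        self._spans.append(span)
        return True

    def flush(self, sink, max_batch=None):
        """Drain up to max_batch spans into the sink callable; return the count sent."""
        pending = len(self._spans)
        n = pending if max_batch is None else min(max_batch, pending)
        batch = [self._spans.popleft() for _ in range(n)]
        if batch:
            sink(batch)
        return len(batch)


exported = []  # stands in for a telemetry data store
buf = BoundedSpanBuffer(capacity=2)
for i in range(3):
    buf.offer({"span_id": i})  # the third offer is rejected
buf.flush(exported.extend)
```

Memory use is bounded by `capacity` times the size of a span, and the `dropped` counter makes the loss visible instead of silent.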
We will support both head-based and tail-based sampling and use them in combination. For example, head-based sampling may make more sense for search queries, where we can filter out traces for the same query, while tail-based sampling makes sense for background tasks, etc.
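To make the combination of head-based and tail-based sampling concrete, here is a minimal sketch with one sampler per component. The class names, the `SAMPLERS` mapping, the 10% ratio, and the 500 ms threshold are all assumptions chosen for illustration. In a real tracer the head decision would be made when the trace starts and the tail decision when it completes; both are shown behind one method here only to keep the example short.

```python
import random


class HeadSampler:
    """Decides up front with a fixed probability, before the outcome is known."""

    def __init__(self, ratio, rng=None):
        self.ratio = ratio
        self.rng = rng or random.Random()

    def should_sample(self, trace):
        # The trace content is ignored: this is why slow requests can be missed.
        return self.rng.random() < self.ratio


class TailSampler:
    """Decides after completion, keeping traces at or above a duration threshold."""

    def __init__(self, min_duration_ms):
        self.min_duration_ms = min_duration_ms

    def should_sample(self, trace):
        return trace["duration_ms"] >= self.min_duration_ms


# Hypothetical per-component configuration.
SAMPLERS = {
    "search": HeadSampler(ratio=0.1),
    "background": TailSampler(min_duration_ms=500),
}


def keep(component, trace):
    """Return True if the component's sampler wants this trace stored."""
    sampler = SAMPLERS.get(component)
    return sampler is not None and sampler.should_sample(trace)
```

For example, `keep("background", {"duration_ms": 120})` is false because the trace finished under the threshold, while a 900 ms background trace would be kept.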
## Next Steps ##
1. Decide on the telemetry framework. OpenTelemetry looks like the natural choice, but we need to deep dive and see whether it makes sense for a system like OpenSearch.
2. Prototyping the trace creation and context propagation along the request.
3. Sampling approaches.
4. Sink/Exporter approach.
5. Levels of tracing
Hi @Leeroy
Thanks for your reply.