Handling Traces through a serverless framework API without caller control

I am looking into the OpenSearch trace analytics stack to help instrument a largely serverless (Lambda) API stack, and I have hit an issue with how Data Prepper handles traces from the Lambda OTEL collector. It seems that Data Prepper does not handle traces that cannot be traced back to a ROOT context (a parent context of NULL). I have a feeling this might be on purpose, but our interactions are largely scripted or 3rd party, so either there’s no UI to speak of or we don’t have control over it to set the root context there. The OTEL Lambda layers generate or inherit a parent from the vendor (AWS) environment, so once the request gets to Lambda there’s always a parent request in the context, but that parent is never in the active context. That context, parent included, gets sent to Data Prepper. Data Prepper never receives the root (null parent) context and therefore never sets a traceGroup field, which is what the entire trace analytics plugin keys on.

Does anyone have suggestions on how to handle this case? I don’t see an option to allow data prepper to use the top-most trace as the traceGroup, even if it’s not a null parent (that would be ideal, I think).

@mentzerk Thank you for your interest in using Data Prepper. Yes, Data Prepper currently extracts the traceGroup field using a single hardcoded criterion: parentSpanId == null. That causes issues in your scenario and, more generally, with serverless instrumentation.
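To make the effect of that criterion concrete, here is a minimal sketch in JavaScript. It is a simplified model and not Data Prepper's actual code: the function `assignTraceGroup` and the span field names (`spanId`, `parentSpanId`, `name`, `traceGroup`) are assumptions chosen for illustration, as are the sample span values.

```javascript
// Simplified model only. Function and field names are illustrative
// assumptions, not Data Prepper's real implementation.
function assignTraceGroup(spans) {
  // The root span is the one whose parentSpanId is null or empty.
  const root = spans.find(
    (s) => s.parentSpanId == null || s.parentSpanId === ''
  );
  // If no root span was received, traceGroup stays unset for the whole trace.
  return spans.map((s) => ({ ...s, traceGroup: root ? root.name : null }));
}

// A Lambda-style trace: the entry span points at a parent that was
// never exported, so no span has a null parent.
const lambdaTrace = [
  { spanId: 'a1', parentSpanId: 'aws-0', name: 'service-a' },
  { spanId: 'b1', parentSpanId: 'a1', name: 'service-b' },
];

const result = assignTraceGroup(lambdaTrace);
console.log(result.map((s) => s.traceGroup)); // every entry is null
```

Under this model, the spans would still be stored, but none of them would carry a traceGroup.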

Does anyone have suggestions on how to handle this case? I don’t see an option to allow data prepper to use the top-most trace as the traceGroup, even if it’s not a null parent (that would be ideal, I think).

Unfortunately, we do not have an option to handle that case yet. We would need to either

(1) enrich the root span identification criteria for a traceId, or
(2) allow user customization of the root span identifier.

Those could be enhancements or features to put on the Data Prepper roadmap. Would you mind sharing with us:

(1) Are you using an instrumentation library in your serverless stack? If so, which one?
(2) What might be a good identifier of the root span in your use case? Or what does “top-most trace” mean in your description?

@qchea - Thanks for getting back to me. I’d be happy to share that information.

For 1 - We’ve tried both the Python and NodeJS flavors of the AWS ADOT Lambda collector AND the upstream OTEL Lambda implementation, with similar results. AWS Lambda seems to be generating a context that both implementations interpret as a pre-existing context coming into the execution environment (even when we remove/don’t use the X-Ray-specific code that ADOT adds). We can only get traces into the trace analytics plugin if we explicitly set the ROOT_CONTEXT in a new span within the Lambda, which becomes a little problematic when there’s more than one data flow into the system.

For 2 - When referring to the “top most” trace, I was thinking that could be the terminal trace in Data Prepper when it’s building the service map and traceGroups. As an example: Service A is the entry point and has an errant parent from AWS Lambda. It then calls Services B and C, which have Service A as their parent, and Service B calls Service D. In that case, Data Prepper would use the info it did have and treat Service A as the traceGroup and “parent”.
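The Service A to D example can be sketched roughly as follows. This is a hypothetical fallback, not an existing Data Prepper option: the function `findTopMostSpan`, the field names, and the sample IDs are all made up for illustration. The idea is to prefer a true root (null parent) and otherwise pick a span whose parent was never received.

```javascript
// Hypothetical fallback, not an existing Data Prepper feature.
// Field names and IDs are illustrative assumptions.
function findTopMostSpan(spans) {
  const knownIds = new Set(spans.map((s) => s.spanId));
  return (
    // 1. Prefer a true root span (no parent at all).
    spans.find((s) => s.parentSpanId == null) ||
    // 2. Otherwise take the first span whose parent is not in the received set.
    spans.find((s) => !knownIds.has(s.parentSpanId)) ||
    null
  );
}

// Service A has an errant parent from the Lambda environment;
// B and C are children of A, and D is a child of B.
const trace = [
  { spanId: 'a', parentSpanId: 'aws-parent', serviceName: 'service-a' },
  { spanId: 'b', parentSpanId: 'a', serviceName: 'service-b' },
  { spanId: 'c', parentSpanId: 'a', serviceName: 'service-c' },
  { spanId: 'd', parentSpanId: 'b', serviceName: 'service-d' },
];

const top = findTopMostSpan(trace);
console.log(top.serviceName); // service-a
```

One caveat with this approach: if spans are dropped or arrive late, more than one span can look like an orphan, so a real implementation would presumably need a time window and a tie-breaking rule (for example, earliest start time).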

For reference - here are the links to the AWS ADOT collector and the OTEL Lambda Collector:

@mentzerk Thanks for sharing. For 1, we will need to investigate the trace structure produced by the Lambda instrumentation library further and will reach out for more info if necessary. Just to follow up and clarify on

We can only get traces into the trace analytics plugin if we explicitly set the ROOT_CONTEXT in a new span within the Lambda

  1. By “get traces into the trace analytics plugin”, do you mean that traceGroup and traceGroupFields will not be empty if you explicitly set the ROOT_CONTEXT? I expect the spans will still be exported successfully to the OpenSearch backend either way. The difference is whether traceGroup is populated.

  2. Would you mind giving an example of how explicitly setting ROOT_CONTEXT manifests itself in a new span?

For 2, would you mind sharing some sample spans that all belong to a particular traceId? In particular, we would like to know what the root span in Service A looks like in its key-value pairs, which might help us establish new criteria.

@qchea - Yes, you are correct on number 1. The records are inserted into ES either way, but they only have a traceGroup if there’s a valid ROOT_CONTEXT in the chain of traces that Data Prepper receives. The trace analytics plugin side of Kibana seems to only show traces with a traceGroup set, so not having a traceGroup effectively excludes those items from the plugin.

As for setting the root context, it varies by language - but in NodeJS, we do something like this:

const parentSpan = tracer.startSpan('main', undefined, opentelemetry.ROOT_CONTEXT);

Then we create spans with that span as the parent, and everything works OK on the Data Prepper/trace analytics side. But that’s less than ideal, as it discards any existing span history, so it can’t be used in places that are only occasionally an entry point.