How to identify queries loading _id fielddata without enabling DEBUG slow logs?

Versions:

  • OpenSearch 2.19.0

Describe the issue:

We’re seeing several GB of _id fielddata loaded on data nodes in our production cluster. _cat/fielddata?v&s=size:desc shows _id as the top fielddata consumer.
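
To line the growth up with application traffic, the per-node _id numbers could be sampled over time. A minimal sketch, assuming the node stats API (GET _nodes/stats/indices/fielddata?fields=_id) reports per-field sizes under nodes.<id>.indices.fielddata.fields; the function name and the sample payload are made up for illustration and should be checked against a real response:

```python
import json


def id_fielddata_by_node(stats: dict, field: str = "_id") -> dict:
    """Map node name -> fielddata bytes held for `field`.

    `stats` is the parsed JSON body of
    GET _nodes/stats/indices/fielddata?fields=_id
    (response layout is an assumption; verify on your cluster).
    """
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        fields = node.get("indices", {}).get("fielddata", {}).get("fields", {})
        size = fields.get(field, {}).get("memory_size_in_bytes", 0)
        # Fall back to the node ID if the response carries no node name.
        result[node.get("name", node_id)] = size
    return result


# Hypothetical payload, trimmed to the keys the function reads.
sample = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "data-1",
      "indices": {"fielddata": {"memory_size_in_bytes": 4096,
                                "fields": {"_id": {"memory_size_in_bytes": 3072}}}}
    },
    "def456": {
      "name": "data-2",
      "indices": {"fielddata": {"memory_size_in_bytes": 0, "fields": {}}}
    }
  }
}
""")

sizes = id_fielddata_by_node(sample)
print(sizes)
```

Running this on a schedule and logging the result with a timestamp gives a time series that can be compared against deploys, batch jobs, or dashboard usage.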

We also see circuit breaker errors:

[FIELDDATA] Data too large, data for [_id] would be [X bytes], which is larger than the limit of [Y bytes]

Something is forcing _id to load into heap via fielddata. However, we can’t find the source — our slow logs show no queries that sort or aggregate on _id.

We also run many Painless scripts inside aggregations and are unsure if any of those could implicitly trigger _id fielddata loading.

How can we identify which queries are loading _id fielddata?

We can’t enable slow log at DEBUG/TRACE level — the query volume would overwhelm the cluster. Is there another approach? For example:

  • Does temporarily lowering indices.breaker.fielddata.limit to a very small value cause the exception stack to include the query source?

  • Can _nodes/hot_threads during high fielddata periods help trace it?

  • Are there any known internal operations that sort on _id?

  • Can Painless scripts using params._source implicitly trigger _id fielddata?
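
One way to narrow this down without slow logs would be to scan search bodies captured on the application side (client logs, a proxy, saved dashboard queries) for the patterns mentioned above: a sort on _id, an aggregation with "field": "_id", or a script that reads doc['_id']. A rough sketch under those assumptions; the helper names are hypothetical and the check is a heuristic, not a complete list of what can load _id fielddata:

```python
import re

# Matches doc['_id'] or doc["_id"] inside a script source string.
ID_IN_SCRIPT = re.compile(r"""doc\s*\[\s*['"]_id['"]\s*\]""")


def _sort_hits(value, path):
    """Return paths of sort entries that name _id directly."""
    entries = value if isinstance(value, list) else [value]
    hits = []
    for i, entry in enumerate(entries):
        if entry == "_id" or (isinstance(entry, dict) and "_id" in entry):
            hits.append(f"{path}[{i}]")
    return hits


def find_id_references(node, path=""):
    """Walk a search body and list the paths where _id is referenced."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "field" and value == "_id":
                hits.append(child)
            if key == "sort":
                hits.extend(_sort_hits(value, child))
            if key in ("source", "script") and isinstance(value, str):
                if ID_IN_SCRIPT.search(value):
                    hits.append(child)
            # Always descend, so nested aggs and script sorts are covered.
            hits.extend(find_id_references(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_id_references(item, f"{path}[{i}]"))
    return hits


# Illustrative search body, not taken from our cluster.
body = {
    "query": {"match_all": {}},
    "sort": [{"_id": "asc"}],
    "aggs": {
        "by_id": {"terms": {"field": "_id"}},
        "scripted": {"terms": {"script": {"source": "doc['_id'].value"}}},
    },
}

found = find_id_references(body)
print(found)
```

Stored scripts are referenced by ID rather than inline source, so their bodies would need to be fetched and checked separately.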

Any guidance appreciated.

@YossiCohen I’ve seen similar errors when the cluster was getting unstable due to high load.

How long have you been seeing those errors in the cluster? Has anything changed recently in the way the cluster is used (higher cluster utilization, increased ingest or search load, etc.)?
Which plugins do you use, or have you recently started using or testing?