Transform job accuracy

SideHero · March 28, 2025, 11:04am

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

{
  "name": "653e1208a3ec",
  "cluster_name": "opensearch",
  "cluster_uuid": "XsEC1-tETh-cCcAWdBgaOQ",
  "version": {
    "distribution": "opensearch",
    "number": "2.19.1",
    "build_type": "tar",
    "build_hash": "2e4741fb45d1b150aaeeadf66d41445b23ff5982",
    "build_date": "2025-02-27T01:16:47.726162386Z",
    "build_snapshot": false,
    "lucene_version": "9.12.1",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

Describe the issue:
I have a base index called X. In the past I have made a transform job that does some aggregation on X, namely summarizing the number of documents for a certain term. I noticed after a while that the aggregated data is not accurate with the actual base index X (the doc counts didn’t align). First of all, how is this possible? (I checked the transform job and no more documents needed to be processed)

I then made another transform job, using the exact same settings as the previous one. After it was done processing the documents from the base index, I noticed that the metrics of both transform jobs don’t align at all.

Here are the transform job configurations:

{
  "_id": "transform-x",
  "_version": 8,
  "_seq_no": 7116661,
  "_primary_term": 32,
  "transform": {
    "transform_id": "transform-x",
    "schema_version": 17,
    "schedule": {
      "interval": {
        "start_time": 1715097728,
        "period": 1,
        "unit": "Minutes"
      }
    },
    "metadata_id": "6ZNeSM3gb0OIsXI-p3sEBQ",
    "updated_at": 1715104885671,
    "enabled": true,
    "enabled_at": 1715104885671,
    "description": "",
    "source_index": "X",
    "data_selection_query": {
      "match_all": {
        "boost": 1
      }
    },
    "target_index": "transform_X",
    "page_size": 10,
    "groups": [
      {
        "terms": {
          "source_field": "@charid",
          "target_field": "@charid_terms"
        }
      },
      {
        "terms": {
          "source_field": "@mobid",
          "target_field": "@mobid_terms"
        }
      }
    ],
    "aggregations": {
      "count_@mobid": {
        "value_count": {
          "field": "@mobid"
        }
      }
    },
    "continuous": true
  }
}

{
  "_id": "transform-y",
  "_version": 4,
  "_seq_no": 18382258,
  "_primary_term": 75,
  "transform": {
    "transform_id": "transform-y",
    "schema_version": 21,
    "schedule": {
      "interval": {
        "start_time": 1721312263671,
        "period": 1,
        "unit": "Minutes"
      }
    },
    "metadata_id": "HClXG3G0wgQMBppjcJfgQg",
    "updated_at": 1741993333230,
    "enabled": true,
    "enabled_at": 1741993333230,
    "description": "testing purposes",
    "source_index": "X",
    "data_selection_query": {
      "match_all": {
        "boost": 1
      }
    },
    "target_index": "transform_X2",
    "page_size": 10,
    "groups": [
      {
        "terms": {
          "source_field": "@charid",
          "target_field": "@charid_terms"
        }
      },
      {
        "terms": {
          "source_field": "@mobid",
          "target_field": "@mobid_terms"
        }
      }
    ],
    "aggregations": {
      "count_@mobid": {
        "value_count": {
          "field": "@mobid"
        }
      }
    },
    "continuous": true
  }
}

And here are the metrics:

{
  "transform-x": {
    "metadata_id": "6ZNeSM3gb0OIsXI-p3sEBQ",
    "transform_metadata": {
      "transform_id": "transform-x",
      "last_updated_at": 1743152459339,
      "status": "started",
      "failure_reason": null,
      "stats": {
        "pages_processed": 667234,
        "documents_processed": 6498416982,
        "documents_indexed": 2317772,
        "index_time_in_millis": 34657427,
        "search_time_in_millis": 7209289
      },
      "continuous_stats": {
        "last_timestamp": 1743152457906,
        "documents_behind": {
          "X": 23
        }
      }
    }
  }
}

{
  "transform-y": {
    "metadata_id": "HClXG3G0wgQMBppjcJfgQg",
    "transform_metadata": {
      "transform_id": "transform-y",
      "last_updated_at": 1743152444904,
      "status": "started",
      "failure_reason": null,
      "stats": {
        "pages_processed": 457313,
        "documents_processed": 2932486986,
        "documents_indexed": 1709538,
        "index_time_in_millis": 35491862,
        "search_time_in_millis": 6177677
      },
      "continuous_stats": {
        "last_timestamp": 1743152443778,
        "documents_behind": {
          "X": 50
        }
      }
    }
  }
}

How is it possible that the aggregated data fails to align with the base index? And why are the two transforms not aligning in terms of processing stats?

Topic		Replies	Views
Transform Job Missing Data Index Management	1	201	July 5, 2024
Transform Job "Failed to get the modified buckets in source indices" Index Management troubleshoot , index-management	2	359	April 30, 2024
My transform job is taking forever Index Management troubleshoot , index-management	1	383	February 23, 2024
Unable to run my transform job after updating to OpenSearch 2.11 Index Management troubleshoot	3	401	April 9, 2024
Error while creating Index transform Index Management troubleshoot , index-management	4	573	August 22, 2023

Transform job accuracy

Related topics