Troubleshooting Transform Failures

I’m in the process of migrating a very complex transform from Elasticsearch to OpenSearch, and it isn’t going as smoothly as I expected: min and max aggregations on date fields in OpenSearch produce nonsensical results like -Infinity and 1.620755626993E12.

The _explain endpoint does not provide any helpful information when debugging issues with transforms:

"status" : "failed",
      "failure_reason" : "Failed to index the documents",

Where are useful errors being logged? I have re-run the transform twice, and both times it stopped very close to the same spot, so I’m sure it’s an edge case my cleanup pipeline isn’t handling yet, but having a faster way to get pointed to the errors would make a world of difference.

@jbolle were you able to resolve this? I am also noticing the same error.

Nope. It’s gotten worse as our data size has grown. We are unable to run anything other than the most trivial transforms on pathetically small datasets. I have built ingest pipelines to take care of the petty oversights like Infinity / -Infinity and scientific notation on date field aggregations. My solution for finding documents that fail in the transform is exceedingly manual:

  1. Set transform page size to 1
  2. Run transform until it fails
  3. Manually inspect source data for possible issues related to the transform around the spot where the failure occurs.
    3.a. Run the transform in preview with a query that limits it to just the data that seems to be failing, to try to understand why (see the preview sketch below).
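
For step 3.a, the preview call looks roughly like this (the field name and range in the data_selection_query are placeholders for whatever slice of source data looks suspicious; the rest of the body is the same definition as the failing job):

POST _plugins/_transform/_preview
{
  "transform": {
    ... same definition as the failing job ...
    "data_selection_query": {
      "range": {
        "timestamp": {
          "gte": "2023-01-01T00:00:00Z",
          "lt": "2023-01-02T00:00:00Z"
        }
      }
    }
  }
}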

I believe at the time I posted this, my issues were all related to unexpected values from min/max aggregations on things like dates or counts. In my case I wrote an ingest pipeline that performs sanity checks and proper conversions for every field coming out of the transform. (For us, Infinity / -Infinity have no value, dates should be ISO 8601, longs shouldn’t be in E notation, etc.)

In addition to the completely opaque error reporting, my biggest problem is that large transforms will fail to start, or will claim to start but never produce any meaningful results in the output index.
Admittedly, we are still on the 1.3 release branch, so there is a chance things have improved in 2.4, but from my cursory glances at the release notes I haven’t seen anything directly addressing our challenges.
We are moving our largest datasets out of OpenSearch. It has not proven to be the right platform for our data, which grew by an order of magnitude this year.

EDIT: I forgot to answer the question the first time.

I’m also seeing this, and we’re on 2.3. My dataset is tiny (6 docs).

A min aggregation on a missing field resulted in -Infinity rather than null. I added "missing" to the aggregation to prevent this, which is not ideal.
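
In transform terms that looks something like the following (field names are just examples; the missing value is substituted for documents that lack the field, which is why it feels like a workaround rather than a fix):

"aggregations": {
  "last_seen": {
    "min": {
      "field": "timestamp",
      "missing": 0
    }
  }
}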

The min/max aggregations output a double, which prints in scientific notation for sufficiently large numbers, e.g. the epoch-millis value of a date.

The preview works.
It also works if I take the docs from the preview output and manually PUT them into the target index (after converting the scientific-notation values to plain numbers).

However, trying to run this as a transform job, I get the same error as in the OP.

Here is the script ingest processor I created to clean up our data coming from the transform:

{
  "script" : {
    "ignore_failure" : false,
    "on_failure" : [
      {
        "append" : {
          "field" : "pipeline_error",
          "value" : [
            "clean_malformed_dates"
          ]
        }
      }
    ],
    "lang" : "painless",
    "description" : "Cleans malformed date fields coming out of the transform.",
    "source" : """
    for (f in params['fields']) {
      if (ctx[f] == null) {
        continue;
      }
      // Drop the useless sentinel values min/max produces when the data is missing.
      if (ctx[f] == "Infinity" || ctx[f] == "-Infinity") {
        ctx.remove(f);
        continue;
      }
      // Collapse scientific notation (e.g. 1.620755626993E12) to a plain long.
      if (ctx[f].toString().contains('E')) {
        ctx[f] = Double.parseDouble(ctx[f].toString()).longValue();
      }
      // Convert epoch millis to an ISO 8601 timestamp in UTC.
      String milliSinceEpochString = ctx[f].toString();
      long milliSinceEpoch = Long.parseLong(milliSinceEpochString);
      Instant instant = Instant.ofEpochMilli(milliSinceEpoch);
      ZonedDateTime zdt = ZonedDateTime.ofInstant(instant, ZoneId.of('Z'));
      ctx[f] = zdt;
    }
    """,
    "params" : {
      "fields" : [
        "last_report",
        "last_update",
        "last_ingest",
        "last_timestamp",
        "last_seen",
        "first_report",
        "first_update",
        "first_ingest",
        "first_timestamp",
        "first_seen",
        "updated_at_interval"
      ]
    }
  }
}

I’m sure there is a more efficient way to parse the scientific notation into a long for conversion back into a UTC ZonedDateTime (we log everything in UTC), but this works, and other parts of our ingest pipelines are slower than this.
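
If anyone wants to experiment, one possibly shorter route (an untested sketch, assuming BigDecimal is available on the Painless allowlist) is to let BigDecimal do the parsing, since it accepts both plain and E-notation numbers and would fold the contains('E') branch into the conversion:

// untested: BigDecimal parses "1.620755626993E12" and "1620755626993" alike
long milliSinceEpoch = new BigDecimal(ctx[f].toString()).longValue();
ctx[f] = ZonedDateTime.ofInstant(Instant.ofEpochMilli(milliSinceEpoch), ZoneId.of('Z'));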

Hope it helps a little. We are moving all of our large data processing out of OpenSearch and into Apache Pulsar and Pinot. We will keep some exploratory data in OpenSearch, but it will be an order of magnitude less, and none of our large scale data transformations are going to be done in OpenSearch.

Thank you, @jbolle - that was very kind of you to share the ingest processor. It helped a great deal!

I couldn’t see a way to reference the pipeline from the transform explicitly, so I set a default pipeline on the target index to ensure the pipeline is executed.

PUT /test_transformed/_settings
{
  "index": {
    "default_pipeline": "test-pipeline"
  }
}
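
The pipeline itself is just the script processor from above, registered under that name (abbreviated here):

PUT _ingest/pipeline/test-pipeline
{
  "description": "clean up transform output before indexing",
  "processors": [
    { "script": { ... } }
  ]
}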

Yes, exactly. Sorry for not making that clear. The target index for the transform output has the cleanup pipeline set as its default pipeline, so it runs on every document the transform writes.

@jbolle @0penS3arch Can you please post your transform job definition/document?

The transform is not doing anything special. It just executes a composite aggregation query.
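
For illustration only (index, field, and schedule values are placeholders, not our actual job), it has the standard shape: a terms grouping plus min/max on a date field.

PUT _plugins/_transform/example_job
{
  "transform": {
    "enabled": true,
    "schedule": {
      "interval": {
        "period": 1,
        "unit": "Minutes",
        "start_time": 1602100553
      }
    },
    "description": "example only, not the real job",
    "source_index": "raw-events",
    "target_index": "transformed-events",
    "data_selection_query": { "match_all": {} },
    "page_size": 1000,
    "groups": [
      { "terms": { "source_field": "device_id", "target_field": "device_id" } }
    ],
    "aggregations": {
      "first_seen": { "min": { "field": "timestamp" } },
      "last_seen": { "max": { "field": "timestamp" } }
    }
  }
}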