I’m migrating a very complex transform from Elasticsearch to OpenSearch, and it isn’t going as smoothly as I expected: min and max aggregations on date fields in OpenSearch produce ridiculous results like -Infinity and 1.620755626993E12.
The _explain endpoint does not provide any helpful information when debugging issues with transforms:
"status" : "failed",
"failure_reason" : "Failed to index the documents",
Where are useful errors being logged? I have re-run the transform twice, and both times it stopped very close to the same spot, so I’m sure it’s an edge case my cleanup pipeline isn’t handling yet, but a faster way to get pointed to the errors would make a world of difference.
Nope. It’s gotten worse as our data size has grown. We are unable to run anything other than the most trivial transforms on pathetically small data sets. I have built ingest pipelines to take care of the petty oversights like Infinity / -Infinity and scientific notation on date field aggregations. My solution for finding documents that fail in the transform is exceedingly manual:
1. Set transform page size to 1
2. Run transform until it fails
3. Manually inspect source data for possible issues related to the transform around the spot where the failure occurs.
3.a. Run the transform in preview with a query to limit it to just the data that seems to be failing to try to understand why.
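For anyone wanting to script the manual procedure above, here is a minimal Python sketch that only prepares the request bodies and sends nothing to the cluster. It assumes the transform definition uses the `page_size` and `data_selection_query` fields from the OpenSearch transform API; the index names, the `timestamp` field, the date range, and the `debug_variant` helper are all hypothetical.

```python
import copy


def debug_variant(transform, narrow_query=None):
    """Return a copy of a transform definition tuned for debugging.

    page_size is forced to 1 so a failure points at a single bucket, and the
    source query is optionally narrowed to the suspect slice of data.
    """
    variant = copy.deepcopy(transform)  # leave the production definition untouched
    variant["page_size"] = 1
    if narrow_query is not None:
        variant["data_selection_query"] = narrow_query
    return variant


# Hypothetical production transform (groups and aggregations omitted for brevity).
transform = {
    "source_index": "events",
    "target_index": "events_rollup",
    "page_size": 1000,
    "data_selection_query": {"match_all": {}},
}

# Hypothetical range around the spot where the job stopped.
suspect_range = {
    "range": {
        "timestamp": {"gte": "2022-01-01T00:00:00Z", "lt": "2022-01-02T00:00:00Z"}
    }
}

narrowed = debug_variant(transform, suspect_range)
```

The `narrowed` body is what you would submit to the preview endpoint for step 3.a, while `debug_variant(transform)` on its own covers step 1.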
I believe at the time that I posted this my issues were all related to unexpected values for min / max transforms on things like dates or counts. In my case I wrote an ingest pipeline that performs sanity checks and proper conversion for every field coming out of the transform. (For us, Infinity / -Infinity have no value, and dates should be ISO8601, longs shouldn’t be in E notation, etc.)
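To illustrate the kind of sanity checks described above, here is a rough Python sketch of the logic. It is not the actual ingest pipeline, which would run inside OpenSearch; the field names and the `clean_value` / `clean_document` helpers are made up for the example, and dropping non-finite values is an assumption based on the note that they have no value for us.

```python
from decimal import Decimal, InvalidOperation


def clean_value(value):
    """Normalise one numeric value coming out of a transform.

    Returns None for Infinity / -Infinity / NaN, an int for whole numbers
    (including E-notation strings or floats), and anything else unchanged.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        return value
    try:
        number = Decimal(str(value))
    except InvalidOperation:
        return value  # not numeric at all, e.g. an ISO 8601 string
    if not number.is_finite():
        return None
    if number == number.to_integral_value():
        return int(number)
    return value


def clean_document(doc, numeric_fields):
    """Apply clean_value to the named fields only; drop fields with no usable value."""
    cleaned = dict(doc)
    for field in numeric_fields:
        if field in cleaned:
            value = clean_value(cleaned[field])
            if value is None:
                del cleaned[field]
            else:
                cleaned[field] = value
    return cleaned


# Hypothetical transform output document.
doc = {"host": "a", "max_ts": "1.620755626993E12", "min_ts": float("-inf"), "count": 3.0}
result = clean_document(doc, ["max_ts", "min_ts", "count"])
```

Restricting the cleanup to an explicit list of fields keeps keyword fields that merely look numeric from being converted by accident.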
In addition to the completely opaque error reporting, my biggest problem is that large transforms will fail to start, or will claim to start but never produce any meaningful results in the output index.
Admittedly, we are still on the 1.3 release branch, so there is a chance things have improved in 2.4, but from my cursory glances at the release notes I haven’t seen anything directly addressing our challenges.
We are moving our largest datasets out of OpenSearch. It has not proven to be the right platform for our data as it exploded by an order of magnitude this year.
EDIT: I forgot to answer the question the first time.
I’m sure there is a more efficient path to parsing the scientific notation into a Long for conversion back into an OffsetDateTime (we log everything in UTC time), but this works and other parts of our ingest pipelines are slower than this.
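The original snippet is not reproduced here, so as a stand-in, this is a small Python sketch of the same idea: parse the scientific-notation value into a whole number of epoch milliseconds, then format it as an ISO 8601 timestamp in UTC. It assumes the value is epoch milliseconds (which is what the date aggregations appear to return), and `epoch_millis_to_iso` is a name invented for the example.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_iso(value):
    """Convert epoch milliseconds (possibly in E notation) to an ISO 8601 UTC string."""
    # Decimal parses "1.620755626993E12" exactly, avoiding float rounding.
    millis = int(Decimal(str(value)))
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds")


iso = epoch_millis_to_iso("1.620755626993E12")
```

Going through `Decimal` rather than `float` is a deliberate choice: it keeps the millisecond digits exact no matter how the value was serialised.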
Hope it helps a little. We are moving all of our large data processing out of OpenSearch and into Apache Pulsar and Pinot. We will keep some exploratory data in OpenSearch, but it will be an order of magnitude less, and none of our large-scale data transformations are going to be done in OpenSearch.