Failed to parse field [vector] of type [knn_vector]

OpenSearch Managed Cluster version 2.15

I have set up an ingestion pipeline as follows:

version: '2'
log-pipeline:
  source:
    s3:
      codec:
        parquet:
      compression: none
      aws:
        region: us-east-1
        sts_role_arn: '<ARN>'
      acknowledgments: true
      scan:
        buckets:
          - bucket:
              name: test
              filter:
                include_prefix:
                  - embeddings/
      delete_s3_objects_on_read: false
  processor:
    - date:
        destination: 'ingested_at'
        from_time_received: true
  sink:
    - opensearch:
        hosts: [<HOST>]
        index: 'test'
        aws:
          sts_role_arn: '<ARN>'
          region: us-east-1
        dlq:
          s3:
            bucket: test-dlq
            region: us-east-1
            sts_role_arn: '<ARN>'

An example of the Polars dataframe saved to parquet is as follows:

timestamp		type			vector
i64				str				list[f64]
1727076649		a				[0.042296, 0.047431, … -0.010195]
1727093762		b				[0.0, 0.0, … 0.0]
1727052674		a				[-0.062857, -0.040043, … -0.039441]

My index template is as follows:

{
  "index_patterns": [
    "test*"
  ],
  "template": {
    "settings": {
      "index.knn": true,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "timestamp": {
          "type": "integer"
        },    
        "type": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "vector": {
          "type": "knn_vector",
          "dimension": 300,
          "method": {
            "engine": "nmslib",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        }
      }
    }
  }
}

In the logs, I’m getting the following errors:

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'tGwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=-0.109751857817173}'

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'smwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=-0.06285733729600906}'

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'sWwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=0.0}'

I’m wondering whether I have the wrong data type, or whether I have set up the pipeline or the index template incorrectly, to cause this error.

Does anyone have any thoughts on my issue?

M.

There are two possible reasons why your cluster failed to parse the knn_vector field.

    1. Does the vector field in the Parquet data in your S3 bucket have 300 dimensions? The dimension of the actual data must match the dimension in its index mapping.
    2. I’m suspicious of the vector type saved in the Parquet file. Your example shows list[f64], but I believe float32 is required, as it is for the supported pretrained models.
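
Both points can be checked on a sample document before it reaches the sink. The following is only a minimal sketch in plain Python, not part of Data Prepper or OpenSearch: `check_vector` and `EXPECTED_DIMENSION` are hypothetical names, and the `wrapped` sample is an assumption based on the `{element=...}` preview in the logs, which may mean each list item arrives as an object instead of a bare number.

```python
from numbers import Real

EXPECTED_DIMENSION = 300  # matches "dimension" in the index template


def check_vector(vector, expected_dimension=EXPECTED_DIMENSION):
    """Return a list of problems found in one document's vector field."""
    if not isinstance(vector, (list, tuple)):
        return [f"expected a list, got {type(vector).__name__}"]

    problems = []
    if len(vector) != expected_dimension:
        problems.append(
            f"expected {expected_dimension} values, got {len(vector)}"
        )

    for position, value in enumerate(vector):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(
                f"item {position} is {type(value).__name__}, not a number"
            )
            break  # one example is enough to show the shape is wrong
    return problems


flat = [0.0] * 300
# Hypothetical shape suggested by the log preview '{element=0.0}'.
wrapped = [{"element": 0.0}] * 300

print(check_vector(flat))        # []
print(check_vector(wrapped))     # ['item 0 is dict, not a number']
print(check_vector(flat[:299]))  # ['expected 300 values, got 299']
```

If a document read back from the Parquet file reports a `dict` item, the problem is likely the shape of the list items, not the dimension or the float width.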

Hello,

Thanks for giving your insight.

For point 1, the vector has 300 dimensions.

For point 2, I changed the data type to list[float32] and still get the same error.

M.