Failed to parse field [vector] of type [knn_vector]

Amphagory · November 18, 2024, 4:15am

OpenSearch Managed Cluster version 2.15

I have set up an ingestion pipeline as follows:

version: '2'
log-pipeline:
  source:
    s3:
      codec:
        parquet:
      compression: none
      aws:
        region: us-east-1
        sts_role_arn: '<ARN>'
      acknowledgments: true
      scan:
        buckets:
          - bucket:
              name: test
              filter:
                include_prefix:
                  - embeddings/
      delete_s3_objects_on_read: false
  processor:
    - date:
        destination: 'ingested_at'
        from_time_received: true             
  sink:
    - opensearch:
        hosts: [<HOST>]
        index: 'test'
        aws:
          sts_role_arn: '<ARN>'
          region: us-east-1
        dlq:
          s3:
            bucket: test-dlq
            region: us-east-1
            sts_role_arn: '<ARN>'

An example of the Polars dataframe saved to parquet is as follows:

timestamp     type   vector
i64           str    list[f64]
1727076649    a      [0.042296, 0.047431, … -0.010195]
1727093762    b      [0.0, 0.0, … 0.0]
1727052674    a      [-0.062857, -0.040043, … -0.039441]

My index template is as follows:

{
  "index_patterns": [
    "test*"
  ],
  "template": {
    "settings": {
      "index.knn": true,    
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "timestamp": {
          "type": "integer"
        },    
        "type": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "vector": {
          "type": "knn_vector",
          "dimension": 300,
          "method": {
            "engine": "nmslib",
            "space_type": "l2",
            "name": "hnsw",
            "parameters": {}
          }
        }
      }
    }
  }
}

In the logs, I’m getting the following errors:

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'tGwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=-0.109751857817173}'

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'smwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=-0.06285733729600906}'

2024-11-13T19:31:45.154 [log-pipeline-sink-worker-2-thread-2] WARN  org.opensearch.dataprepper.plugins.sink.opensearch.BulkRetryStrategy - index = test, operation = Index, status = 400, error = failed to parse field [vector] of type [knn_vector] in document with id 'sWwCJ5MBaH-WR-jbpXBx'. Preview of field's value: '{element=0.0}'

I’m wondering whether I have the wrong data type, or whether I have set up the pipeline or the index template incorrectly.

Does anyone have thoughts on this issue?

M.

yeonghyeonKo · November 18, 2024, 2:20pm

There are two possible reasons why your cluster failed to parse the knn_vector field.

1. Does the vector field in the parquet data on S3 have 300 dimensions? The dimension of the actual data and the dimension in its index mapping must be the same.
2. I’m suspicious of the type of the vector saved in the parquet file. Your example shows list[f64], but a float32 type is required, as with the supported pretrained models.
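Both checks above can be run against a single document before it is sent to the cluster. The following is a minimal sketch in plain Python, not part of Data Prepper or OpenSearch. `check_vector` and `EXPECTED_DIMENSION` are hypothetical names, and the value 300 is taken from the index template in the question.

```python
from numbers import Real

# Assumption: matches "dimension" in the index template from the question.
EXPECTED_DIMENSION = 300


def check_vector(vector, expected_dimension=EXPECTED_DIMENSION):
    """Return a list of problems found in one vector; an empty list means none."""
    if not isinstance(vector, (list, tuple)):
        return [f"expected a list, got {type(vector).__name__}"]

    problems = []
    if len(vector) != expected_dimension:
        problems.append(
            f"expected {expected_dimension} items, got {len(vector)}"
        )

    # Report only the first item that is not a plain number
    # (bool is excluded even though Python treats it as an int).
    for position, item in enumerate(vector):
        if isinstance(item, bool) or not isinstance(item, Real):
            problems.append(
                f"item {position} is {type(item).__name__}, not a number"
            )
            break
    return problems
```

A vector that passes returns an empty list. A vector whose items are objects such as `{"element": 0.0}` is reported as not numeric, which is a separate problem from the float width.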

Amphagory · November 18, 2024, 3:00pm

Hello,

Thanks for giving your insight.

For point 1, the vector has 300 dimensions.

For point 2, I’ve changed the data type to list[float32] and it presents the same error.

M.
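One detail in the log output may be worth following up. The preview of the rejected value is `{element=...}`, which suggests that each item of the vector reaches the sink as a single-key object rather than as a bare number. If that reading is right, it would also be consistent with the switch to float32 making no difference. The sketch below shows, in plain Python, what flattening such a value would look like. `unwrap_elements` is a hypothetical helper, the `element` key is taken from the log preview, and where such a step would belong (before writing the parquet file or somewhere in the pipeline) is not settled in this thread.

```python
def unwrap_elements(vector, key="element"):
    """Flatten items shaped like {"element": x} into a plain list of floats.

    Assumption: the wrapping key is "element", as in the log preview.
    Items that are already bare numbers are passed through as floats.
    """
    flat = []
    for item in vector:
        if isinstance(item, dict) and set(item) == {key}:
            flat.append(float(item[key]))
        else:
            flat.append(float(item))
    return flat


# Usage with values shaped like the ones in the log preview.
wrapped = [{"element": -0.109751857817173}, {"element": 0.0}]
print(unwrap_elements(wrapped))
```

The result is a flat list of numbers, which is the shape a knn_vector field expects.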
