AWS OpenSearch Service lost one node during indexing of k-NN vectors

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 1.3 on AWS

Describe the issue:
I have an OpenSearch Service domain on AWS with 2 nodes. During each indexing run with k-NN vectors, one node was lost near the end of indexing. Thanks to the automatic remediation of red clusters, the lost node was restored 30 minutes later.

The vector index contains some text fields and a k-NN field, with more than 2 million documents; the total index size is 11 GB.
I have a Python 3 program that reads data from a source index with scroll and runs the vector indexing with _bulk, roughly as in the sketch below.
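For context, here is a minimal sketch of that flow, assuming the opensearch-py client; the endpoint, credentials, source index name, and batch size are placeholders, not my real values.

from opensearchpy import OpenSearch, helpers

client = OpenSearch(
    hosts=[{"host": "my-domain.eu-west-1.es.amazonaws.com", "port": 443}],  # placeholder endpoint
    http_auth=("user", "password"),  # placeholder credentials
    use_ssl=True,
)

SOURCE_INDEX = "source_index"   # placeholder name
TARGET_INDEX = "my_knn_index"   # name taken from the logs below

def generate_actions():
    # Scroll through the source index and emit one bulk action per document.
    for hit in helpers.scan(client, index=SOURCE_INDEX, scroll="5m", size=500):
        src = hit["_source"]
        yield {
            "_index": TARGET_INDEX,
            "_id": hit["_id"],
            "_source": {
                "title": src.get("title"),
                "text": src.get("text"),
                "vector": src.get("vector"),  # 256-dimension float list
            },
        }

# Send the documents to the _bulk API in batches.
helpers.bulk(client, generate_actions(), chunk_size=500, request_timeout=120)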

How can I avoid losing a node during indexing?

Configuration:
OpenSearch Service domain on AWS: c6g.xlarge.search instance type, 2 data nodes, each with 4 vCPU, 8 GB RAM, 200 GB storage, 6000 IOPS, and 256 MB/s throughput

Schema:

{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "knn": true,
      "knn.algo_param.ef_search": 512
    }
  },
  "mappings": {
    "dynamic": "false",
    "properties": {
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "vector": {
        "type": "knn_vector",
        "dimension": 256,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 2000,
            "m": 16
          }
        }
      }
    }
  }
}
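For completeness, a small sketch of how the index is created with this schema from the same Python program (the schema file name is hypothetical; the client is the one from the sketch above):

import json

# Load the schema shown above (hypothetical file name) and create the k-NN index.
with open("knn_index_schema.json") as f:
    index_body = json.load(f)

client.indices.create(index="my_knn_index", body=index_body)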

Relevant Logs or Screenshots:
(Screenshot: IndexingRate metric from CloudWatch)

Why did the node go down? Do you see anything in the logs?

I don't know the reason. Here are some logs from CloudWatch:

[2023-10-02T21:19:33,271][WARN ][o.o.c.NodeConnectionsService] [19a43ee7408bca1891925f5ed1897d14] failed to connect to {e7736a956ab97de531be4c788aa799ed}{XLPkuGNkSPulv8HsUpJChw}{0DFrYfW3Q8C4mAlZ_gwv5Q}{IP}{IP}{dimr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, AMAZON_INTERNAL, cross_cluster_transport_address=IP, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, AMAZON_INTERNAL, search_backpressure_feature_present=true} (tried [25] times)
ConnectTransportException[[xxx][IP] handshake failed. unexpected remote node {xxx}{xxx}{xxx}{IP}{IP}{dimr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, AMAZON_INTERNAL, cross_cluster_transport_address=IP, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, AMAZON_INTERNAL, search_backpressure_feature_present=true}]

[2023-10-02T21:20:17,508][WARN ][o.o.i.c.IndicesClusterStateService] [xxx] [my_knn_index][0] marking and sending shard failed due to [shard failure, reason [merge failed]]
org.apache.lucene.index.MergePolicy$MergeException: java.lang.RuntimeException: java.lang.RuntimeException: [KNN] Adding footer to serialized graph failed: org.apache.lucene.index.MergePolicy$MergeAbortedException: Merge aborted.

During the indexing, CPUUtilization was around 60%, JVMMemoryPressure was between 80% and 100%, and FreeStorageSpace was 220 GiB.
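For reference, a hedged sketch of how one could check the k-NN plugin's native graph memory per node via the _plugins/_knn/stats endpoint (using the client from the first sketch; stat names follow the k-NN plugin documentation and may vary by version):

# Query the k-NN plugin stats and print per-node native graph memory usage (reported in KB).
stats = client.transport.perform_request("GET", "/_plugins/_knn/stats")
for node_id, node_stats in stats.get("nodes", {}).items():
    print(node_id, node_stats.get("graph_memory_usage"), "KB of native memory used by HNSW graphs")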

@Garance Is it possible to reproduce this in the latest versions (>= 2.5)? We can certainly take a look, but we want to make sure this isn't something we already fixed in 2.x.

@vamshin Thank you for the response; I will test the latest version.
