Provided Text Chunking Example fails with Neural Sparse!

OS 3.0.0 / Windows 11

Hi all,

I have tried the example here Text chunking - OpenSearch Documentation and it works very well with k-NN dense vectors. However, when I change the embedding type to neural sparse, it fails. Below is the full example.

I highly appreciate your help/suggestions.

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "B973sJYBVv-6fOo2U1W1",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
PUT testindex
{
  "settings": {
    "index": {
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "rank_features"
          }
        }
      }
    }
  }
}
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}

Ingestion fails with the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"java.util.List.iterator()\" because \"tensorsList\" is null"
      }
    ],
    "type": "null_pointer_exception",
    "reason": "Cannot invoke \"java.util.List.iterator()\" because \"tensorsList\" is null"
  },
  "status": 500
}

You can try this index mapping instead:

"passage_chunk_embedding": {
  "type": "nested",
  "properties": {
    "sparse_encoding": {
      "type": "rank_features"
    }
  }
}
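For reference, a complete index creation request with this mapping might look like the following sketch (I am assuming the same passage_text source field as in the example above; adjust names to your own schema):

```json
PUT testindex
{
  "settings": {
    "index": {}
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "sparse_encoding": {
            "type": "rank_features"
          }
        }
      }
    }
  }
}
```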

Meanwhile, you should use the sparse_encoding processor instead of the text_embedding processor in the ingest pipeline. The text_embedding processor expects dense vector output from the model, so a sparse encoding model gives it nothing to read, which is why you see the null_pointer_exception above. This is an example:

PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "sparse_encoding": {
        "model_id": "XrRvlpoBgZd8yJbf_sJH",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
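Once documents are ingested through this pipeline, you can search the nested sparse embeddings with a neural_sparse query. A sketch of what that could look like (the model_id is the one from the pipeline above; yours will differ, and the query text is just an illustration):

```json
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural_sparse": {
          "passage_chunk_embedding.sparse_encoding": {
            "query_text": "document chunking example",
            "model_id": "XrRvlpoBgZd8yJbf_sJH"
          }
        }
      }
    }
  }
}
```

The nested query is needed because each document holds multiple chunk embeddings; score_mode max scores the document by its best-matching chunk.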