Hybrid search with inner hits error "failed to expand hits"

Versions:

  • OpenSearch 3.3.2
  • OpenSearch Dashboards 3.3.0
    (Windows environment)

Overall situation:
I have a k-NN index and need to run a hybrid search that collapses on a keyword field while also retrieving the inner hits. There has been recent news on this topic, and starting with the 3.0 release it has been officially documented as a supported feature for hybrid queries (official documentation: Using inner hits in hybrid queries - OpenSearch Documentation).

Issue:
When I use collapse with inner hits in combination with a hybrid query, I get the following error:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "failed to expand hits",
    "phase": "expand",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "search_phase_execution_exception",
      "reason": "all shards failed",
      "phase": "query",
      "grouped": true,
      "failed_shards": [
        {
          "shard": 0,
          "index": "innerhits_expansion_error_index",
          "node": "fMKRQ9WpTxWJENS0O5hI6w",
          "reason": {
            "type": "e_o_f_exception",
            "reason": "read past EOF (pos=2147483647): MemorySegmentIndexInput(path=\"C:\\ZWeb\\OpenSearch\\opensearch\\data\\nodes\\0\\indices\\X3dJFJyZTfie9IdrxqrSOg\\0\\index\\_0.cfs\") [slice=_0.nvd] [slice=randomaccess]"
          }
        }
      ],
      "caused_by": {
        "type": "e_o_f_exception",
        "reason": "read past EOF (pos=2147483647): MemorySegmentIndexInput(path=\"<my_path>\") [slice=_0.nvd] [slice=randomaccess]",
        "caused_by": {
          "type": "e_o_f_exception",
          "reason": "read past EOF (pos=2147483647): MemorySegmentIndexInput(path=\"<my_path>\") [slice=_0.nvd] [slice=randomaccess]"
        }
      }
    }
  },
  "status": 500
}

The error only comes up when I try to also retrieve the inner hits. It seems that the expansion of the inner hits is failing for some reason.
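
To make the nesting in that response easier to read, here is a minimal Python sketch that walks the `caused_by` chain and lists the failed shards. The helper names `cause_chain` and `failed_shard_reasons` are hypothetical (they are not part of any OpenSearch client), and the `response` dict is a hand-trimmed copy of the error above with the long `reason` strings left out.

```python
# Trimmed copy of the error response from this post (reason strings omitted).
response = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "failed to expand hits",
        "phase": "expand",
        "failed_shards": [],
        "caused_by": {
            "type": "search_phase_execution_exception",
            "reason": "all shards failed",
            "phase": "query",
            "failed_shards": [
                {
                    "shard": 0,
                    "index": "innerhits_expansion_error_index",
                    "reason": {"type": "e_o_f_exception"},
                }
            ],
            "caused_by": {
                "type": "e_o_f_exception",
                "caused_by": {"type": "e_o_f_exception"},
            },
        },
    },
    "status": 500,
}


def cause_chain(error: dict) -> list[tuple[str, str]]:
    """Return (type, phase) for each nested caused_by level, outermost first."""
    chain = []
    node = error
    while isinstance(node, dict):
        chain.append((node.get("type", ""), node.get("phase", "")))
        node = node.get("caused_by")
    return chain


def failed_shard_reasons(error: dict) -> list[tuple[str, int, str]]:
    """Collect (index, shard, reason type) from failed_shards at every level."""
    found = []
    node = error
    while isinstance(node, dict):
        for shard in node.get("failed_shards", []):
            reason = shard.get("reason", {})
            found.append((shard.get("index"), shard.get("shard"), reason.get("type")))
        node = node.get("caused_by")
    return found


for exc_type, phase in cause_chain(response["error"]):
    print(exc_type, phase or "(no phase)")
print(failed_shard_reasons(response["error"]))
```

Read this way, the outer failure is in the `expand` phase, but the shard-level failure is reported in a nested `query` phase, which is consistent with the inner hits being fetched by a follow-up query that then hits the `e_o_f_exception`.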

Instructions to replicate the issue:

#INDEX SCHEMA
PUT /innerhits_expansion_error_index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "author": {
        "store": true,
        "type": "keyword"
      },
      "attachmentData": {
        "store": true,
        "term_vector": "yes",
        "type": "text"
      },
      "chunk_embedding": {
        "dimension": 768,
        "type": "knn_vector"
      },
      "description": {
        "store": true,
        "term_vector": "yes",
        "type": "text"
      }
    }
  }
}

#INGEST PIPELINE
PUT _ingest/pipeline/innerhits_expansion_error_index_ingest_pipeline
{
  "description": "Ingest pipeline for core: innerhits_expansion_error_index",
  "processors": [
    {
      "text_embedding": {
        "field_map": {
          "attachmentData": "chunk_embedding"
        },
        "model_id": "wxPEApsBh9UO9dEM5IFh"
      }
    }
  ]
}

#BULK INGESTION
POST /innerhits_expansion_error_index/_bulk?pipeline=innerhits_expansion_error_index_ingest_pipeline
{ "index": {} }
{ "attachmentData": """Wuthering Heights, Emily Brontë's 1847 novel, is a dark, passionate tale set on the bleak Yorkshire moors, exploring obsessive love, revenge, and social class through the destructive relationship of Catherine Earnshaw and Heathcliff, framed by a narrative where outsider Mr. Lockwood hears the tragic story from housekeeper Nelly Dean, revealing a world of fierce emotions and supernatural undertones.""", "description":"""Wuthering Heights""", "author":"""Emily Brontë""" }
{ "index": {} }
{ "attachmentData": """Emily Brontë's "The Night is Darkening Round Me" (also known as "Spellbound") is a powerful poem about being trapped by an intense, perhaps loving, force amidst a fierce, darkening natural landscape, using vivid imagery of wild winds, snow, and endless wastes to convey a feeling of being bound by a "tyrant spell" that, despite its gloom, the speaker welcomes, refusing to leave due to an internal resolve or connection stronger than external dread. The poem sets a scene of impending storm and desolation, but the speaker's repeated insistence, "I will not, cannot go," reveals a chosen captivity, highlighting themes of nature, internal feeling, and a powerful, binding emotion. """, "description":"""The Night is Darkening Round Me""", "author":"""Emily Brontë"""}
{ "index": {} }
{ "attachmentData": """The Magic Mountain (1924) by Thomas Mann is a monumental novel about young German engineer Hans Castorp, who visits his cousin at a tuberculosis sanatorium in the Swiss Alps, intending a short stay but getting drawn into the isolated, timeless world of illness, philosophy, and pre-WWI European culture for seven years, exploring life, death, love (with Clavdia Cauchat), and politics before being pulled back to the "flatland" and the outbreak of war. It's a philosophical bildungsroman (coming-of-age story) using the microcosm of the Berghof sanatorium to reflect the macrocosm of a world on the brink of chaos, contrasting health and sickness, spirit and flesh, and intellect versus instinct. """, "description":"""The Magic Mountain""", "author":"""Thomas Mann"""}
{ "index": {} }
{ "attachmentData": """The Unbearable Lightness of Being's introduction sets up the novel's core philosophical dilemma: the conflict between "lightness" (meaninglessness, freedom from consequence) and "weight" (purpose, responsibility, eternal return), using the backdrop of Prague during the 1968 Soviet invasion to explore these ideas through the interwoven lives of surgeon Tomas, his wife Tereza, his mistress Sabina, and her lover Franz, blending love, politics, and existential questions. It immediately contrasts Nietzsche's eternal return (heavy) with Parmenides' concept of single-occurrence life (light), suggesting life's fleeting moments make choices weightless, a tension central to the characters' struggles with love, fidelity, and freedom. """, "description":"""The Unbearable Lightness of Being""", "author":"""Milan Kundera"""}

#CHECK RECORDS
GET innerhits_expansion_error_index/_search
{
  "query": {
    "match_all": {}
  }
}

#SEARCH PIPELINE
PUT /_search/pipeline/innerhits_expansion_error_index_search_pipeline
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ],
  "request_processors": [
    {
      "neural_query_enricher": {
        "default_model_id": "wxPEApsBh9UO9dEM5IFh"
      }
    }
  ]
}

#SEARCH WITH COLLAPSE (NO INNERHITS) - no error
GET innerhits_expansion_error_index/_search?search_pipeline=innerhits_expansion_error_index_search_pipeline
{
  "size": 50,
  "query": {
    "hybrid": {
      "queries": [
        {
          "query_string": {
            "query": "storm~1",
            "fields": [
              "description^2.0",
              "attachmentData"
            ]
          }
        },
        {
          "neural": {
            "chunk_embedding": {
              "query_text": "storm"
            }
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": "chunk_embedding"
  },
  "collapse": {
    "field": "author"
  }
}

#SEARCH WITH COLLAPSE (WITH INNERHITS) - ERROR
GET innerhits_expansion_error_index/_search?search_pipeline=innerhits_expansion_error_index_search_pipeline
{
  "size": 50,
  "query": {
    "hybrid": {
      "queries": [
        {
          "query_string": {
            "query": "storm~1",
            "fields": [
              "description^2.0",
              "attachmentData"
            ]
          }
        },
        {
          "neural": {
            "chunk_embedding": {
              "query_text": "storm"
            }
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": "chunk_embedding"
  },
  "collapse": {
    "field": "author",
    "inner_hits": [
      {
        "size": 100,
        "name": "innerHits",
        "_source": {
          "includes": [
            "attachmentData"
          ]
        }
      }
    ]
  }
}

Am I doing something wrong in terms of query structure?

@adrianahariuc This is not something you are doing wrong. I would recommend having a look at the RFC for collapse in hybrid queries; in particular, I believe what you are trying to achieve is marked as out of scope in the following paragraph:

Out of Scope:
Users can customize the number of documents per collapsed field and apply sorting criteria to these documents, such as by a different field. While the feature typically includes inner hits functionality for detailed document viewing, this capability is currently unavailable in hybrid search as it remains under development. Our team will test both the inner hits design and the collapse Proof of Concept (POC) to identify any necessary modifications for supporting inner hits within collapse functionality in hybrid search. For detailed technical specifications, please refer to the RFC document on hybrid search inner hits implementation here. To understand the intricate relationship between collapse and inner hits, consult the appendix section titled “Inner Hits with Collapse”, which provides a comprehensive breakdown of their interaction.

I think what I am trying to achieve has been documented as an available feature since release 3.2 (I mistakenly wrote 3.0 earlier). Here is the link to the official documentation:

@adrianahariuc
I’ve reproduced this on a clean 3.3.2 cluster using the docs-style pipeline and a small sample index.

  • Hybrid + collapse on a keyword field works.
  • As soon as I add collapse.inner_hits, the search fails with SearchPhaseExecutionException: failed to expand hits and an underlying EOFException: read past EOF ... _0.cfs [slice=_0.nvd] [slice=randomaccess], coming from Lucene90NormsProducer/TermScorer/HybridQueryScorer.

The docs say collapse.inner_hits has been supported for hybrid queries since 3.2, so this doesn’t look like an unsupported combination; it appears to be a bug in the hybrid/expand path when a neural subquery is involved.
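
For a bug report, it may help to show that the working and failing requests differ only in `collapse.inner_hits`. The following is a rough Python sketch that builds both request bodies from the queries posted in this thread; `hybrid_collapse_body` is a hypothetical helper name, and the script only prints JSON to paste into Dev Tools (it does not contact a cluster).

```python
import json


def hybrid_collapse_body(with_inner_hits: bool) -> dict:
    """Build the hybrid + collapse request body used in this thread."""
    body = {
        "size": 50,
        "query": {
            "hybrid": {
                "queries": [
                    {
                        "query_string": {
                            "query": "storm~1",
                            "fields": ["description^2.0", "attachmentData"],
                        }
                    },
                    {"neural": {"chunk_embedding": {"query_text": "storm"}}},
                ]
            }
        },
        "_source": {"excludes": "chunk_embedding"},
        "collapse": {"field": "author"},
    }
    if with_inner_hits:
        # The only difference between the working and the failing request.
        body["collapse"]["inner_hits"] = [
            {
                "size": 100,
                "name": "innerHits",
                "_source": {"includes": ["attachmentData"]},
            }
        ]
    return body


working = hybrid_collapse_body(with_inner_hits=False)
failing = hybrid_collapse_body(with_inner_hits=True)

print(json.dumps(working, indent=2))
print(json.dumps(failing, indent=2))
```

Both bodies are meant to be sent to `innerhits_expansion_error_index/_search` with the same `search_pipeline` parameter as in the original post.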

I would recommend raising an issue for this here.

I created a new issue. Here is the link to it: