This document proposes integrating [Lucene 9.0’s recent addition of Approximate Nearest Neighbor search](https://issues.apache.org/jira/browse/LUCENE-9004) into OpenSearch.
## Problem Statement
k-NN Search has gained popularity over recent years with the growing use of ML. A few applications include recommendation systems, NLP services, and computer vision.
A brute force nearest neighbor search consists of calculating the distance from a query point to every point in the index and returning the “k” index points that have the smallest distance. A point can be a dense vector, a sparse vector or a binary code; however, in this document, we are only interested in dense vectors. The major drawback of this algorithm is that it cannot search indices with millions or billions of vectors efficiently. To address this, a new problem was formulated: [Approximate Nearest Neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods), or ANN for short. The general idea is to sacrifice some search accuracy and indexing speed to drastically improve search latency. Several algorithms have been created, including Hierarchical Navigable Small World (HNSW), Inverted File (IVF), and Locality Sensitive Hashing (LSH).
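To make the brute-force baseline concrete, below is a minimal sketch in plain Java (not taken from any of the libraries discussed in this document; the class and method names are illustrative only) that scores every stored vector against the query and keeps the k closest by squared L2 distance:
```
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public final class BruteForceKnn {

    /** Squared Euclidean (L2) distance between two vectors of equal dimension. */
    static float l2Squared(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Scans every vector and keeps the k closest to the query: O(n * d) work per query. */
    static List<Integer> search(List<float[]> vectors, float[] query, int k) {
        // Max-heap ordered by distance, so the current farthest candidate is evicted first.
        PriorityQueue<float[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[1]).reversed());
        for (int i = 0; i < vectors.size(); i++) {
            heap.offer(new float[] { i, l2Squared(vectors.get(i), query) });
            if (heap.size() > k) {
                heap.poll(); // drop the farthest of the k + 1 candidates
            }
        }
        List<Integer> ids = new ArrayList<>();
        while (!heap.isEmpty()) {
            ids.add((int) heap.poll()[0]);
        }
        return ids; // ordered farthest to closest; reverse if needed
    }
}
```
ANN algorithms such as HNSW trade the exactness of this full scan for sub-linear search time.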
Several different libraries exist that implement these algorithms, including _nmslib_ and _faiss_. The libraries are similar to Lucene in that they build indices that can be searched and serialized — however, they are built specifically for vector search. The k-NN plugin integrates these libraries into OpenSearch so that OpenSearch users can perform ANN search workloads.
The approach the plugin takes has a few drawbacks:
1. The libraries mentioned above are implemented in C++. In a Java project like OpenSearch, this is a bit unnatural — it complicates the build process and prevents us from supporting different systems, such as [Windows](https://github.com/opensearch-project/opensearch-build/issues/33).
2. The library indices must be loaded into native memory in their entirety in order to be searched. The indices contain both the vectors and the underlying data structures, which causes memory-pressure issues when memory is constrained.
3. Developers that want to use vectors in their applications have to take a dependency on the k-NN plugin to get access to the knn_vector data type the plugin introduces.
## Proposed Solution
TLDR: Build a new field type and query type in OpenSearch to support ANN search through Lucene.
**Lucene 9’s introduction of Approximate Nearest Neighbor Support**
In the [9.0 release](https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.0), Lucene added support for building indices with dense vectors and running ANN search on them. In order to create indices with vectors, a new Codec format was introduced, _KnnVectorsFormat_. This format is used to encode and decode numeric vector fields and execute Nearest Neighbor search. In Lucene 9, the default format used is _Lucene90HnswVectorsFormat_. This format implements the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for ANN search. During indexing, this format will create 2 separate segment files: one for the vectors and one for the HNSW graph structure, allowing the vectors to exist off heap.
Lucene’s implementation of HNSW takes two parameters at index time: _max_connections_ and _beam_width_. _Max_connections_ sets a ceiling on the number of connections a node in the graph can have. _Beam_width_ controls the size of the candidate list considered when selecting the neighbors of a newly added node. These parameters map to the _m_ and _ef_construction_ parameters from [the original paper](https://arxiv.org/abs/1603.09320) and are set in _Lucene91HnswVectorsFormat_’s constructor. The Codec supports a per-field _KnnVectorsFormat_, which allows these parameters to be set at the field level.
For search, a new query was introduced, the _KnnVectorQuery_. This query contains the query vector as well as k, the number of results a query on the HNSW graph should return. For search, the HNSW algorithm has one parameter, _ef_search_. In Lucene’s implementation this value is [hardcoded](https://github.com/apache/lucene/blob/releases/lucene/9.0.0/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java#L237) to k.
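As a rough end-to-end sketch of these Lucene 9.0 APIs (the field name, vector values, and parameter values are ours; the class names should be verified against the exact Lucene version being integrated), the following configures the HNSW format per field through the codec, indexes a vector field, and runs a _KnnVectorQuery_:
```
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene90.Lucene90Codec;
import org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneAnnExample {
    public static void main(String[] args) throws Exception {
        // Per-field HNSW parameters: max_connections and beam_width are passed to the format's constructor.
        Lucene90Codec codec = new Lucene90Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                return new Lucene90HnswVectorsFormat(16, 100); // maxConn, beamWidth
            }
        };

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig().setCodec(codec))) {
            // Index one document with a 3-dimensional vector field.
            Document doc = new Document();
            doc.add(new KnnVectorField("my_vector", new float[] {1f, 2f, 3f}, VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
            writer.commit();

            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Approximate search: return the k nearest vectors to the query vector.
                TopDocs topDocs = searcher.search(new KnnVectorQuery("my_vector", new float[] {1f, 2f, 3.1f}, 1), 1);
                System.out.println("hits: " + topDocs.totalHits.value);
            }
        }
    }
}
```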
**Integration with OpenSearch**
In order to integrate this new functionality into OpenSearch, we need to introduce a new field type, _dense_vector_. This field type would allow users to index floating point arrays of uniform length into OpenSearch and build the index structures needed to support ANN search with Lucene.
```
{
  "type": "dense_vector",
  "dtype": float,
  "dimension": d,
  "knn": {
    "metric": "l2",
    "algorithm": {
      "name": "hnsw",
      "params": {
        "max_connections": X,
        "beam_width": X
      }
    }
  }
}
```
Because we will initially provide support only for the Lucene HNSW implementation, the format above can be simplified to the following form:
```
{
  "type": "dense_vector",
  "dimension": d,
  "knn": {
  }
}
```
For the mapping above, the system will use Lucene HNSW with the _L2_ metric, _max_connections_ of 16, and _beam_width_ of 100 (the Lucene defaults).
The minimal mapping for _dense_vector_, shown below, does not include the _knn_ structure. With such a definition, the proposed k-NN support based on Lucene HNSW is not enabled and the related graph data structures are not created; only the vector values are stored.
```
{
  "type": "dense_vector",
  "dimension": d
}
```
In addition to the field type, we are adding support for a new query type named _knn_query_.
Lucene’s implementation of HNSW and its knn vector type have certain limitations that we have to accept with this integration: the maximum number of dimensions for a knn vector is 1024, which implies the same limit for the new _dense_vector_ field type.
**Performance**
The team has benchmarked a prototype of the solution based on integrating Lucene 9.1 with OpenSearch 2.0.
For the benchmark, the algorithm parameters _ef_construction_ and _beam_width_ were set to 512, while _max_connections_ and _m_ were varied from 4 to 96.
The benchmark results showed that the Lucene 9.1 solution is comparable with the existing k-NN HNSW implementation based on _nmslib_, with certain tradeoffs: the Lucene 9.1 solution cannot reach recall values close to 1.0, but at comparable recall values it consumes fewer resources and has better query latencies than the existing k-NN plugin.
We have observed some trends in the [benchmark results](https://gist.github.com/martin-gaievski/7a0d83f57d607392642f87e43f40f606): for the same high recall values, the k-NN index takes more memory than the Lucene implementation requires, and query latencies and memory allocated per query are also higher for the k-NN implementation.
The solution remains stable under the constant query load that is typical for the k-NN plugin.
The metrics obtained are rough, as they are based on [POC code](https://github.com/martin-gaievski/OpenSearch/tree/lucene-ann-integ) for the Lucene 9.1 integration with OpenSearch 2.0 without any optimizations. We plan to gather more accurate metrics for the production-ready version.
**Pre-filtering feature**
Lucene’s implementation of HNSW brings the additional benefit of pre-filtering on queries. The current k-NN implementation provides post-filtering: documents are first selected by the HNSW algorithm, and the filter is then applied on top of the HNSW results. In general, the results are not exact because of the order in which these steps are applied; for example, a highly selective filter can leave fewer than k results after it is applied to the HNSW candidates.
Lucene 9.1 instead applies the filter to the live docs first and then selects the k closest results with the HNSW algorithm.
At the moment, we plan to deliver the pre-filtering feature “as is”, with no changes or additions to what is implemented in Lucene.
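For illustration, here is a sketch of what a pre-filtered query looks like with the Lucene 9.1 API, which added a filter argument to _KnnVectorQuery_ (the field and filter below are just example values):
```
import org.apache.lucene.index.Term;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class PreFilteredKnn {
    /** Builds a k-NN query whose candidates are restricted to documents matching the filter. */
    static Query filteredKnn(float[] queryVector, int k) {
        // The filter is evaluated against live docs first; HNSW then returns the k nearest
        // vectors among the documents that passed the filter (pre-filtering, not post-filtering).
        Query filter = new TermQuery(new Term("category", "shoes"));
        return new KnnVectorQuery("my_vector", queryVector, k, filter);
    }
}
```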
**Impact on k-NN plugin**
The existing _knn_vector_ field supported by the k-NN plugin will be independent of the _dense_vector_ field introduced in core OpenSearch. This change will not break any functionality of the plugin, and customers will be able to upgrade existing indices created with earlier k-NN versions, taking into account the limitation on maximum vector dimensions.
From a migration standpoint, users would be able to reindex their _knn_vector_ indices into indices that use the new _dense_vector_ field type and vice versa.
In order to share the knn query type with OpenSearch, the plugin’s Codec would be required to implement the new KnnVectorsFormat. More details on this will be available in the [plugins repository](https://github.com/opensearch-project/k-NN).
## Revised Approach
### Overview
After some offline discussions, we have decided to propose a pivot to our plan to support Lucene k-NN.
Introducing a new data type, dense_vector, and query type, knn_query, when we already have the data type, knn_vector, and the query type, knn, in the plugin will be a source of confusion for users. Having two data types with very similar end-user functionality will raise questions such as “Why aren’t they the same?”, “When should I use which?”, etc. Instead of introducing dense_vector, we propose capturing the Lucene functionality inside the existing knn_vector field type in the plugin. Then, we would use the existing knn query type as well. The interface would be consistent with our current knn_vector.
As a consequence of this change, because only one field mapping parser can be registered for a given field name (and likewise only one query parser for a given query name), we would need to move the implementation into the k-NN plugin. We would still target the 2.2 release.
### Using existing knn_vector field type instead of new dense_vector
#### Original Proposal Limitations
**User Experience**
The main limitation to this approach is the confusing user experience that it creates. We would have 2 different data types that do very similar things: knn_vector and dense_vector. Each would support indexing dense vectors and running approximate Nearest Neighbor searches. In addition to this, we would have 2 different query types that would do a similar thing: knn_query and knn.
The decision to create a new data type and query type was originally made to keep the new functionality separate from the architecture-specific nature of the k-NN plugin. However, on closer examination, a user could use the k-NN plugin without the native libraries on a different platform, as long as we update the plugin’s build infrastructure to decouple the platform-dependent components from the platform-independent ones.
**Updated Proposal**
To solve the problem of user confusion, we would build Lucene ANN support into our current knn_vector interface:
```
{
  "type": "knn_vector",
  "dimension": 100,
  "method": {
    "name": "hnsw",
    "engine": "lucene",
    "space_type": "l2",
    "parameters": {
      "m": 2048,
      "ef_construction": 245
    }
  }
}
```
The interface would stay the same except for adding lucene as a new engine to support. We have evaluated the technical feasibility of this approach and deemed it feasible.
With this, the query would stay the same as our current query for the plugin:
```
"query": {
"knn": {
"target_field": {
"vector": [<insert vector>],
"k": 2
}
}
}
```
Note that in the future, we could update these interfaces as we see fit. For more information on the current interface, please take a look at the [k-NN documentation](https://opensearch.org/docs/1.2/search-plugins/knn/approximate-knn/).
#### Updated Proposal Limitations
**Initial Coupling to Architecture Dependent Nature of Plugin**
One limitation of the updated proposal is that it couples the platform-agnostic Lucene ANN functionality with the plugin’s platform-dependent components.
However, a user who wants to run the plugin on a platform the plugin does not support could install the plugin zip without the libraries and use the Lucene engine. As it stands right now, if they tried to use nmslib or faiss, the process would crash.
Overall, we would solve this problem by decoupling the JNI library from the plugin and failing gracefully if a user tries to use the JNI components on an unsupported platform. We would also decouple the build process.
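As a purely hypothetical sketch of what such a graceful failure could look like (none of the names below come from the plugin’s actual code), the plugin could record whether the JNI library loaded and reject nmslib/faiss code paths with a clear error instead of crashing:
```
public final class NativeLibraryLoader {
    private static final boolean AVAILABLE = tryLoad();

    private static boolean tryLoad() {
        try {
            System.loadLibrary("opensearchknn_common"); // hypothetical JNI library name
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false; // platform without native libraries: the Lucene engine still works
        }
    }

    /** Called before any nmslib/faiss code path; fails with a clear message instead of crashing. */
    static void ensureNativeEngineAvailable(String engine) {
        if (!AVAILABLE) {
            throw new IllegalStateException(
                "The '" + engine + "' engine requires native libraries that are not available "
                + "on this platform; use the 'lucene' engine instead.");
        }
    }
}
```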
**Difficulty integrating with other engine plugins**
Another limitation is more of a technical nuance. From the user’s perspective, the consequence is that they could not use a Lucene k-NN index together with another engine plugin feature such as CCR (cross-cluster replication).
This is because we have to pass the HNSW parameters to the Lucene KnnVectorsFormat during [construction](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsFormat.java#L143). From a plugin, we have no way to add this to the [default Codec](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/codec/PerFieldMappingPostingFormatCodec.java) that OpenSearch uses, so we need to supply our own custom codec. However, two custom codecs [cannot be used](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/engine/EngineConfigFactory.java#L72) for the same index at the same time.
In the short term, this means users cannot use Lucene k-NN together with another feature like CCR. This limitation is currently present for the plugin as well.
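To illustrate what a custom codec could look like, here is a hypothetical plugin-side sketch that wraps the default codec with a _PerFieldKnnVectorsFormat_ so that HNSW parameters reach the format’s constructor (the codec class name is ours, and the parameter values shown are simply the Lucene defaults; in the real plugin they would come from the field mapping):
```
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Hypothetical plugin-side codec: wraps the default codec and swaps in a per-field
// KnnVectorsFormat so HNSW parameters can be passed to the format's constructor.
public class KnnPluginCodec extends FilterCodec {
    public KnnPluginCodec() {
        super("KnnPluginCodec", Codec.getDefault());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        return new PerFieldKnnVectorsFormat() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                // In the real plugin these values would be resolved from the field's mapping;
                // 16 / 100 are just the Lucene defaults used for illustration.
                return new Lucene91HnswVectorsFormat(16, 100);
            }
        };
    }
}
```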
Related GitHub issues:
- https://github.com/opensearch-project/OpenSearch/issues/1254
- https://github.com/opensearch-project/OpenSearch/issues/2447
- https://github.com/opensearch-project/k-NN/issues/147
## Feedback Requested
Any general comments about the overall direction are welcome. Some specific questions:
- Do you prefer exposing the Lucene implementation via the existing _knn_vector_ field or via a new _dense_vector_ field?
- Do you have use cases where recall close to 1.0 is not critical for k-NN search? For such use cases, are you willing to trade lower recall for less memory consumption and lower query latencies?
- Is L2 (Euclidean) distance a good default metric for your use cases? Other metrics supported by Lucene are `Dot Product` and `Cosine Similarity`.