This document proposes integrating [Lucene 9.0’s recent addition of Approximate Nearest Neighbor search](https://issues.apache.org/jira/browse/LUCENE-9004) into OpenSearch.
## Problem Statement
k-NN Search has gained popularity over recent years with the growing use of ML. A few applications include recommendation systems, NLP services, and computer vision.
A brute force nearest neighbor search consists of calculating the distance from a query point to every point in the index and returning the “k” index points that have the smallest distance. A point can be a dense vector, a sparse vector or a binary code; however, in this document, we are only interested in dense vectors. The major drawback of this algorithm is that it cannot search indices with millions or billions of vectors efficiently. To address this, a new problem was formulated: [Approximate Nearest Neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods), or ANN for short. The general idea is to sacrifice some search accuracy and indexing speed to drastically improve search latency. Several algorithms have been created, including Hierarchical Navigable Small World (HNSW), Inverted File (IVF), and Locality Sensitive Hashing (LSH).
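To make the brute-force baseline concrete, below is a minimal sketch in plain Java (not taken from any of the libraries discussed in this document; the class and method names are illustrative only) that scores every stored vector against the query and keeps the k closest by squared L2 distance:
```
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public final class BruteForceKnn {

    /** Squared Euclidean (L2) distance between two vectors of equal dimension. */
    static float l2Squared(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Scans every vector and keeps the k closest to the query: O(n * d) work per query. */
    static List<Integer> search(List<float[]> vectors, float[] query, int k) {
        // Max-heap ordered by distance, so the current farthest candidate is evicted first.
        PriorityQueue<float[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[1]).reversed());
        for (int i = 0; i < vectors.size(); i++) {
            heap.offer(new float[] { i, l2Squared(vectors.get(i), query) });
            if (heap.size() > k) {
                heap.poll(); // drop the farthest of the k + 1 candidates
            }
        }
        List<Integer> ids = new ArrayList<>();
        while (!heap.isEmpty()) {
            ids.add((int) heap.poll()[0]);
        }
        return ids; // ordered farthest to closest; reverse if needed
    }
}
```
ANN algorithms such as HNSW trade the exactness of this full scan for sub-linear search time.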
Several different libraries exist that implement these algorithms, including _nmslib_ and _faiss_. The libraries are similar to Lucene in that they build indices that can be searched and serialized — however, they are built specifically for vector search. The k-NN plugin integrates these libraries into OpenSearch so that OpenSearch users can perform ANN search workloads.
The approach the plugin takes has a few drawbacks:
1. The libraries mentioned above are implemented in C++. In a Java project like OpenSearch, this is a bit unnatural — it complicates the build process and prevents us from supporting different systems, such as [Windows](https://github.com/opensearch-project/opensearch-build/issues/33).
2. The library indices must be loaded into native memory in their entirety in order to be searched. The indices contain both the vectors and the underlying data structures, which causes memory-pressure issues when memory is constrained.
3. Developers that want to use vectors in their applications have to take a dependency on the k-NN plugin to get access to the knn_vector data type the plugin introduces.
## Proposed Solution
TLDR: Build a new field type and query type in OpenSearch to support ANN search through Lucene.
**Lucene 9’s introduction of Approximate Nearest Neighbor Support**
In the [9.0 release](https://cwiki.apache.org/confluence/display/LUCENE/Release+Notes+9.0), Lucene added support for building indices with dense vectors and running ANN search on them. In order to create indices with vectors, a new Codec format was introduced, _KnnVectorsFormat_. This format is used to encode and decode numeric vector fields and execute Nearest Neighbor search. In Lucene 9, the default format used is _Lucene90HnswVectorsFormat_. This format implements the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for ANN search. During indexing, this format will create 2 separate segment files: one for the vectors and one for the HNSW graph structure, allowing the vectors to exist off heap.
Lucene’s implementation of HNSW takes two parameters at index time: _max_connections_ and _beam_width_. _Max_connections_ sets a ceiling on the number of connections a node in the graph can have. _Beam_width_ controls the size of the candidate list considered when selecting the neighbors of a newly added node. These parameters map to the _m_ and _ef_construction_ parameters from [the original paper](https://arxiv.org/abs/1603.09320) and are set in _Lucene91HnswVectorsFormat_’s constructor. The Codec supports a per-field _KnnVectorsFormat_, which allows these parameters to be set at the field level.
For search, a new query was introduced, the _KnnVectorQuery_. This query contains the query vector as well as k, the number of results a query on the HNSW graph should return. For search, the HNSW algorithm has one parameter, _ef_search_. In Lucene’s implementation this value is [hardcoded](https://github.com/apache/lucene/blob/releases/lucene/9.0.0/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsReader.java#L237) to k.
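As a rough end-to-end sketch of these Lucene 9.0 APIs (the field name, vector values, and parameter values are ours; the class names should be verified against the exact Lucene version being integrated), the following configures the HNSW format per field through the codec, indexes a vector field, and runs a _KnnVectorQuery_:
```
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene90.Lucene90Codec;
import org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneAnnExample {
    public static void main(String[] args) throws Exception {
        // Per-field HNSW parameters: max_connections and beam_width are passed to the format's constructor.
        Lucene90Codec codec = new Lucene90Codec() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                return new Lucene90HnswVectorsFormat(16, 100); // maxConn, beamWidth
            }
        };

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig().setCodec(codec))) {
            // Index one document with a 3-dimensional vector field.
            Document doc = new Document();
            doc.add(new KnnVectorField("my_vector", new float[] {1f, 2f, 3f}, VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
            writer.commit();

            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Approximate search: return the k nearest vectors to the query vector.
                TopDocs topDocs = searcher.search(new KnnVectorQuery("my_vector", new float[] {1f, 2f, 3.1f}, 1), 1);
                System.out.println("hits: " + topDocs.totalHits.value);
            }
        }
    }
}
```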
**Integration with OpenSearch**
In order to integrate this new functionality into OpenSearch, we need to introduce a new field type, _dense_vector_. This field type would allow users to index floating point arrays of uniform length into OpenSearch and build the index structures needed to support ANN search with Lucene.
```
{
  "type": "dense_vector",
  "dtype": float,
  "dimension": d,
  "knn": {
    "metric": "l2",
    "algorithm": {
      "name": "hnsw",
      "params": {
        "max_connections": X,
        "beam_width": X
      }
    }
  }
}
```
Because we will initially provide support only for the Lucene HNSW implementation, the format above can be simplified to the following form:
```
{
  "type": "dense_vector",
  "dimension": d,
  "knn": {
  }
}
```
For the mapping above, the system will use Lucene HNSW with the _L2_ metric, _max_connections_ of 16, and _beam_width_ of 100 (the Lucene defaults).
The minimal mapping for _dense_vector_, shown below, does not include the _knn_ structure. With such a definition, the proposed k-NN support based on Lucene HNSW is not enabled and the related graph data structures are not created; only the vector values are stored.
```
{
  "type": "dense_vector",
  "dimension": d
}
```
In addition to the field type, we are adding support for a new query type named _knn_query_.
Lucene’s implementation of HNSW and its knn vector type have certain limitations that we have to accept with this integration: the maximum number of dimensions for a knn vector is 1024, which implies the same limit for the new _dense_vector_ field type.
**Performance**
The team has benchmarked a prototype of the solution based on integrating Lucene 9.1 with OpenSearch 2.0.
For the benchmark, the algorithm parameters _ef_construction_ and _beam_width_ were set to 512, while _max_connections_ and _m_ were varied from 4 to 96.
The benchmark results showed that the Lucene 9.1 solution is comparable with the existing k-NN HNSW implementation based on _nmslib_, with certain tradeoffs: the Lucene 9.1 solution cannot reach recall values close to 1.0, but at comparable recall values it consumes fewer resources and has better query latencies than the existing k-NN plugin.
We have observed some trends in the [benchmark results](https://gist.github.com/martin-gaievski/7a0d83f57d607392642f87e43f40f606): for the same high recall values, the k-NN index takes more memory than the Lucene implementation requires, and query latencies and memory allocated per query are also higher for the k-NN implementation.
The solution remains stable under the constant query load that is typical for the k-NN plugin.
The metrics obtained are rough, as they are based on [POC code](https://github.com/martin-gaievski/OpenSearch/tree/lucene-ann-integ) for the Lucene 9.1 integration with OpenSearch 2.0 without any optimizations. We plan to gather more accurate metrics for the production-ready version.
**Pre-filtering feature**
Lucene’s implementation of HNSW brings the additional benefit of pre-filtering on queries. The current k-NN implementation provides post-filtering: documents are first selected by the HNSW algorithm, and the filter is then applied on top of the HNSW results. In general, the results are not exact because of the order in which these steps are applied; for example, a highly selective filter can leave fewer than k results after it is applied to the HNSW candidates.
Lucene 9.1 instead applies the filter to the live docs first and then selects the k closest results with the HNSW algorithm.
At the moment, we plan to deliver the pre-filtering feature “as is”, with no changes or additions to what is implemented in Lucene.
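For illustration, here is a sketch of what a pre-filtered query looks like with the Lucene 9.1 API, which added a filter argument to _KnnVectorQuery_ (the field and filter below are just example values):
```
import org.apache.lucene.index.Term;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class PreFilteredKnn {
    /** Builds a k-NN query whose candidates are restricted to documents matching the filter. */
    static Query filteredKnn(float[] queryVector, int k) {
        // The filter is evaluated against live docs first; HNSW then returns the k nearest
        // vectors among the documents that passed the filter (pre-filtering, not post-filtering).
        Query filter = new TermQuery(new Term("category", "shoes"));
        return new KnnVectorQuery("my_vector", queryVector, k, filter);
    }
}
```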
**Impact on k-NN plugin**
The existing _knn_vector_ field supported by the k-NN plugin will be independent of the _dense_vector_ field introduced in core OpenSearch. This change will not break any functionality of the plugin, and customers will be able to upgrade existing indices created with earlier k-NN versions, taking into account the limitation on maximum vector dimensions.
From a migration standpoint, users would be able to reindex their _knn_vector_ indices into indices that use the new _dense_vector_ field type and vice versa.
In order to share the knn query type with OpenSearch, the plugin’s Codec would be required to implement the new KnnVectorsFormat. More details on this will be available in the [plugins repository](https://github.com/opensearch-project/k-NN).
## Revised Approach
### Overview
After some offline discussions, we have decided to propose a pivot to our plan to support Lucene k-NN.
Introducing a new data type, dense_vector, and query type, knn_query, when we already have the data type, knn_vector, and the query type, knn, in the plugin will be a source of confusion for users. Having two data types with very similar end-user functionality will raise questions such as “Why aren’t they the same?”, “When should I use which?”, etc. Instead of introducing dense_vector, we propose capturing the Lucene functionality inside the existing knn_vector field type in the plugin. Then, we would use the existing knn query type as well. The interface would be consistent with our current knn_vector.
As a consequence of this change, because only one field mapping parser can be registered for a given field name (and likewise only one query parser for a given query name), we would need to move the implementation into the k-NN plugin. We would still target the 2.2 release.
### Using existing knn_vector field type instead of new dense_vector
#### Original Proposal Limitations
**User Experience**
The main limitation to this approach is the confusing user experience that it creates. We would have 2 different data types that do very similar things: knn_vector and dense_vector. Each would support indexing dense vectors and running approximate Nearest Neighbor searches. In addition to this, we would have 2 different query types that would do a similar thing: knn_query and knn.
The decision to create a new data type and query type was originally made to keep the new functionality separate from the architecture-specific nature of the k-NN plugin. However, on closer examination, a user could use the k-NN plugin without the native libraries on a different platform, as long as we update the plugin’s build infrastructure to decouple the platform-dependent components from the platform-independent ones.
**Updated Proposal**
To solve the problem of user confusion, we would build Lucene ANN support into our current knn_vector interface:
```
{
  "type": "knn_vector",
  "dimension": 100,
  "method": {
    "name": "hnsw",
    "engine": "lucene",
    "space_type": "l2",
    "parameters": {
      "m": 2048,
      "ef_construction": 245
    }
  }
}
```
The interface would stay the same except for adding lucene as a new engine to support. We have evaluated the technical feasibility of this approach and deemed it feasible.
With this, the query would stay the same as our current query for the plugin:
```
"query": {
"knn": {
"target_field": {
"vector": [<insert vector>],
"k": 2
}
}
}
```
Note that in the future, we could update these interfaces as we see fit. For more information on the current interface, please take a look at the [k-NN documentation](https://opensearch.org/docs/1.2/search-plugins/knn/approximate-knn/).
#### Updated Proposal Limitations
**Initial Coupling to Architecture Dependent Nature of Plugin**
One limitation of the updated proposal is that it couples the platform-agnostic Lucene ANN functionality with the plugin’s platform-dependent components.
However, a user who wants to run the plugin on a platform the plugin does not support could install the plugin zip without the libraries and use the Lucene engine. As it stands right now, if they tried to use nmslib or faiss, the process would crash.
Overall, we would solve this problem by decoupling the JNI library from the plugin and failing gracefully if a user tries to use the JNI components on an unsupported platform. We would also decouple the build process.
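As a purely hypothetical sketch of what such a graceful failure could look like (none of the names below come from the plugin’s actual code), the plugin could record whether the JNI library loaded and reject nmslib/faiss code paths with a clear error instead of crashing:
```
public final class NativeLibraryLoader {
    private static final boolean AVAILABLE = tryLoad();

    private static boolean tryLoad() {
        try {
            System.loadLibrary("opensearchknn_common"); // hypothetical JNI library name
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false; // platform without native libraries: the Lucene engine still works
        }
    }

    /** Called before any nmslib/faiss code path; fails with a clear message instead of crashing. */
    static void ensureNativeEngineAvailable(String engine) {
        if (!AVAILABLE) {
            throw new IllegalStateException(
                "The '" + engine + "' engine requires native libraries that are not available "
                + "on this platform; use the 'lucene' engine instead.");
        }
    }
}
```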
**Difficulty integrating with other engine plugins**
Another limitation is more of a technical nuance. From the user’s perspective, the consequence is that they could not use a Lucene k-NN index together with another engine plugin feature such as CCR (cross-cluster replication).
This is because we have to pass the HNSW parameters to the Lucene KnnVectorsFormat during [construction](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsFormat.java#L143). From a plugin, we have no way to add this to the [default Codec](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/codec/PerFieldMappingPostingFormatCodec.java) that OpenSearch uses, so we need to supply our own custom codec. However, two custom codecs [cannot be used](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/engine/EngineConfigFactory.java#L72) for the same index at the same time.
In the short term, this means users cannot use Lucene k-NN together with another feature like CCR. This limitation is currently present for the plugin as well.
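To illustrate what a custom codec could look like, here is a hypothetical plugin-side sketch that wraps the default codec with a _PerFieldKnnVectorsFormat_ so that HNSW parameters reach the format’s constructor (the codec class name is ours, and the parameter values shown are simply the Lucene defaults; in the real plugin they would come from the field mapping):
```
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Hypothetical plugin-side codec: wraps the default codec and swaps in a per-field
// KnnVectorsFormat so HNSW parameters can be passed to the format's constructor.
public class KnnPluginCodec extends FilterCodec {
    public KnnPluginCodec() {
        super("KnnPluginCodec", Codec.getDefault());
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        return new PerFieldKnnVectorsFormat() {
            @Override
            public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                // In the real plugin these values would be resolved from the field's mapping;
                // 16 / 100 are just the Lucene defaults used for illustration.
                return new Lucene91HnswVectorsFormat(16, 100);
            }
        };
    }
}
```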
Related GitHub issues:
- https://github.com/opensearch-project/OpenSearch/issues/1254
- https://github.com/opensearch-project/OpenSearch/issues/2447
- https://github.com/opensearch-project/k-NN/issues/147
## Feedback Requested
Any general comments about the overall direction are welcome. Some specific questions:
- Do you prefer exposing the Lucene implementation via the existing _knn_vector_ field or via a new _dense_vector_ field?
- Do you have use cases where recall close to 1.0 is not critical for k-NN search? For such use cases, are you willing to trade lower recall for less memory consumption and lower query latencies?
- Is L2 (Euclidean) distance a good default metric for your use cases? Other metrics supported by Lucene are `Dot Product` and `Cosine Similarity`.