OpenSearch Lucene Study Group Meeting - Monday, November 20th

Welcome to the first public meeting of the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.

In this meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.


New issues to review this week:

| Category | Description | Link |
| --- | --- | --- |
| API Changes | Add TaxonomyReader#getBulkOrdinals method to more efficiently retrieve facet ordinals for multiple FacetLabel at once. | https://github.com/apache/lucene/issues/12180 |
| New Features | Added similarityToQueryVector API to compute vector similarity scores with DoubleValuesSource. | https://github.com/apache/lucene/issues/12548 |
| New Features | Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per segment basis and maintained across merges. The property is exposed via LeafReaderMetadata. | https://github.com/apache/lucene/issues/12685 |
| New Features | Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor Lucene99HnswVectorsFormat to reuse the flat format. Added new Lucene99HnswQuantizedVectorsFormat that uses quantized vectors for its flat storage. | https://github.com/apache/lucene/issues/12729 |
| Improvements | Remove possible contention on a ReentrantReadWriteLock in Monitor which could result in searches waiting for commits. | https://github.com/apache/lucene/issues/12801 |
| Improvements | LUCENE-10241: Upgrade OpenNLP to 1.9.4. | https://github.com/apache/lucene/issues/11277 |
| Optimizations | Skip docs with DocValues in NumericLeafComparator. | https://github.com/apache/lucene/issues/12381 |
| Optimizations | Specialize arc store for continuous label in FST. | https://github.com/apache/lucene/issues/12748 |
| Optimizations | Cache buckets to speed up BytesRefHash#sort. | https://github.com/apache/lucene/issues/12784 |
| Optimizations | Utilize exact kNN search when gathering k >= numVectors in a segment. | https://github.com/apache/lucene/issues/12806 |
| Bug Fixes | TestIndexWriterOnVMError.testUnknownError times out (fixes potential IndexWriter deadlock with tragic exceptions). | https://github.com/apache/lucene/issues/12654 |
| Bug Fixes | Stop exploring HNSW graph if scores are not getting better. | https://github.com/apache/lucene/issues/12770 |
| Bug Fixes | Ensure #finish is called on all drill-sideways collectors even if one throws a CollectionTerminatedException. | https://github.com/apache/lucene/issues/12640 |
| Bug Fixes | Fix segmentInfos replace to set userData. | https://github.com/apache/lucene/issues/12626 |
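
To make the exact kNN item (issue 12806) a little more concrete: when k is at least the number of vectors in a segment, every vector ends up in the result anyway, so scoring all of them directly is enough and no graph traversal is needed. The following is only a rough Python sketch of that idea, not Lucene's code. The function names, the dot-product similarity, and the `approximate_search` callback are all assumptions made for illustration.

```python
import heapq

def dot(a, b):
    """Dot-product similarity (an assumed similarity function for this sketch)."""
    return sum(x * y for x, y in zip(a, b))

def exact_knn(query, vectors, k):
    """Brute force: score every vector and keep the k best (score, doc_id) pairs."""
    scored = ((dot(query, v), doc_id) for doc_id, v in enumerate(vectors))
    return heapq.nlargest(k, scored)

def search_segment(query, vectors, k, approximate_search):
    """Hypothetical per-segment dispatch between exact and approximate search."""
    if k >= len(vectors):
        # Every vector will be returned, so an approximate graph search adds nothing.
        return exact_knn(query, vectors, k)
    return approximate_search(query, vectors, k)

# A three-vector "segment" searched with k=5: the exact path is taken,
# so the approximate_search callback is never invoked.
segment = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hits = search_segment([1.0, 0.0], segment, k=5, approximate_search=None)
```

Here `hits` contains all three documents ordered by score, since k exceeds the segment size.
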

This was previously an internal Amazon meeting, but we’ve taken it public. Below is the “homework” assigned during the last internal meeting, which we’ll review in this week’s meeting.

| Category | Description | Link | Owner |
| --- | --- | --- | --- |
| API Changes | Automata#makeStringUnion and #makeBinaryStringUnion now accept Iterable instead of Collection. They also now explicitly throw IllegalArgumentException if input data is not properly sorted instead of relying on assert. | https://github.com/apache/lucene/issues/12427 | Rishabh Kumar Maurya |
| New Features | Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy storage for the vectors, requiring about 75% less memory for fast HNSW search. | https://github.com/apache/lucene/issues/12582 | Navneet Verma |
| New Features | HNSW graphs can now be merged using multiple threads. Configurable in Lucene99HnswVectorsFormat. | https://github.com/apache/lucene/issues/12660 | Navneet Verma |
| Optimizations | Disjunctions now sometimes run as conjunctions when the minimum competitive score requires multiple clauses to match. | https://github.com/apache/lucene/issues/12589 | Rishabh Kumar Maurya |
| Optimizations | Top-level conjunctions that are not sorted by score now have a specialized bulk scorer. | https://github.com/apache/lucene/issues/12719 | Saurabh Singh |
| Optimizations | Faster merging of terms enums. | https://github.com/apache/lucene/issues/1052 | Michael Froh |
| Optimizations | Faster sort on high-cardinality string fields. | https://github.com/apache/lucene/issues/11903 | Harsha Vamsi Kalluri |
| Bug Fixes | Ensure negative scores are not returned by vector similarity functions. | https://github.com/apache/lucene/issues/12727 | Jack Mazanec |
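
As background for the int8 scalar quantization item (issue 12582): a float32 component takes 4 bytes and a quantized code takes 1 byte, so the stored vector data shrinks by roughly 75% at the cost of some precision. Here is a small Python sketch of the arithmetic, under stated assumptions. The fixed value range `[-1.0, 1.0]`, the 0..127 code range, and the helper names are chosen for illustration only and are not how Lucene picks its bounds or lays out its files.

```python
import struct

def quantize(vector, lo, hi):
    """Map each float (clamped to [lo, hi]) to an integer code in 0..127."""
    scale = 127.0 / (hi - lo)
    return bytes(round((min(max(x, lo), hi) - lo) * scale) for x in vector)

def dequantize(codes, lo, hi):
    """Recover approximate floats from the codes (lossy)."""
    step = (hi - lo) / 127.0
    return [lo + c * step for c in codes]

vector = [0.12, -0.5, 0.98, 0.0]

raw = struct.pack(f"<{len(vector)}f", *vector)  # float32: 4 bytes per dimension
codes = quantize(vector, -1.0, 1.0)             # quantized: 1 byte per dimension

saving = 1 - len(codes) / len(raw)              # fraction of vector bytes saved
approx = dequantize(codes, -1.0, 1.0)           # values after the round trip
```

The round trip is lossy: each recovered value can differ from the original by up to half a quantization step, which is the trade-off behind the smaller footprint.
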

I recently opened a ticket in Lucene and would like to get some feedback on it. It could be a good starting point for explaining some basic concepts and the testing framework:


On Issues · apache/lucene · GitHub

  • Increase number of hits fetched.
  • Check impact on search_after after we upgrade to Lucene 9.9

Suggestion from Ankit Jain – let’s collect Lucene videos. Add links to this post.

| PR link | Assignees | Notes |
| --- | --- | --- |
| [12180](https://github.com/apache/lucene/issues/12180) | | Discuss issue in OS if we need taxonomy index support |
| [12548](https://github.com/apache/lucene/pull/12548) | samuel-oci | KNN - Similarity to query vector |
| [12685](https://github.com/apache/lucene/pull/12685) | reta | For nested docs, segment info will have a flag. See if we can skip segments using this for nested queries |
| [12729](https://github.com/apache/lucene/pull/12729) | amistrn | FlatVectorFormat: New internal lucene storage format for HNSW vectors |
| [12405](https://github.com/apache/lucene/pull/12405) | jainankitk | See if it can help with aggregation related optimizations |
| [12784](https://github.com/apache/lucene/pull/12784) | sohami | Cache bucket optimization for BytesRefHash#sort. Understand this optimization |

I’ll create a GH issue in OpenSearch to discuss this.

Hi @sohami

I see we missed adding an entry assigning “Does Opensearch need constant_keyword fieldtype” to me (GitHub: hasnain2808).


Issue created: Should we consider adding support for taxonomy indices? · Issue #11355 · opensearch-project/OpenSearch · GitHub