OpenSearch Lucene Study Group Meeting - Monday, November 20th

Welcome to the first public meeting of the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.

In this meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework”: digging deeper into a change and reporting back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.


New issues to review this week:

API Changes: Add TaxonomyReader#getBulkOrdinals method to more efficiently retrieve facet ordinals for multiple FacetLabel at once.
New Features: Added similarityToQueryVector API to compute vector similarity scores with DoubleValuesSource.
New Features: Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per segment basis and maintained across merges. The property is exposed via LeafReaderMetadata.
New Features: Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor Lucene99HnswVectorsFormat to reuse the flat format. Added new Lucene99HnswQuantizedVectorsFormat that uses quantized vectors for its flat storage.
Improvements: Remove possible contention on a ReentrantReadWriteLock in Monitor which could result in searches waiting for commits.
Improvements: LUCENE-10241: Upgrade OpenNLP to 1.9.4.
Optimizations: Skip docs with DocValues in NumericLeafComparator.
Optimizations: Specialize arc store for continuous label in FST.
Optimizations: Cache buckets to speed up BytesRefHash#sort.
Optimizations: Utilize exact kNN search when gathering k >= numVectors in a segment.
Bug Fixes: TestIndexWriterOnVMError.testUnknownError times out (fixes potential IndexWriter deadlock with tragic exceptions).
Bug Fixes: Stop exploring HNSW graph if scores are not getting better.
Bug Fixes: Ensure #finish is called on all drill-sideways collectors even if one throws a CollectionTerminatedException.
Bug Fixes: Fix segmentInfos replace to set userData.
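
To make the "exact kNN search when k >= numVectors" item a bit more concrete ahead of the discussion: when a query asks for at least as many neighbors as a segment holds, every vector ends up in the result anyway, so scoring all of them directly avoids the cost of a graph traversal. The following is only a rough sketch of that idea in plain Java. The class name, method names, and the use of a dot product as the score are illustrative assumptions, not Lucene's actual code.

```java
import java.util.Arrays;

/** Illustrative sketch only; the names here are hypothetical and not part of Lucene. */
final class ExactKnnSketch {

    /** Assumed decision rule: fall back to exact search when k covers the whole segment. */
    static boolean shouldUseExactSearch(int k, int numVectors) {
        return k >= numVectors;
    }

    /** Dot product, used here as a stand-in similarity score. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Scores every vector against the query and returns up to k vector ids, best first. */
    static int[] exactTopK(float[][] vectors, float[] query, int k) {
        float[] scores = new float[vectors.length];
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < vectors.length; i++) {
            scores[i] = dot(vectors[i], query);
            ids[i] = i;
        }
        // Sort ids by descending score.
        Arrays.sort(ids, (a, b) -> Float.compare(scores[b], scores[a]));
        int[] top = new int[Math.min(k, ids.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = ids[i];
        }
        return top;
    }

    public static void main(String[] args) {
        float[][] segment = {{1f, 0f}, {0f, 1f}, {0.6f, 0.8f}};
        float[] query = {1f, 0f};
        int k = 5;
        if (shouldUseExactSearch(k, segment.length)) {
            System.out.println(Arrays.toString(exactTopK(segment, query, k)));
        }
    }
}
```

In this sketch the brute-force pass costs one score computation per vector, which is the lower bound for a query that must return every vector in the segment.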

This was previously an internal Amazon meeting, but we’ve taken it public. Below is the “homework” assigned during the last internal meeting, which we’ll review in this week’s meeting.

API Changes: Automata#makeStringUnion and #makeBinaryStringUnion now accept Iterable instead of Collection. They also now explicitly throw IllegalArgumentException if input data is not properly sorted instead of relying on assert. (assignee: Kumar Maurya)
New Features: Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy storage for the vectors, requiring about 75% less memory for fast HNSW search. (assignee: Verma)
New Features: HNSW graph can now be merged with multiple threads. Configurable in Lucene99HnswVectorsFormat. (assignee: Verma)
Optimizations: Disjunctions now sometimes run as conjunctions when the minimum competitive score requires multiple clauses to match. (assignee: Kumar Maurya)
Optimizations: Top-level conjunctions that are not sorted by score now have a specialized bulk scorer. (assignee: Singh)
Optimizations: Faster merging of terms enums. (assignee: Froh)
Optimizations: Faster sort on high-cardinality string fields. (assignee: Vamsi Kalluri)
Bug Fixes: Ensure negative scores are not returned by vector similarity functions. (assignee: Mazanec)
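
As background for the int8 scalar quantization item, here is a minimal sketch of the general technique: each 4-byte float component is mapped linearly onto a single signed byte, which is where the memory saving comes from, at the cost of a small rounding error. This is not Lucene's implementation. The class name, the fixed `min`/`max` range, and the rounding scheme are assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Illustrative sketch of int8 scalar quantization; not Lucene's actual implementation. */
final class ScalarQuantizationSketch {

    /** Maps each float in [min, max] linearly onto the 256 values of a signed byte. */
    static byte[] quantize(float[] vector, float min, float max) {
        float scale = 255f / (max - min);
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            // Clamp outliers to the chosen range before scaling.
            float clamped = Math.max(min, Math.min(max, vector[i]));
            // Shift 0..255 down to -128..127 so the value fits a signed byte.
            out[i] = (byte) (Math.round((clamped - min) * scale) - 128);
        }
        return out;
    }

    /** Approximately reverses quantize(); the rounding error is what makes the storage lossy. */
    static float[] dequantize(byte[] quantized, float min, float max) {
        float step = (max - min) / 255f;
        float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            out[i] = min + (quantized[i] + 128) * step;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vector = {-1.0f, -0.25f, 0.0f, 0.5f, 1.0f};
        byte[] q = quantize(vector, -1.0f, 1.0f);
        float[] restored = dequantize(q, -1.0f, 1.0f);
        System.out.println("quantized: " + Arrays.toString(q));
        System.out.println("restored:  " + Arrays.toString(restored));
        System.out.println("bytes per vector: " + vector.length * Float.BYTES + " -> " + q.length);
    }
}
```

Going from four bytes to one byte per dimension is the source of the roughly 75% reduction. How the range is chosen (for example, from quantiles of the indexed data) is a design decision this sketch leaves out.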

I recently opened a ticket in Lucene and would like to get some feedback on it. It could be a good starting point for explaining some basic concepts and the testing framework:

On Issues · apache/lucene · GitHub

  • Increase number of hits fetched.
  • Check impact on search_after after we upgrade to Lucene 9.9

Suggestion from Ankit Jain – let’s collect Lucene videos. Add links to this post.

| PR link | Assignees | Notes |
|---|---|---|
| 12180 | | Discuss issue in OS if we need taxonomy index support |
| 12548 | samuel-oci | KNN - Similarity to query vector |
| 12685 | reta | For nested docs, segment info will have a flag. See if we can skip segments using this for nested queries |
| 12729 | amistrn | FlatVectorFormat: New internal Lucene storage format for HNSW vectors |
| 12405 | jainankitk | See if it can help with aggregation related optimizations |
| 12784 | sohami | Cache bucket optimization for BytesRefHash#sort. Understand this optimization |

I’ll create a GH issue in OpenSearch to discuss this.

Hi @sohami

I see we missed adding an entry for “Does Opensearch need constant_keyword fieldtype”, assigned to me (GitHub → hasnain2808).

Issue created: Should we consider adding support for taxonomy indices? · Issue #11355 · opensearch-project/OpenSearch · GitHub