OpenSearch Lucene Study Group Meeting - Monday, August 19th, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, June 17th, 2024

It’s been a couple of months since our last meeting. Late June was busy, then I was on vacation for almost all of July. Hopefully we’ll be able to get back to a weekly cadence moving forward.

There will probably be too many changes to review all of them this week, but I’ll try to do some triage ahead of the meeting to pick some to focus on. I’ll still run my script that collects all the issues and post the results here. If I skip over something that sounds interesting, please bring it up for this meeting or a future one.


Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

| Version | Category | Description | Link |
|---|---|---|---|
| Lucene 10.0.0 | API Changes | Remove deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager) | https://github.com/apache/lucene/issues/13499 |
| Lucene 10.0.0 | API Changes | CandidateMatcher public matching functions | https://github.com/apache/lucene/issues/13632 |
| Lucene 10.0.0 | New Features | Add levels to doc values skip index. | https://github.com/apache/lucene/issues/13563 |
| Lucene 10.0.0 | New Features | Align doc value skipper interval boundaries when an interval contains a constant value. | https://github.com/apache/lucene/issues/13597 |
| Lucene 10.0.0 | New Features | Add Kmeans clustering on vectors | https://github.com/apache/lucene/issues/13604 |
| Lucene 10.0.0 | Other | Remove usage of TopScoreDocCollector + TopFieldCollector deprecated methods (#create, #createSharedManager) | https://github.com/apache/lucene/issues/13499 |
| Lucene 10.0.0 | Build | Fix eclipse ide settings generation #13649 | https://github.com/apache/lucene/issues/13649 |
| Lucene 9.12.0 | API Changes | Expose FlatVectorsFormat as a first-class format; can be configured using a custom Codec. | https://github.com/apache/lucene/issues/13469 |
| Lucene 9.12.0 | API Changes | Hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions. | https://github.com/apache/lucene/issues/13612 |
| Lucene 9.12.0 | API Changes | Introduced `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)` protected method to facilitate customizing per-leaf behavior of search without requiring to override `search(LeafReaderContext[], Weight, Collector)` which requires overriding the entire loop across the leaves | https://github.com/apache/lucene/issues/13603 |
| Lucene 9.12.0 | API Changes | Add BitSet#nextSetBit(int, int) to get the index of the first set bit in range. | https://github.com/apache/lucene/issues/13559 |
| Lucene 9.12.0 | API Changes | Add DoubleValuesSource#toSortableLongDoubleValuesSource and MultiDoubleValuesSource#toSortableMultiLongValuesSource methods. | https://github.com/apache/lucene/issues/13568 |
| Lucene 9.12.0 | API Changes | Add CollectorOwner class that wraps CollectorManager, and handles list of Collectors and results. Add IndexSearcher#search method that takes CollectorOwner. | https://github.com/apache/lucene/issues/13568 |
| Lucene 9.12.0 | API Changes | Add DrillSideways#search method that supports any collector types for any drill-sideways dimensions or drill-down. | https://github.com/apache/lucene/issues/13568 |
| Lucene 9.12.0 | New Features | Allow configuring the search concurrency via TieredMergePolicy#setTargetSearchConcurrency. This in-turn instructs the merge policy to try to have at least this number of segments on the highest tier. | https://github.com/apache/lucene/issues/13430 |
| Lucene 9.12.0 | New Features | Allow configuring the search concurrency on LogDocMergePolicy and LogByteSizeMergePolicy via a new #setTargetConcurrency setter. | https://github.com/apache/lucene/issues/13517 |
| Lucene 9.12.0 | New Features | Add sandbox facets module to compute facets while collecting. | https://github.com/apache/lucene/issues/13568 |
| Lucene 9.12.0 | Improvements | Refactor and javadoc update for KNN vector writer classes. | https://github.com/apache/lucene/issues/13548 |
| Lucene 9.12.0 | Improvements | Add Intervals.regexp and Intervals.range methods to produce IntervalsSource for regexp and range queries. | https://github.com/apache/lucene/issues/13562 |
| Lucene 9.12.0 | Improvements | Remove BitSet#nextSetBit code duplication. | https://github.com/apache/lucene/issues/13625 |
| Lucene 9.12.0 | Improvements | Early terminate graph searches of AbstractVectorSimilarityQuery to follow timeout set from IndexSearcher#setTimeout(QueryTimeout). | https://github.com/apache/lucene/issues/13285 |
| Lucene 9.12.0 | Improvements | Add ability to read/write knn vector values to a MemoryIndex. | https://github.com/apache/lucene/issues/13633 |
| Lucene 9.12.0 | Improvements | patch HNSW graphs to improve reachability of all nodes from entry points | https://github.com/apache/lucene/issues/12627 |
| Lucene 9.12.0 | Improvements | Better cost estimation on MultiTermQuery over few terms. | https://github.com/apache/lucene/issues/13201 |
| Lucene 9.12.0 | Optimizations | Stop double-checking priority queue inserts in some FacetCount classes. | https://github.com/apache/lucene/issues/13175 |
| Lucene 9.12.0 | Optimizations | Slightly reduce heap usage for HNSW and scalar quantized vector writers. | https://github.com/apache/lucene/issues/13538 |
| Lucene 9.12.0 | Optimizations | WordBreakSpellChecker.suggestWordBreaks now does a breadth first search, allowing it to return better matches with fewer evaluations | https://github.com/apache/lucene/issues/12100 |
| Lucene 9.12.0 | Optimizations | Stop requiring MaxScoreBulkScorer's outer window from having at least INNER_WINDOW_SIZE docs. | https://github.com/apache/lucene/issues/13582 |
| Lucene 9.12.0 | Optimizations | GITHUB#13574, GITHUB#13535: Avoid performance degradation with closing shared Arenas. Closing many individual index files can potentially lead to a degradation in execution performance. Index files are mmapped one-to-one with the JDK's foreign shared Arena. The JVM deoptimizes the top few frames of all threads when closing a shared Arena (see JDK-8335480). We mitigate this situation by 1) using a confined Arena where appropriate, and 2) grouping files from the same segment to a single shared Arena. | https://github.com/apache/lucene/issues/13570 |
| Lucene 9.12.0 | Optimizations | Lucene912PostingsFormat, the new default postings format, now only has 2 levels of skip data, which are inlined into postings instead of being stored at the end of postings lists. This translates into better performance for queries that need skipping such as conjunctions. | https://github.com/apache/lucene/issues/13585 |
| Lucene 9.12.0 | Optimizations | OnHeapHnswGraph no longer allocates a lock for every graph node | https://github.com/apache/lucene/issues/13581 |
| Lucene 9.12.0 | Optimizations | GITHUB#13658: Optimizations to the decoding logic of blocks of postings. | https://github.com/apache/lucene/issues/13636 |
| Lucene 9.12.0 | Optimizations | Improve NumericComparator competitive iterator logic by comparing the missing value with the top value even after the hit queue is full | https://github.com/apache/lucene/issues/13644 |
| Lucene 9.12.0 | Changes in runtime behavior | When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1 to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no longer required. | https://github.com/apache/lucene/issues/13472 |
| Lucene 9.12.0 | Bug Fixes | Fix highlighter to use longer passages instead of shorter individual terms. | https://github.com/apache/lucene/issues/13384 |
| Lucene 9.12.0 | Bug Fixes | Address bug in MultiLeafKnnCollector causing #minCompetitiveSimilarity to stay artificially low in some corner cases. | https://github.com/apache/lucene/issues/13463 |
| Lucene 9.12.0 | Bug Fixes | Correct RamUsageEstimate for scalar quantized knn vector formats so that raw vectors are correctly accounted for. | https://github.com/apache/lucene/issues/13553 |
| Lucene 9.12.0 | Bug Fixes | Correct scalar quantization when used in conjunction with COSINE similarity. Vectors are normalized before quantization to ensure the cosine similarity is correctly calculated. | https://github.com/apache/lucene/issues/13615 |
| Lucene 9.12.0 | Bug Fixes | Fix race condition on flush for DWPT seqNo generation. | https://github.com/apache/lucene/issues/13627 |
| Lucene 9.11.1 | Bug Fixes | Avoid performance regression by constructing lazily the PointTree in NumericComparator. | https://github.com/apache/lucene/issues/13498 |
| Lucene 9.11.1 | Bug Fixes | GITHUB#13478: Remove intra-merge parallelism for everything except HNSW graph merges. | https://github.com/apache/lucene/issues/13501 |
| Lucene 9.11.1 | Bug Fixes | GITHUB#13340: Allow adding a parent field to an index with no fields | https://github.com/apache/lucene/issues/13498 |
| Lucene 9.11.1 | Bug Fixes | Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches. | https://github.com/apache/lucene/issues/12431 |
| Lucene 9.11.1 | Bug Fixes | StringValueFacetCounts stops throwing NPE when faceting over an empty match-set. | https://github.com/apache/lucene/issues/13493 |
| Lucene 9.10.0 | New Features | For indices newly created as of 9.10.0 onwards, IndexWriter preserves document blocks indexed via IndexWriter#addDocuments or IndexWriter#updateDocuments also when index sorting is configured. Document blocks are maintained alongside their parent documents during sort and merge. IndexWriterConfig accepts a parent field that is used to maintain block orders if index sorting is used. Note, this is fully optional in Lucene 9.x while will be mandatory for indices that use document blocks together with index sorting as of 10.0.0. | https://github.com/apache/lucene/issues/12829 |
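
The 9.12.0 "Changes in runtime behavior" entry above says the thread that invokes a search now runs tasks alongside the configured executor, so the executor needs one thread fewer to keep the same parallelism. Here is a minimal sketch of that idea using only `java.util.concurrent`. It is not Lucene's implementation, and the names `CallerRunsSketch`, `executorThreadsFor`, and `runAll` are hypothetical; the real scheduling inside `IndexSearcher` may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical illustration only; this is not Lucene's task executor.
class CallerRunsSketch {

  // If the calling thread also does work, the executor needs one thread
  // fewer to reach the same total parallelism (kept at a minimum of one
  // here because a fixed thread pool cannot have zero threads).
  static int executorThreadsFor(int targetParallelism) {
    return Math.max(1, targetParallelism - 1);
  }

  // Hands every task except the last to the executor, runs the last task
  // on the calling thread, then returns the results in task order.
  static <T> List<T> runAll(Executor executor, List<Supplier<T>> tasks) {
    List<T> results = new ArrayList<>();
    if (tasks.isEmpty()) {
      return results;
    }
    List<CompletableFuture<T>> futures = new ArrayList<>();
    for (int i = 0; i < tasks.size() - 1; i++) {
      futures.add(CompletableFuture.supplyAsync(tasks.get(i), executor));
    }
    T last = tasks.get(tasks.size() - 1).get(); // runs on the caller thread
    for (CompletableFuture<T> future : futures) {
      results.add(future.join());
    }
    results.add(last);
    return results;
  }

  public static void main(String[] args) {
    int target = 4; // desired number of slices searched at once
    ExecutorService executor = Executors.newFixedThreadPool(executorThreadsFor(target));
    try {
      List<Supplier<Integer>> tasks = new ArrayList<>();
      for (int slice = 0; slice < target; slice++) {
        final int id = slice;
        tasks.add(() -> id * 10); // stand-in for searching one slice
      }
      System.out.println(runAll(executor, tasks)); // prints [0, 10, 20, 30]
    } finally {
      executor.shutdown();
    }
  }
}
```

With a target of four concurrent slices, the sketch sizes the pool at three threads and lets the caller handle the fourth slice, which mirrors the "reduce the executor's thread-count by 1" advice in the changelog entry.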
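
For the `BitSet#nextSetBit(int, int)` entry above, a rough illustration of "first set bit in range" may help. This sketch uses the JDK's `java.util.BitSet`, not Lucene's `org.apache.lucene.util.BitSet`. It assumes an inclusive start, an exclusive upper bound, and `-1` as the "not found" value; those are assumptions for illustration, so check the Lucene javadoc for the actual contract. The helper name `firstSetBitInRange` is hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration using the JDK BitSet, not Lucene's BitSet class.
class RangeBitSketch {

  // Returns the index of the first set bit in [start, upperBound),
  // or -1 if no bit is set in that range (assumed conventions).
  static int firstSetBitInRange(BitSet bits, int start, int upperBound) {
    int next = bits.nextSetBit(start); // -1 when nothing is set at or after start
    return (next >= 0 && next < upperBound) ? next : -1;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet();
    bits.set(3);
    bits.set(9);
    System.out.println(firstSetBitInRange(bits, 0, 5));  // 3 lies inside [0, 5)
    System.out.println(firstSetBitInRange(bits, 4, 9));  // no set bit in [4, 9)
    System.out.println(firstSetBitInRange(bits, 4, 10)); // 9 lies inside [4, 10)
  }
}
```

A bounded lookup like this lets a caller stop at a known limit without comparing the result against the bound after every call.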