OpenSearch Lucene Study Group Meeting - Monday, August 19th, 2024

msfroh · August 2, 2024, 7:58pm

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, June 17th, 2024

It’s been a couple of months since our last meeting. Late June was busy, then I was on vacation for almost all of July. Hopefully we’ll be able to get back to a weekly cadence moving forward.

There will probably be too many changes to review all of them this week, but I’ll try to do some triage ahead of the meeting to pick some to focus on. I’ll still run my script that collects all the issues and post the results here. If I skip over something that sounds interesting, please bring it up for this meeting or a future one.

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · August 19, 2024, 4:01pm

Version	Category	Description	Link
Lucene 10.0.0	API Changes	Remove deprecated TopScoreDocCollector + TopFieldCollector methods (#create, #createSharedManager)	https://github.com/apache/lucene/issues/13499
Lucene 10.0.0	API Changes	CandidateMatcher public matching functions	https://github.com/apache/lucene/issues/13632
Lucene 10.0.0	New Features	Add levels to doc values skip index.	https://github.com/apache/lucene/issues/13563
Lucene 10.0.0	New Features	Align doc value skipper interval boundaries when an interval contains a constant value.	https://github.com/apache/lucene/issues/13597
Lucene 10.0.0	New Features	Add Kmeans clustering on vectors	https://github.com/apache/lucene/issues/13604
Lucene 10.0.0	Other	Remove usage of TopScoreDocCollector + TopFieldCollector deprecated methods (#create, #createSharedManager)	https://github.com/apache/lucene/issues/13499
Lucene 10.0.0	Build	Fix Eclipse IDE settings generation.	https://github.com/apache/lucene/issues/13649
Lucene 9.12.0	API Changes	Expose FlatVectorsFormat as a first-class format; can be configured using a custom Codec.	https://github.com/apache/lucene/issues/13469
Lucene 9.12.0	API Changes	Hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions.	https://github.com/apache/lucene/issues/13612
Lucene 9.12.0	API Changes	Introduced `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)` protected method to facilitate customizing per-leaf search behavior without having to override `search(LeafReaderContext[], Weight, Collector)`, which requires overriding the entire loop across the leaves.	https://github.com/apache/lucene/issues/13603
Lucene 9.12.0	API Changes	Add BitSet#nextSetBit(int, int) to get the index of the first set bit in range.	https://github.com/apache/lucene/issues/13559
Lucene 9.12.0	API Changes	Add DoubleValuesSource#toSortableLongDoubleValuesSource and MultiDoubleValuesSource#toSortableMultiLongValuesSource methods.	https://github.com/apache/lucene/issues/13568
Lucene 9.12.0	API Changes	Add CollectorOwner class that wraps CollectorManager, and handles list of Collectors and results. Add IndexSearcher#search method that takes CollectorOwner.	https://github.com/apache/lucene/issues/13568
Lucene 9.12.0	API Changes	Add DrillSideways#search method that supports any collector types for any drill-sideways dimensions or drill-down.	https://github.com/apache/lucene/issues/13568
Lucene 9.12.0	New Features	Allow configuring the search concurrency via TieredMergePolicy#setTargetSearchConcurrency. This in turn instructs the merge policy to try to have at least this number of segments on the highest tier.	https://github.com/apache/lucene/issues/13430
Lucene 9.12.0	New Features	Allow configuring the search concurrency on LogDocMergePolicy and LogByteSizeMergePolicy via a new #setTargetConcurrency setter.	https://github.com/apache/lucene/issues/13517
Lucene 9.12.0	New Features	Add sandbox facets module to compute facets while collecting.	https://github.com/apache/lucene/issues/13568
Lucene 9.12.0	Improvements	Refactor and javadoc update for KNN vector writer classes.	https://github.com/apache/lucene/issues/13548
Lucene 9.12.0	Improvements	Add Intervals.regexp and Intervals.range methods to produce IntervalsSource for regexp and range queries.	https://github.com/apache/lucene/issues/13562
Lucene 9.12.0	Improvements	Remove BitSet#nextSetBit code duplication.	https://github.com/apache/lucene/issues/13625
Lucene 9.12.0	Improvements	Early terminate graph searches of AbstractVectorSimilarityQuery to follow timeout set from IndexSearcher#setTimeout(QueryTimeout).	https://github.com/apache/lucene/issues/13285
Lucene 9.12.0	Improvements	Add ability to read/write knn vector values to a MemoryIndex.	https://github.com/apache/lucene/issues/13633
Lucene 9.12.0	Improvements	Patch HNSW graphs to improve reachability of all nodes from entry points.	https://github.com/apache/lucene/issues/12627
Lucene 9.12.0	Improvements	Better cost estimation on MultiTermQuery over few terms.	https://github.com/apache/lucene/issues/13201
Lucene 9.12.0	Optimizations	Stop double-checking priority queue inserts in some FacetCount classes.	https://github.com/apache/lucene/issues/13175
Lucene 9.12.0	Optimizations	Slightly reduce heap usage for HNSW and scalar quantized vector writers.	https://github.com/apache/lucene/issues/13538
Lucene 9.12.0	Optimizations	WordBreakSpellChecker.suggestWordBreaks now does a breadth first search, allowing it to return better matches with fewer evaluations	https://github.com/apache/lucene/issues/12100
Lucene 9.12.0	Optimizations	Stop requiring MaxScoreBulkScorer's outer window to have at least INNER_WINDOW_SIZE docs.	https://github.com/apache/lucene/issues/13582
Lucene 9.12.0	Optimizations	GITHUB#13574, GITHUB#13535: Avoid performance degradation with closing shared Arenas. Closing many individual index files can potentially lead to a degradation in execution performance. Index files are mmapped one-to-one with the JDK's foreign shared Arena. The JVM deoptimizes the top few frames of all threads when closing a shared Arena (see JDK-8335480). We mitigate this situation by 1) using a confined Arena where appropriate, and 2) grouping files from the same segment to a single shared Arena.	https://github.com/apache/lucene/issues/13570
Lucene 9.12.0	Optimizations	Lucene912PostingsFormat, the new default postings format, now only has 2 levels of skip data, which are inlined into postings instead of being stored at the end of postings lists. This translates into better performance for queries that need skipping such as conjunctions.	https://github.com/apache/lucene/issues/13585
Lucene 9.12.0	Optimizations	OnHeapHnswGraph no longer allocates a lock for every graph node	https://github.com/apache/lucene/issues/13581
Lucene 9.12.0	Optimizations	GITHUB#13658: Optimizations to the decoding logic of blocks of postings.	https://github.com/apache/lucene/issues/13636
Lucene 9.12.0	Optimizations	Improve NumericComparator competitive iterator logic by comparing the missing value with the top value even after the hit queue is full.	https://github.com/apache/lucene/issues/13644
Lucene 9.12.0	Changes in runtime behavior	When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1 to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no longer required.	https://github.com/apache/lucene/issues/13472
Lucene 9.12.0	Bug Fixes	Fix highlighter to use longer passages instead of shorter individual terms.	https://github.com/apache/lucene/issues/13384
Lucene 9.12.0	Bug Fixes	Address bug in MultiLeafKnnCollector causing #minCompetitiveSimilarity to stay artificially low in some corner cases.	https://github.com/apache/lucene/issues/13463
Lucene 9.12.0	Bug Fixes	Correct RamUsageEstimate for scalar quantized knn vector formats so that raw vectors are correctly accounted for.	https://github.com/apache/lucene/issues/13553
Lucene 9.12.0	Bug Fixes	Correct scalar quantization when used in conjunction with COSINE similarity. Vectors are normalized before quantization to ensure the cosine similarity is correctly calculated.	https://github.com/apache/lucene/issues/13615
Lucene 9.12.0	Bug Fixes	Fix race condition on flush for DWPT seqNo generation.	https://github.com/apache/lucene/issues/13627
Lucene 9.11.1	Bug Fixes	Avoid performance regression by lazily constructing the PointTree in NumericComparator.	https://github.com/apache/lucene/issues/13498
Lucene 9.11.1	Bug Fixes	GITHUB#13478: Remove intra-merge parallelism for everything except HNSW graph merges.	https://github.com/apache/lucene/issues/13501
Lucene 9.11.1	Bug Fixes	GITHUB#13340: Allow adding a parent field to an index with no fields.	https://github.com/apache/lucene/issues/13498
Lucene 9.11.1	Bug Fixes	Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches.	https://github.com/apache/lucene/issues/12431
Lucene 9.11.1	Bug Fixes	StringValueFacetCounts stops throwing NPE when faceting over an empty match-set.	https://github.com/apache/lucene/issues/13493
Lucene 9.10.0	New Features	For indices newly created as of 9.10.0 onwards, IndexWriter preserves document blocks indexed via IndexWriter#addDocuments or IndexWriter#updateDocuments even when index sorting is configured. Document blocks are maintained alongside their parent documents during sort and merge. IndexWriterConfig accepts a parent field that is used to maintain block orders if index sorting is used. Note that this is fully optional in Lucene 9.x but will be mandatory as of 10.0.0 for indices that use document blocks together with index sorting.	https://github.com/apache/lucene/issues/12829
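
The "Changes in runtime behavior" entry for 9.12.0 (#13472) is probably the one most likely to affect how we size thread pools, so here is a rough sketch of the general idea using only `java.util.concurrent`. This is not Lucene's actual implementation; the class `CallerRunsSketch` and the helpers `invokeAll` and `squares` are hypothetical names made up for illustration. The assumption being illustrated is that the calling thread takes part in running the tasks instead of only waiting on the executor, which is why starting the fan-out from a thread of the same pool does not deadlock, even when the pool has a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

// Hypothetical illustration only; none of this is Lucene code.
final class CallerRunsSketch {

    // Runs all tasks, using both the executor and the calling thread.
    static <T> List<T> invokeAll(List<Callable<T>> tasks, Executor executor) {
        List<FutureTask<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(new FutureTask<>(task));
        }
        // Offer all but the last task to the executor.
        for (int i = 0; i < futures.size() - 1; i++) {
            executor.execute(futures.get(i));
        }
        // The caller then tries to run every task itself. FutureTask.run()
        // returns immediately if the task was already started or completed
        // elsewhere, so each task still executes exactly once.
        for (FutureTask<T> future : futures) {
            future.run();
        }
        // Collect results in task order, waiting for any task that a pool
        // thread picked up first.
        List<T> results = new ArrayList<>();
        for (FutureTask<T> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return results;
    }

    // Stand-in for per-slice work: task i returns i * i.
    static List<Callable<Integer>> squares(int n) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            final int value = i;
            tasks.add(() -> value * value);
        }
        return tasks;
    }

    public static void main(String[] args) throws Exception {
        // A single-thread pool: the outer task occupies the only worker.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Callable<List<Integer>> outer = () -> invokeAll(squares(4), pool);
            // The fan-out starts from the pool's own thread and still finishes.
            System.out.println(pool.submit(outer).get()); // prints [0, 1, 4, 9]
        } finally {
            pool.shutdown();
        }
    }
}
```

With a pattern like this, a pool of N threads gives up to N + 1 threads working on one request, which matches the changelog's advice to reduce the executor's thread count by 1 to keep the previous level of parallelism.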
