OpenSearch Lucene Study Group Meeting - Thursday, May 23rd, 2024

msfroh · May 20, 2024, 2:30pm

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, April 29th, 2024

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-sourced search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back for the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · May 20, 2024, 2:40pm

Since it’s been a while, we have a longer list of Lucene issues to review this week:

Version	Category	Description	Link
Lucene 10.0.0	API Changes	Introduce new `IndexInput#prefetch(long)` API to give a hint to the directory about bytes that are about to be read.	https://github.com/apache/lucene/issues/13337
Lucene 10.0.0	Other	Improve MissingDoclet linter to check records correctly.	https://github.com/apache/lucene/issues/13332
Lucene 9.11.0	New Features	Add new VectorScorer interface to vector value iterators. This allows for vector codecs to supply simpler and more optimized vector scoring when iterating vector values directly.	https://github.com/apache/lucene/issues/13181
Lucene 9.11.0	Improvements	Add sub query explanations to DisjunctionMaxQuery, if the overall query didn't match.	https://github.com/apache/lucene/issues/13362
Lucene 9.11.0	Optimizations	Use RWLock to access LRUQueryCache to reduce contention.	https://github.com/apache/lucene/issues/13306
Lucene 9.11.0	Optimizations	Improve compressed int4 quantized vector search by utilizing SIMD inline with the decompression process.	https://github.com/apache/lucene/issues/13321
Lucene 9.11.0	Optimizations	Lazy initialization improvements for Facets implementations when there are segments with no hits to count.	https://github.com/apache/lucene/issues/12408
Lucene 9.11.0	Optimizations	Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader.	https://github.com/apache/lucene/issues/13327
Lucene 9.11.0	Optimizations	Replace Map by primitive IntObjectHashMap.	https://github.com/apache/lucene/issues/13368
Lucene 9.11.0	Bug Fixes	Ensure negative scores are not returned from scalar quantization scorer.	https://github.com/apache/lucene/issues/13356
Lucene 9.11.0	Bug Fixes	Disallow NaN and Inf values in scalar quantization and better handle extreme cases.	https://github.com/apache/lucene/issues/13366
Lucene 9.11.0	Bug Fixes	Fix NRT opening failure when soft deletes are enabled and the document fails to index before a point field is written	https://github.com/apache/lucene/issues/13369
Lucene 9.11.0	Bug Fixes	Fix points writing with no values	https://github.com/apache/lucene/issues/13378
Lucene 9.11.0	Bug Fixes	Fix bug in SQ when just a single vector present in a segment	https://github.com/apache/lucene/issues/13374
Lucene 9.11.0	Bug Fixes	Fix integer overflow exception in postings encoding as group-varint.	https://github.com/apache/lucene/issues/13376
Lucene 9.11.0	Other	Make NO_INTERVALS source as public to be used by Lucene clients instead of creating clones themselves	https://github.com/apache/lucene/issues/13385

msfroh · May 20, 2024, 4:19pm

I originally scheduled this for Monday, May 20th, but @lukas-vlcek kindly reminded me that it’s a holiday in much of Europe and in Canada (and maybe elsewhere?)

I’ve rescheduled this week’s meeting to Thursday instead.

msfroh · May 23, 2024, 4:37pm

@Navneet, do you know what “Add new VectorScorer interface to vector value iterators” (apache/lucene pull request #13181, by benwtrent) is doing? It sounds cool, but it’s over my head.

msfroh · May 23, 2024, 7:35pm

Summary of this week’s meeting:

We spent some time talking about “Add IndexInput#prefetch” (apache/lucene pull request #13337, by jpountz) and the broader issue “Improve Lucene's I/O concurrency” (apache/lucene issue #13179). Both look at eagerly paging parts of files into memory before they’re needed, which really helps when the whole index doesn’t fit in the page cache. The benchmarks in the comments suggest a roughly 50% reduction in latency for those cases. This is exciting for OpenSearch’s remote store feature, which uses a custom Lucene Directory implementation to “page” file chunks from the remote store onto local disk. While prefetching saves microseconds on disk access, it can save milliseconds (or more) on fetching from a remote store.
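To make the prefetching idea concrete, here is a minimal sketch in plain Java. It is not the Lucene `IndexInput#prefetch` API or OpenSearch’s remote store code: `PrefetchSketch`, `fetchChunk`, `prefetch`, `read`, and `readAll` are all hypothetical names, and the 20 ms sleep is an assumed stand-in for a slow fetch. The point is only that announcing all the chunks up front lets the slow fetches overlap instead of running one after another.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a directory that pages chunks in from slow storage.
// None of these names are Lucene or OpenSearch APIs.
final class PrefetchSketch {
    private final ExecutorService io = Executors.newFixedThreadPool(4);
    private final ConcurrentHashMap<Integer, CompletableFuture<byte[]>> inFlight =
        new ConcurrentHashMap<>();

    // Simulated slow fetch (assumption: 20 ms per chunk); returns a one-byte chunk.
    private static byte[] fetchChunk(int chunkId) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new byte[] {(byte) chunkId};
    }

    // Hint that a chunk will be needed soon: start fetching it, but do not wait.
    void prefetch(int chunkId) {
        inFlight.computeIfAbsent(
            chunkId, id -> CompletableFuture.supplyAsync(() -> fetchChunk(id), io));
    }

    // Blocking read: reuses the in-flight fetch if one was started by prefetch().
    byte[] read(int chunkId) {
        return inFlight
            .computeIfAbsent(chunkId, id -> CompletableFuture.supplyAsync(() -> fetchChunk(id), io))
            .join();
    }

    // Prefetch every chunk first, then read them in order, so the fetches overlap.
    static int[] readAll(int[] chunkIds) {
        PrefetchSketch dir = new PrefetchSketch();
        try {
            for (int id : chunkIds) {
                dir.prefetch(id);
            }
            int[] firstBytes = new int[chunkIds.length];
            for (int i = 0; i < chunkIds.length; i++) {
                firstBytes[i] = dir.read(chunkIds[i])[0];
            }
            return firstBytes;
        } finally {
            dir.io.shutdown();
        }
    }
}
```

Without the `prefetch` loop, each `read` would wait for its own fetch before the next one starts. With it, up to four fetches (the assumed pool size) are in flight at once.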

We also looked at the exciting vector improvements, especially the amazing work done to support SIMD-accelerated dot-product on compressed int4 vectors (“Improve int4 compressed comparisons performance”, apache/lucene pull request #13321, by benwtrent).
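For anyone who wants a picture of what “dot-product on compressed int4 vectors” means, here is a scalar sketch in plain Java. The nibble layout (even index in the low four bits, odd index in the high four bits) is an assumption chosen for readability and may not match Lucene’s on-disk layout, and `Int4DotSketch`, `pack`, and `dot` are hypothetical names. There is no SIMD here; the sketch only shows the unpack-and-multiply work that a vectorized implementation would perform many lanes at a time.

```java
// Scalar illustration only; layout and names are assumptions, not Lucene's code.
final class Int4DotSketch {
    // Pack values in [0, 15] two per byte: even index in the low nibble,
    // odd index in the high nibble.
    static byte[] pack(int[] values) {
        byte[] packed = new byte[(values.length + 1) / 2];
        for (int i = 0; i < values.length; i++) {
            int v = values[i] & 0x0F;
            packed[i / 2] |= (i % 2 == 0) ? v : (v << 4);
        }
        return packed;
    }

    // Dot product of an unpacked query against a packed document vector,
    // unpacking each nibble on the fly instead of decompressing first.
    static int dot(int[] query, byte[] packedDoc) {
        int sum = 0;
        for (int i = 0; i < query.length; i++) {
            int b = packedDoc[i / 2] & 0xFF;
            int d = (i % 2 == 0) ? (b & 0x0F) : (b >>> 4);
            sum += query[i] * d;
        }
        return sum;
    }
}
```

For example, packing the five values `{15, 0, 7, 8, 1}` takes three bytes instead of five, and `dot(new int[] {1, 2, 3, 4, 5}, packed)` gives the same result as the dot product over the unpacked values.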

Related to SIMD optimizations and their use of Project Panama, we talked a bit about the state of JDK 21 for Lucene and OpenSearch, including a feared performance regression reported in “JDK 21, lusearch, and Lucene "regression"” (dacapobench/dacapobench issue #264). That turned out to be a result of bad behavior in the benchmark: it recreated an IndexReader on every iteration in a loop across many threads, instead of instantiating one IndexReader and letting every thread use it.
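The difference between the two benchmark patterns can be sketched without Lucene at all. In the snippet below, `FakeReader` is a hypothetical stand-in for something expensive to open (it is not `IndexReader`), and `countOpens` simply counts how many times a reader gets constructed under each pattern.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; FakeReader is not a Lucene class.
final class SharedReaderSketch {
    // Stand-in for a reader that is costly to open but cheap to query.
    static final class FakeReader {
        FakeReader(AtomicInteger opens) {
            opens.incrementAndGet();
        }

        int search(int query) {
            return query * 2;
        }
    }

    // Runs `threads` workers doing `iterations` searches each and returns
    // how many readers were opened in total.
    static int countOpens(boolean share, int threads, int iterations) {
        AtomicInteger opens = new AtomicInteger();
        FakeReader shared = share ? new FakeReader(opens) : null;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    // Either reuse the single shared reader or open a new one per search.
                    FakeReader reader = (shared != null) ? shared : new FakeReader(opens);
                    reader.search(i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return opens.get();
    }
}
```

With 4 threads and 100 iterations, the per-iteration pattern opens 400 readers while the shared pattern opens one, so a benchmark written the first way mostly measures reader construction rather than search.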

We discussed “Performance improvements to use RWLock to access LRUQueryCache” (apache/lucene pull request #13306, by boicehuang) and how the Lucene query cache works more broadly.
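As a rough illustration of why a read-write lock reduces contention on a cache, here is a small bounded cache in plain Java. It is a sketch under stated assumptions, not Lucene’s `LRUQueryCache`: `RwLockCacheSketch` is a hypothetical class, and it evicts the oldest inserted entry rather than the least recently used one, so that lookups never modify the map and can safely share the read lock.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical bounded cache; not Lucene's implementation.
final class RwLockCacheSketch<K, V> {
    private final LinkedHashMap<K, V> map;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    RwLockCacheSketch(int maxSize) {
        // Insertion order (accessOrder = false): get() does not reorder entries,
        // so concurrent readers never mutate the map.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    // Lookups take the read lock, so many threads can read at the same time.
    V get(K key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Inserts (and the eviction they may trigger) take the exclusive write lock.
    void put(K key, V value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a single mutex, every cache hit would queue behind every other hit. Here only writers exclude other threads, which helps most when lookups greatly outnumber insertions.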

For next week, we decided to spend time talking some more about OpenSearch’s aggregations and Lucene’s facets. As discussed in “[DISCUSS] Identifying Gaps in Lucene’s Faceting” (apache/lucene issue #12553), there are some great opportunities to cross-pollinate between the projects.

Link to YouTube video of the meeting coming soon…

kris · May 28, 2024, 5:06pm
