OpenSearch Lucene Study Group Meeting - Friday, May 31st, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Thursday, May 23rd, 2024

After postponing last week’s meeting from Monday to Thursday to avoid the Monday holiday in Canada and chunks of Europe, we’re moving this week’s meeting to Friday to work around US Memorial Day. Eventually, we can wrap back around to get the meetings back onto Monday.

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework”: digging deeper into a change and reporting back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

This week we should have a shorter list of Lucene changes to review, so the plan from last week is to focus on the Lucene issue “[DISCUSS] Identifying Gaps in Lucene’s Faceting” (apache/lucene issue #12553). We will talk about how we might contribute some of the power of OpenSearch’s aggregations to Lucene’s facets module, and how to leverage Lucene’s facets module in OpenSearch (e.g., by taking advantage of index-time computation of global ordinals).
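
For anyone unfamiliar with the term, the idea behind global ordinals can be sketched roughly as follows. Each segment numbers its own sorted terms (local ordinals), and a mapping translates those numbers into one index-wide numbering so that counts from different segments can be added together. This is a simplified illustration only: the class and method names are hypothetical, terms are plain strings, and it does not show when or how Lucene or OpenSearch actually build such a mapping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified sketch; not Lucene's or OpenSearch's implementation.
final class GlobalOrdinalsSketch {

    // Sorted union of every segment's terms; a term's index in this list is its global ordinal.
    static List<String> globalTerms(List<List<String>> segmentTerms) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            union.addAll(terms);
        }
        return new ArrayList<>(union);
    }

    // ordMap[segment][localOrdinal] = globalOrdinal.
    // Assumes each segment's term list is sorted, so list position is the local ordinal.
    static int[][] buildOrdinalMap(List<List<String>> segmentTerms, List<String> global) {
        int[][] ordMap = new int[segmentTerms.size()][];
        for (int seg = 0; seg < segmentTerms.size(); seg++) {
            List<String> terms = segmentTerms.get(seg);
            ordMap[seg] = new int[terms.size()];
            for (int ord = 0; ord < terms.size(); ord++) {
                ordMap[seg][ord] = Collections.binarySearch(global, terms.get(ord));
            }
        }
        return ordMap;
    }

    // docOrds[segment][doc] holds the local ordinal of that document's value.
    static int[] countByGlobalOrdinal(int[][] docOrds, int[][] ordMap, int globalSize) {
        int[] counts = new int[globalSize];
        for (int seg = 0; seg < docOrds.length; seg++) {
            for (int localOrd : docOrds[seg]) {
                counts[ordMap[seg][localOrd]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two segments with different term dictionaries.
        List<List<String>> segments = List.of(List.of("blue", "red"), List.of("green", "red"));
        List<String> global = globalTerms(segments);
        int[][] ordMap = buildOrdinalMap(segments, global);
        // Segment 0 docs: blue, red, red. Segment 1 docs: red, green.
        int[][] docOrds = {{0, 1, 1}, {1, 0}};
        int[] counts = countByGlobalOrdinal(docOrds, ordMap, global.size());
        for (int i = 0; i < counts.length; i++) {
            System.out.println(global.get(i) + " = " + counts[i]);
        }
    }
}
```

In this toy example the counting loop never compares strings; it only increments integer slots, which is the main appeal of working with ordinals. The cost is building the mapping in the first place, and where that cost is paid is part of what the faceting discussion is about.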

Here is the list of Lucene changes since the previous meeting:

  • Lucene 10.0.0, API Changes: Moved Weight#bulkScorer() to ScorerSupplier#bulkScorer() to better help parallelize I/O for top-level disjunctions. Weight#bulkScorer() still exists for compatibility, but delegates to ScorerSupplier#bulkScorer().
  • Lucene 9.11.0, API Changes: An explicit dependency on the HPPC library is removed in favor of an internal repackaged copy in oal.internal.hppc. If you relied on HPPC as a transitive dependency, you'll have to add it to your project explicitly. The HPPC classes now bundled in Lucene core are internal and will have restricted access in future releases; please do not use them.
  • Lucene 9.11.0, New Features: Counts are always available in the result when using taxonomy facets.
  • Lucene 9.11.0, Improvements: Add Intervals.noIntervals() method to produce an empty IntervalsSource.
  • Lucene 9.11.0, Improvements: UnifiedHighlighter: new 'passageSortComparator' option to allow sorting other than offset order.
  • Lucene 9.11.0, Improvements: Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore.
  • Lucene 9.11.0, Optimizations: Replace Map<Long, Object> by primitive LongObjectHashMap.
  • Lucene 9.11.0, Optimizations: Add a MemorySegment Vector scorer, for scoring without copying on-heap.
  • Lucene 9.11.0, Optimizations: Replace Set<Integer> by IntHashSet and Set<Long> by LongHashSet.
  • Lucene 9.11.0, Optimizations: Replace List<Integer> by IntArrayList and List<Long> by LongArrayList.
  • Lucene 9.11.0, Bug Fixes: Fixes TestOrdinalMap.testRamBytesUsed for multiple default PackedInts.NullReader instances.
  • Lucene 9.11.0, Other: Add support for reloading the SPI for KnnVectorsFormat class.
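
Several of the 9.11.0 optimization entries swap boxed java.util collections for primitive-keyed ones such as LongObjectHashMap. As a rough illustration of why that helps, here is a minimal sketch of a map that stores long keys directly in a long[] with linear probing, so no Long objects are allocated for keys. The class name and layout are hypothetical and deliberately simplified (no removal, fixed load factor); this is not the HPPC or Lucene implementation.

```java
// Hypothetical, simplified sketch; not the HPPC or Lucene LongObjectHashMap.
final class LongKeyMapSketch<V> {
    private long[] keys = new long[8];        // keys stored as primitives, no boxing
    private Object[] values = new Object[8];
    private boolean[] used = new boolean[8];  // marks occupied slots (so key 0 is allowed)
    private int size;

    // Returns the slot holding the key, or the first free slot on its probe sequence.
    // The table is kept at most half full, so a free slot always exists.
    private int findSlot(long key) {
        int mask = keys.length - 1; // table length is always a power of two
        int slot = Long.hashCode(key * 0x9E3779B97F4A7C15L) & mask;
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask; // linear probing
        }
        return slot;
    }

    void put(long key, V value) {
        if ((size + 1) * 2 > keys.length) {
            grow();
        }
        int slot = findSlot(key);
        if (!used[slot]) {
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int slot = findSlot(key);
        return used[slot] ? (V) values[slot] : null;
    }

    int size() {
        return size;
    }

    // Doubles the table and reinserts every existing entry.
    @SuppressWarnings("unchecked")
    private void grow() {
        long[] oldKeys = keys;
        Object[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        values = new Object[keys.length];
        used = new boolean[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], (V) oldValues[i]);
            }
        }
    }

    public static void main(String[] args) {
        LongKeyMapSketch<String> map = new LongKeyMapSketch<>();
        map.put(42L, "forty-two");
        map.put(0L, "zero");
        System.out.println(map.get(42L) + ", " + map.get(0L) + ", size=" + map.size());
    }
}
```

With a Map<Long, V>, every lookup by a primitive long key may box it into a Long first and each stored key is a separate heap object; the array-backed layout above avoids both, which is the general motivation usually given for primitive collections.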