OpenSearch Lucene Study Group Meeting - Monday, April 15th, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post (including video link in the comments): OpenSearch Lucene Study Group Meeting - Monday, April 1st, 2024

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

We didn’t have a meeting last week, as I was off for spring break, so we have two weeks’ worth of Lucene changes to discuss:

| Version | Category | Description | Link |
| --- | --- | --- | --- |
| Lucene 10.0.0 | API Changes | Convert `BooleanClause` class to record class. | https://github.com/apache/lucene/issues/13261 |
| Lucene 10.0.0 | API Changes | Remove Accountable interface on KnnVectorsReader. | https://github.com/apache/lucene/issues/13241 |
| Lucene 10.0.0 | API Changes | Removed deprecated constructors from DoubleField, FloatField, IntField, LongField, and LongPoint. Additionally, deprecated methods have been removed from ByteBuffersIndexInput, BooleanQuery and others. Please refer to MIGRATE.md for further details. | https://github.com/apache/lucene/issues/13262 |
| Lucene 10.0.0 | Improvements | Simplify bytes comparison as long comparison in NumericComparator. | https://github.com/apache/lucene/issues/13246 |
| Lucene 10.0.0 | Changes in Runtime Behavior | GITHUB#13264: IOContext now uses ReadAdvice#RANDOM by default for read operations. An implication is that `MMapDirectory` will use POSIX_MADV_RANDOM on POSIX systems. To fallback to OS default behaviour, pass system property via `-Dorg.apache.lucene.store.defaultReadAdvice=normal`. This may be useful on systems with lots of RAM as this increases read-ahead. | https://github.com/apache/lucene/issues/13244 |
| Lucene 10.0.0 | Changes in Runtime Behavior | Auto I/O throttling is now disabled by default on ConcurrentMergeScheduler. | https://github.com/apache/lucene/issues/13293 |
| Lucene 10.0.0 | Changes in Runtime Behavior | ConcurrentMergeScheduler now allows up to 50% of the threads of the host to be used for merging. | https://github.com/apache/lucene/issues/13293 |
| Lucene 9.11.0 | New Features | Expand support for new scalar bit levels for HNSW vectors. This includes 4-bit vectors and an option to compress them to gain a 50% reduction in memory usage. | https://github.com/apache/lucene/issues/13197 |
| Lucene 9.11.0 | New Features | Add ability for UnifiedHighlighter to highlight a field based on combined matches from multiple fields. | https://github.com/apache/lucene/issues/13268 |
| Lucene 9.11.0 | Improvements | Upgrade icu4j to version 74.2. | https://github.com/apache/lucene/issues/13239 |
| Lucene 9.11.0 | Improvements | Early terminate graph and exact searches of AbstractKnnVectorQuery to follow timeout set from IndexSearcher#setTimeout(QueryTimeout). | https://github.com/apache/lucene/issues/13202 |
| Lucene 9.11.0 | Improvements | Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself. This reduces code duplication and enables future development. | https://github.com/apache/lucene/issues/12966 |
| Lucene 9.11.0 | Optimizations | Made PointRangeQuery faster, for some segment sizes, by reducing the amount of virtual calls to IntersectVisitor::visit(int). | https://github.com/apache/lucene/issues/13149 |
| Lucene 9.11.0 | Optimizations | FloatTaxonomyFacets can now collect values into a sparse structure, like IntTaxonomyFacets already could. | https://github.com/apache/lucene/issues/12966 |
| Lucene 9.11.0 | Optimizations | Per-field doc values and knn vectors readers now use a HashMap internally instead of a TreeMap. | https://github.com/apache/lucene/issues/13284 |
| Lucene 9.11.0 | Bug Fixes | Aggregation facets no longer assume that aggregation values are positive. | https://github.com/apache/lucene/issues/12966 |
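
As a rough illustration of where the 50% memory reduction in the 4-bit HNSW vector entry above can come from: a 4-bit quantized value only needs half a byte, so two values can share one byte. The sketch below is a minimal, self-contained example of that idea. `NibblePacking`, `pack`, and `unpack` are hypothetical names made up for illustration, and the byte layout shown here is an assumption, not necessarily the layout Lucene uses.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Lucene): stores two 4-bit values per byte. */
final class NibblePacking {
  private NibblePacking() {}

  /** Packs values in [0, 15]: even indexes go to the high nibble, odd indexes to the low nibble. */
  static byte[] pack(byte[] values) {
    byte[] packed = new byte[(values.length + 1) / 2]; // half the bytes, rounded up
    for (int i = 0; i < values.length; i++) {
      int v = values[i];
      if (v < 0 || v > 15) {
        throw new IllegalArgumentException("value out of 4-bit range: " + v);
      }
      if ((i & 1) == 0) {
        packed[i / 2] |= (byte) (v << 4);
      } else {
        packed[i / 2] |= (byte) v;
      }
    }
    return packed;
  }

  /** Reverses pack(); length is the number of original values. */
  static byte[] unpack(byte[] packed, int length) {
    byte[] values = new byte[length];
    for (int i = 0; i < length; i++) {
      int b = packed[i / 2] & 0xFF; // read the byte as unsigned
      values[i] = (byte) ((i & 1) == 0 ? b >>> 4 : b & 0x0F);
    }
    return values;
  }

  public static void main(String[] args) {
    byte[] quantized = {3, 15, 0, 7, 9, 1};
    byte[] packed = pack(quantized);
    System.out.println(quantized.length + " bytes -> " + packed.length + " bytes");
    System.out.println(Arrays.equals(quantized, unpack(packed, quantized.length)));
  }
}
```

The trade-off in a scheme like this is that every read has to shift and mask to recover a value, which is presumably why compression is exposed as an option rather than always applied.
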
  1. Talked about query caching, including the possibility of count caching.
  2. Talked a fair bit about OpenSearch aggregations versus Lucene faceting, with reference to Lucene issue #12553, “[DISCUSS] Identifying Gaps in Lucene’s Faceting”. As a follow-up, @sandesh and others will comment on that issue to discuss ideas about how to share OpenSearch’s aggregations logic with Lucene.
  3. Talked about the madvise changes (the new `ReadAdvice#RANDOM` default in the list of changes above).
  4. Brief mention of early termination on BKD traversal when not scoring, similar in implementation to Lucene pull request #13199 by gf2121, “Break point estimate when threshold exceeded”.
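
To make the early-termination idea from item 4 a little more concrete, here is a minimal sketch of a range count over a tree that stops descending once a threshold is reached. It is not Lucene's BKD reader: `CappedRangeCount`, `count`, and `LEAF_SIZE` are hypothetical names, and the "tree" is simply a sorted `int[]` that is split in half recursively. The only point is the shape of the short-circuit.

```java
/** Hypothetical sketch (not Lucene code): range counting that stops once a cap is reached. */
final class CappedRangeCount {
  private CappedRangeCount() {}

  static final int LEAF_SIZE = 4; // assumed leaf size, chosen only for illustration

  /**
   * Counts values in [lo, hi] within sorted[from, to). Once the running total reaches cap,
   * the remaining subtrees are skipped, so the result is exact only when it is below cap.
   */
  static int count(int[] sorted, int from, int to, int lo, int hi, int cap) {
    if (from >= to || sorted[to - 1] < lo || sorted[from] > hi) {
      return 0; // cell is empty or disjoint from the range
    }
    if (sorted[from] >= lo && sorted[to - 1] <= hi) {
      return to - from; // cell is fully inside the range: no need to visit values
    }
    if (to - from <= LEAF_SIZE) {
      int n = 0;
      for (int i = from; i < to; i++) {
        if (sorted[i] >= lo && sorted[i] <= hi) {
          n++;
        }
      }
      return n;
    }
    int mid = (from + to) >>> 1;
    int left = count(sorted, from, mid, lo, hi, cap);
    if (left >= cap) {
      return left; // threshold reached: skip the right subtree entirely
    }
    return left + count(sorted, mid, to, lo, hi, cap - left);
  }

  public static void main(String[] args) {
    int[] sorted = new int[100];
    for (int i = 0; i < sorted.length; i++) {
      sorted[i] = i;
    }
    System.out.println(count(sorted, 0, sorted.length, 5, 95, Integer.MAX_VALUE)); // exact count
    System.out.println(count(sorted, 0, sorted.length, 5, 95, 10)); // stops early, at least 10
  }
}
```

When nothing needs to be scored, a caller often only needs to know whether the count crosses some threshold, so a capped result like this is enough and most of the tree is never visited.
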