OpenSearch Lucene Study Group Meeting - Monday, December 18th, 2023

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, December 4th

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

Based on feedback from the previous meeting, we’re moving the learning series earlier in the agenda, since most participants said they were attending in order to learn more about Lucene.

In the second half of the meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

There have been many Lucene changes since the last meeting. This is mostly a consequence of waiting two weeks between meetings instead of one week. Also, there were two urgent bugfixes following the 9.9.0 release, leading to the 9.9.1 release on December 16th.

The increased focus on cleanup in 9.10 (and some of the chatter on the lucene-dev list on the 9.9 release thread) leads me to believe that 9.10 may end up being the last feature release in the 9.x series (but maybe not).

Lucene 10.0.0 - API Changes:

  • Expressions module now uses MethodHandles to define custom functions. Support for custom classloaders was removed.
  • Remove TermInSetQuery ctors taking varargs param. SortedSetDocValuesField#newSlowSetQuery, SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection.
  • Performance improvements to MatchHighlighter and MatchRegionRetriever. MatchRegionRetriever can be configured to not load matches (or content) of certain fields and to force-load other fields so that stored fields of a document are accessed once. A configurable limit of field matches placed in the priority queue was added (allows handling long fields with lots of hits more gracefully). MatchRegionRetriever utilizes IndexSearcher's executor to extract hit offsets concurrently.
  • Remove deprecated DrillSideways#createDrillDownFacetsCollector extension method.
  • Ensure token position is always increased in PathHierarchyTokenizer and ReversePathHierarchyTokenizer and resulting tokens do not overlap.

Lucene 10.0.0 - Improvements:

  • Expressions module now uses JEP 371 "Hidden Classes" with JEP 309 "Dynamic Class-File Constants" to implement JavaScript expressions.

Lucene 10.0.0 - Bug Fixes:

  • Fix the declared Exceptions of Expression#evaluate() to match those of DoubleValues#doubleValue().

Lucene 9.10.0 - API Changes:

  • Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery, SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param.
  • Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated.

Lucene 9.10.0 - New Features:

  • Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest level.

Lucene 9.10.0 - Improvements:

  • Tighten synchronized loop in DirectoryTaxonomyReader#getOrdinal.
  • Avoid overflows and false negatives in int slice buffer filled-with-zeros assertion.
  • Refactor around NeighborArray to make it more self-contained.

Lucene 9.10.0 - Optimizations:

  • Introduce method to grow arrays up to a given upper limit and use it to reduce overallocation for DirectoryTaxonomyReader#getBulkOrdinals.

Lucene 9.10.0 - Bug Fixes:

  • Ensure #finish is called on all drill-sideways FacetsCollectors even when no hits are scored.
  • Address bug in TestDrillSideways#testCollectionTerminated that could occasionally cause the test to fail with certain random seeds.

Lucene 9.10.0 - Build:

  • GITHUB#12936, GITHUB#12937: Improve source file validation to detect incorrect UTF-8 sequences and forbid U+200B; enable errorprone DisableUnicodeInCode check.

Lucene 9.10.0 - Other:

  • Removing some dead code in CheckIndex.
  • Removing @lucene.experimental tags in testXXX methods in CheckIndex.
  • Cleaning up old references to Lucene/Solr.

Lucene 9.9.1 - Bug Fixes:

  • JVM SIGSEGV crash when compiling computeCommonPrefixLengthAndBuildHistogram
  • Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped
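
For anyone wondering what "grow arrays up to a given upper limit" looks like in practice, here is a rough sketch in plain Python. It is not Lucene's code: the function name `grow_in_range` and the growth policy (one eighth of extra headroom, at least 3 slots) are assumptions chosen for illustration. The point is only that over-allocation helps amortize repeated growth, but is wasted when the caller already knows the array can never need more than `max_len` entries.

```python
def grow_in_range(current_len, min_len, max_len):
    """Pick a new array length >= min_len, over-allocated but capped at max_len.

    Hypothetical helper for illustration; the growth policy is an assumption.
    """
    if min_len > max_len:
        raise ValueError("min_len exceeds max_len")
    if current_len >= min_len:
        return current_len  # already big enough, no reallocation needed
    # Assumed policy: add one eighth of headroom (at least 3 slots)
    # so that repeated small growth requests are amortized.
    oversized = min_len + max(3, min_len >> 3)
    # Never allocate past the known upper limit.
    return min(oversized, max_len)


unbounded = grow_in_range(0, 100, 1000)   # headroom applied
bounded = grow_in_range(0, 100, 105)      # headroom clipped by the upper limit
unchanged = grow_in_range(200, 100, 1000) # existing array is large enough
```

With a bulk lookup where the number of results is known up front, passing that count as the upper limit avoids allocating slots that can never be filled.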

TODO: Ask @Navneet about "Add support for similarity-based vector searches" by kaivalnp (Pull Request #12679, apache/lucene). Could it be interesting for us?
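
To make the two thresholds in that change concrete, here is a rough sketch in plain Python. It is not Lucene's implementation: it walks a single-level neighbor graph rather than a full HNSW graph, uses dot product as the similarity, assumes the entry point is already reasonably close to the query, and all names (`similarity_search`, `graph`, `entry`) are hypothetical. The traversal keeps going while the best unexplored candidate scores at least `traversal_similarity`, and it collects every visited vector scoring at least `result_similarity`.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity_search(vectors, graph, entry, query,
                      traversal_similarity, result_similarity):
    """Return (node, score) pairs with score >= result_similarity, best first."""
    visited = {entry}
    # heapq is a min-heap, so scores are negated to pop the best candidate first.
    candidates = [(-dot(vectors[entry], query), entry)]
    results = []
    while candidates:
        neg_score, node = heapq.heappop(candidates)
        score = -neg_score
        # Stop once the best remaining candidate is below the traversal threshold.
        if score < traversal_similarity:
            break
        if score >= result_similarity:
            results.append((node, score))
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(candidates, (-dot(vectors[neighbor], query), neighbor))
    return sorted(results, key=lambda pair: -pair[1])


# Toy data: four 2-d vectors and a small neighbor graph (made up for this example).
vectors = {0: (1.0, 0.0), 1: (0.9, 0.1), 2: (0.5, 0.5), 3: (0.0, 1.0)}
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

hits = similarity_search(vectors, graph, entry=2, query=(1.0, 0.0),
                         traversal_similarity=0.4, result_similarity=0.8)
# Nodes 0 and 1 are returned; node 2 is explored but scores below result_similarity.
```

Unlike a top-k search, the number of hits is not fixed in advance: it depends entirely on how many vectors clear `result_similarity`, which is why the lower `traversal_similarity` is needed to decide when to stop exploring.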


TODO: @reta, can you please look at "Rewrite JavaScriptCompiler to use modern JVM features (Java 17)" by uschindler (Pull Request #12873, apache/lucene) to see if it’s applicable for Painless?

This is an interesting feature added in Lucene… we already have a GitHub issue around this.