OpenSearch Lucene Study Group Meeting - Monday, December 18th, 2023

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, December 4th

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

Based on feedback from the previous meeting, we’re moving the learning series earlier in the agenda, since most participants said they were attending in order to learn more about Lucene.

In the second half of the meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

There have been many Lucene changes since the last meeting. This is mostly a consequence of waiting two weeks between meetings instead of one week. Also, there were two urgent bugfixes following the 9.9.0 release, leading to the 9.9.1 release on December 16th.

The increased focus on cleanup in 9.10 (and some of the chatter on the lucene-dev list on the 9.9 release thread) leads me to believe that 9.10 may end up being the last feature release in the 9.x series (but maybe not).

| Version | Category | Description | Link |
| --- | --- | --- | --- |
| Lucene 10.0.0 | API Changes | Expressions module now uses MethodHandles to define custom functions. Support for custom classloaders was removed. | https://github.com/apache/lucene/issues/12873 |
| Lucene 10.0.0 | API Changes | Remove TermInSetQuery ctors taking varargs param. SortedSetDocValuesField#newSlowSetQuery, SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection. | https://github.com/apache/lucene/issues/12243 |
| Lucene 10.0.0 | API Changes | Performance improvements to MatchHighlighter and MatchRegionRetriever. MatchRegionRetriever can be configured to not load matches (or content) of certain fields and to force-load other fields so that stored fields of a document are accessed once. A configurable limit of field matches placed in the priority queue was added (allows handling long fields with lots of hits more gracefully). MatchRegionRetriever utilizes IndexSearcher's executor to extract hit offsets concurrently. | https://github.com/apache/lucene/issues/12881 |
| Lucene 10.0.0 | API Changes | Remove deprecated DrillSideways#createDrillDownFacetsCollector extension method. | https://github.com/apache/lucene/issues/12855 |
| Lucene 10.0.0 | API Changes | Ensure token position is always increased in PathHierarchyTokenizer and ReversePathHierarchyTokenizer and resulting tokens do not overlap. | https://github.com/apache/lucene/issues/12875 |
| Lucene 10.0.0 | Improvements | Expressions module now uses JEP 371 "Hidden Classes" with JEP 309 "Dynamic Class-File Constants" to implement Javascript expressions. | https://github.com/apache/lucene/issues/12873 |
| Lucene 10.0.0 | Bug Fixes | Fix the declared Exceptions of Expression#evaluate() to match those of DoubleValues#doubleValue(). | https://github.com/apache/lucene/issues/12878 |
| Lucene 9.10.0 | API Changes | Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery, SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param. | https://github.com/apache/lucene/issues/12243 |
| Lucene 9.10.0 | API Changes | Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated. | https://github.com/apache/lucene/issues/12854 |
| Lucene 9.10.0 | New Features | Add support for similarity-based vector searches using ByteVectorSimilarityQuery and FloatVectorSimilarityQuery. Uses a new VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest level. | https://github.com/apache/lucene/issues/12679 |
| Lucene 9.10.0 | Improvements | Tighten synchronized loop in DirectoryTaxonomyReader#getOrdinal. | https://github.com/apache/lucene/issues/12870 |
| Lucene 9.10.0 | Improvements | Avoid overflows and false negatives in int slice buffer filled-with-zeros assertion. | https://github.com/apache/lucene/issues/12812 |
| Lucene 9.10.0 | Improvements | Refactor around NeighborArray to make it more self-contained. | https://github.com/apache/lucene/issues/12910 |
| Lucene 9.10.0 | Optimizations | Introduce method to grow arrays up to a given upper limit and use it to reduce overallocation for DirectoryTaxonomyReader#getBulkOrdinals. | https://github.com/apache/lucene/issues/12839 |
| Lucene 9.10.0 | Bug Fixes | Ensure #finish is called on all drill-sideways FacetsCollectors even when no hits are scored. | https://github.com/apache/lucene/issues/12558 |
| Lucene 9.10.0 | Bug Fixes | Address bug in TestDrillSideways#testCollectionTerminated that could occasionally cause the test to fail with certain random seeds. | https://github.com/apache/lucene/issues/12920 |
| Lucene 9.10.0 | Build | GITHUB#12936, GITHUB#12937: Improve source file validation to detect incorrect UTF-8 sequences and forbid U+200B; enable errorprone DisableUnicodeInCode check. | https://github.com/apache/lucene/issues/12931 |
| Lucene 9.10.0 | Other | Removing some dead code in CheckIndex. | https://github.com/apache/lucene/issues/11023 |
| Lucene 9.10.0 | Other | Removing @lucene.experimental tags in testXXX methods in CheckIndex. | https://github.com/apache/lucene/issues/11023 |
| Lucene 9.10.0 | Other | Cleaning up old references to Lucene/Solr. | https://github.com/apache/lucene/issues/12934 |
| Lucene 9.9.1 | Bug Fixes | JVM SIGSEGV crash when compiling computeCommonPrefixLengthAndBuildHistogram | https://github.com/apache/lucene/issues/12898 |
| Lucene 9.9.1 | Bug Fixes | Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped | https://github.com/apache/lucene/issues/12900 |
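
For anyone curious what "grow arrays up to a given upper limit" (the 9.10.0 optimization for DirectoryTaxonomyReader#getBulkOrdinals) might look like in practice, here is a rough sketch using only the JDK. The class name, the `growUpTo` helper, and the 1.5x growth policy are all made up for illustration; this is not the Lucene code, just the general idea of over-allocating for amortized growth while never exceeding a known maximum.

```java
import java.util.Arrays;

public class CappedGrowth {
    /**
     * Hypothetical helper (not a Lucene API): returns an array with at least
     * minLength elements, over-allocating for future growth but never
     * allocating more than maxLength elements.
     */
    public static int[] growUpTo(int[] array, int minLength, int maxLength) {
        if (minLength > maxLength) {
            throw new IllegalArgumentException(
                "minLength " + minLength + " exceeds maxLength " + maxLength);
        }
        if (array.length >= minLength) {
            return array; // already large enough, nothing to copy
        }
        // Assumed growth policy: ask for about 1.5x the required size,
        // then clamp to the upper limit so we never over-allocate past it.
        long oversized = (long) minLength + (minLength >> 1);
        int newLength = (int) Math.min(oversized, maxLength);
        return Arrays.copyOf(array, newLength);
    }

    public static void main(String[] args) {
        int[] ordinals = new int[4];
        // We need room for 8 entries and know there can never be more than 10.
        ordinals = growUpTo(ordinals, 8, 10);
        System.out.println(ordinals.length); // prints 10 (12 clamped to the limit)
    }
}
```

Without the cap, the same call would allocate 12 slots; when the upper bound is known up front, the extra slots are pure waste.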

TODO: Ask @Navneet about Add support for similarity-based vector searches by kaivalnp · Pull Request #12679 · apache/lucene · GitHub. Could be interesting for us?
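
To make the discussion a bit more concrete: the main difference from a k-NN query is that a similarity-based search returns every vector scoring above a threshold rather than a fixed number of top hits. The sketch below shows only that contract, as a brute-force scan in plain Java. It is not how the Lucene change works (that traverses the HNSW graph and also uses a `traversalSimilarity` cutoff); the class name, the use of dot product as the score, and treating the threshold as inclusive are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SimilarityThresholdSketch {
    /** Dot product, used here as the similarity score (an assumption). */
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Brute-force stand-in for a threshold search: returns the ids of all
     * vectors whose score against the query is at least resultSimilarity.
     * The number of hits depends on the data, not on a fixed k.
     */
    public static List<Integer> searchAbove(
            float[][] vectors, float[] query, float resultSimilarity) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < vectors.length; id++) {
            if (dot(vectors[id], query) >= resultSimilarity) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1f, 0f}, {0.9f, 0.1f}, {0f, 1f}};
        float[] query = {1f, 0f};
        System.out.println(searchAbove(vectors, query, 0.8f)); // prints [0, 1]
    }
}
```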

TODO: @reta – Can you please look at Rewrite JavaScriptCompiler to use modern JVM features (Java 17) by uschindler · Pull Request #12873 · apache/lucene · GitHub to see if it’s applicable for Painless? Also, maybe Rewrite JavaScriptCompiler to use modern JVM features (Java 17) by uschindler · Pull Request #12873 · apache/lucene · GitHub.
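
As background for that review, here is a small JDK-only sketch of what "using MethodHandles to define custom functions" can mean: functions are looked up once and stored as `MethodHandle` values keyed by name, instead of being resolved through a classloader. The class name, the `halve` function, and the registry shape are hypothetical and are not taken from the Lucene expressions module or from Painless.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.HashMap;
import java.util.Map;

public class MethodHandleFunctions {
    /** A custom function we might want to expose to an expression language. */
    public static double halve(double x) {
        return x / 2.0;
    }

    /** Builds a hypothetical name -> MethodHandle registry of unary functions. */
    public static Map<String, MethodHandle> functions()
            throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType unary = MethodType.methodType(double.class, double.class);
        Map<String, MethodHandle> fns = new HashMap<>();
        fns.put("halve", lookup.findStatic(MethodHandleFunctions.class, "halve", unary));
        fns.put("sqrt", lookup.findStatic(Math.class, "sqrt", unary));
        return fns;
    }

    /** Looks up a function by name and invokes it with a single argument. */
    public static double call(String name, double arg) {
        try {
            // invokeExact requires the call site to match (double)double exactly.
            return (double) functions().get(name).invokeExact(arg);
        } catch (Throwable t) {
            throw new IllegalStateException("invocation failed: " + name, t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call("halve", 9.0)); // prints 4.5
        System.out.println(call("sqrt", 9.0));  // prints 3.0
    }
}
```

The access check happens when the handle is looked up rather than on every call, which is one reason this style can replace classloader-based function resolution.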

This is an interesting feature added in Lucene… we already have a GitHub issue around this.