OpenSearch Lucene Study Group Meeting - Monday, November 27th

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, November 20th

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

Based on last week’s meeting, we’re moving the learning series part earlier in the agenda, since most participants said they were attending in order to learn more about Lucene.

In the second half of the meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

| Category | Description | Link |
| --- | --- | --- |
| API Changes | Add HumanReadableQuery, which takes a description parameter for debugging purposes. | https://github.com/apache/lucene/issues/12816 |
| API Changes | Consolidate FSTStore and BytesStore in FST. Created FSTReader, which contains the common methods of the two. | https://github.com/apache/lucene/issues/12709 |
| API Changes | Remove public constructor of FSTCompiler. Please use FSTCompiler.Builder instead. | https://github.com/apache/lucene/issues/12695 |
| API Changes | Remove FSTCompiler#getTermCount() and FSTCompiler.UnCompiledNode#inputCount. | https://github.com/apache/lucene/issues/12735 |
| API Changes | Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build. | https://github.com/apache/lucene/issues/12799 |
| Improvements | FSTCompiler can now approximately limit how much RAM it uses to share suffixes during FST construction using the suffixRAMLimitMB method. Larger values result in a more minimal FST (more common suffixes are shared). Pass Double.POSITIVE_INFINITY to use as much RAM as is needed to create a purely minimal FST. Inspired by this Rust FST implementation: https://blog.burntsushi.net/transducers | https://github.com/apache/lucene/issues/12542 |
| Optimizations | Use group-varint encoding for the tail of postings. | https://github.com/apache/lucene/issues/12782 |
| Build | Only enable support for tests.profile if jdk.jfr module is available in Gradle runtime. | https://github.com/apache/lucene/issues/12845 |
| Other | Add demo for faceting with StringValueFacetCounts over KeywordField and SortedDocValuesField. | https://github.com/apache/lucene/issues/12817 |
| Other | Refactor BKD HeapPointWriter to hide the internal data structure. | https://github.com/apache/lucene/issues/12762 |
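
For anyone unfamiliar with the group-varint encoding mentioned in the Optimizations entry, here is a minimal sketch of the general idea, not Lucene's actual implementation. It assumes the common layout where four integers share one header byte holding four 2-bit lengths, followed by each value in 1 to 4 little-endian bytes. The class and method names are hypothetical, and Lucene's byte layout may differ.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class GroupVarintSketch {

    /** Encodes exactly four ints: one header byte, then 1-4 bytes per value. */
    static byte[] encodeGroup(int[] values) {
        if (values.length != 4) {
            throw new IllegalArgumentException("expected exactly 4 values");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int header = 0;
        for (int i = 0; i < 4; i++) {
            // Store (byte length - 1) in two bits; the first value uses the highest bits.
            header |= (byteLength(values[i]) - 1) << (6 - 2 * i);
        }
        out.write(header);
        for (int v : values) {
            int numBytes = byteLength(v);
            for (int b = 0; b < numBytes; b++) {
                out.write((v >>> (8 * b)) & 0xFF); // little-endian byte order
            }
        }
        return out.toByteArray();
    }

    /** Decodes one group of four ints written by encodeGroup. */
    static int[] decodeGroup(byte[] bytes) {
        int header = bytes[0] & 0xFF;
        int pos = 1;
        int[] values = new int[4];
        for (int i = 0; i < 4; i++) {
            // The header tells us each length up front, so there is no per-byte continuation check.
            int numBytes = ((header >>> (6 - 2 * i)) & 0x3) + 1;
            int v = 0;
            for (int b = 0; b < numBytes; b++) {
                v |= (bytes[pos++] & 0xFF) << (8 * b);
            }
            values[i] = v;
        }
        return values;
    }

    /** Number of bytes needed to hold v when treated as an unsigned 32-bit value. */
    static int byteLength(int v) {
        if ((v & 0xFFFFFF00) == 0) return 1;
        if ((v & 0xFFFF0000) == 0) return 2;
        if ((v & 0xFF000000) == 0) return 3;
        return 4;
    }

    public static void main(String[] args) {
        int[] docDeltas = {3, 300, 70000, 5};
        byte[] encoded = encodeGroup(docDeltas);
        int[] decoded = decodeGroup(encoded);
        System.out.println(encoded.length + " bytes, round trip ok: "
                + Arrays.equals(docDeltas, decoded));
    }
}
```

Compared with a plain varint, the decoder reads all four lengths from a single byte rather than testing a continuation bit on every byte, which is generally why this style of encoding decodes faster.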

For Lucene issue #12542 (FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality), do we want to expose this as a setting (index setting?) in OpenSearch? It could be a new knob for expert users.

Can we also use this RAM limit in the indexing circuit breaker to get a better estimate of how much RAM an indexing request will take?

TODO (msfroh): Find and post the API that tells you what a query parsed to. Maybe we can use this HumanReadableQuery to make that API nicer?

Thanks, Saurabh – the API is _validate (the _validate/query endpoint).
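
As a rough illustration of calling that API, the sketch below builds (but does not send) a request to the _validate/query endpoint with explain=true, using only the JDK's HTTP client. The host (localhost:9200), the index name (my-index), the sample query, and the class and method names are all assumptions made for the example. With explain=true the response is expected to include an explanation of what the query parsed to.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ValidateQuerySketch {

    /** Builds, without sending, a _validate/query request that asks for an explanation. */
    static HttpRequest buildValidateRequest(String baseUrl, String index, String queryJson) {
        URI uri = URI.create(baseUrl + "/" + index + "/_validate/query?explain=true");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(queryJson))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical query; any query DSL body could be validated the same way.
        String body = "{\"query\":{\"query_string\":{\"query\":\"title:lucene AND body:fst\"}}}";
        HttpRequest request = buildValidateRequest("http://localhost:9200", "my-index", body);
        System.out.println(request.method() + " " + request.uri());
        // To actually send it against a running cluster:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```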

TODO: Ask Navneet about Lucene pull request #12799 (Make TaskExecutor cx public and use TaskExecutor for concurrent HNSW graph build, by shubhamvishu).

For OpenSearch we should consider exposing a knob to let users use multiple cores to merge their HNSW graphs.